I’m the SRE at StackPulse, and in this role I’m a customer, a product manager, and an evangelist, all in one. I’m often our first user of a new feature, and given my experience, I’m always opinionated about how a feature should work and the benefit it should provide. More importantly, I help our teams understand the ‘why’ behind site reliability engineering. We live in a customer-driven world, and when we consume software, we expect it to be reliable, available, and innovative. To make this happen, we have to move away from thinking about software as ‘servers and services’ and start thinking about what it represents to users: your medical records, your banking information. That change is something I evangelize constantly to our team.
To accelerate the adoption of this mentality, here are six key beliefs and practices that I employ.
Focus on the Big Picture
There are a lot of SRE mantras I preach: “9s don’t matter when your users aren’t happy” comes to mind. I’m constantly working with the team to zoom out and take a holistic view of the system to understand context and impact, instead of focusing on a specific log or alert. Some of this is working to reduce unneeded alerts and the fatigue they cause, but it’s also about a mindset change from logging and monitoring to observability.
Microservices Require Observability
Microservices brought new challenges, both in organizational structure and in the way we deploy and monitor our many services. Knowing at all times (even in retrospect) what your service is doing, where it is running, and what it’s communicating with is crucial to understanding complex systems and how they can fail. TL;DR: Observability tooling is a critical foundation for every company running a complex architecture.
SRE Tools vs. SRE Culture
Of course, there are go-to SRE tools that we use here at StackPulse (e.g., IDEs and APMs); however, it is our team and our culture that define what we do. We have built a team with varied backgrounds in development, product, marketing, sales, and more, and turned ourselves into champions of site reliability engineering. Going back to the shift from thinking about software services as their components to thinking about what they represent to users, we have built, and continue to build, a platform that helps keep that user experience reliable. This process requires internally promoting the right principles and culture across our teams to avoid burnout as well as outages. Ten years ago, you had to pick one. Our premise is that you no longer have to.
Dogfooding: StackPulse Style
If we have an incident, we write a code-based playbook for it. That playbook then goes into our playbook repo, because if we had this problem, it follows that other customers likely did as well. Before we built our Incident Hub into the StackPulse platform, we were running our retrospectives in Notion, so all the incident data we collected in StackPulse had to be exported for later review. When it came time to build that feature, I knew what would benefit our processes, and I worked with the team to establish the hypotheses we later used to build the Incident Management capabilities that we released to customers last year.
How to Use Automation
Ultimately, the goal of automation is to optimize our services, improve the lives of the team, and reduce human error. For us, that means using StackPulse itself and moving away from documented runbooks toward code-based playbooks. Documentation can go stale, be incorrect, or not be easily understood by the whole team. Code, on the other hand, just gets executed. It’s a lot easier.
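To make the contrast with a written runbook concrete, here is a minimal sketch of a runbook expressed as code. It is plain Python and is not StackPulse's playbook format. The `Step` and `run_playbook` helpers, the step names, the `disk_used_pct` field, and the 90% threshold are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical illustration only: this is NOT StackPulse's playbook format.


@dataclass
class Step:
    name: str
    action: Callable[[Dict], Dict]  # takes the incident context, returns updates


def check_disk_usage(ctx: Dict) -> Dict:
    # A real playbook would query the host; here the value comes from the
    # (assumed) alert payload that started the run.
    return {"disk_full": ctx["disk_used_pct"] >= ctx.get("threshold_pct", 90)}


def rotate_logs(ctx: Dict) -> Dict:
    # Placeholder remediation: only act when the previous step found a problem.
    if ctx["disk_full"]:
        return {"remediation": "rotated logs"}
    return {"remediation": "none needed"}


def run_playbook(steps: List[Step], ctx: Dict) -> Dict:
    """Run each step in order, merging its result into the context
    and recording which steps ran."""
    ctx = dict(ctx, trail=[])
    for step in steps:
        ctx.update(step.action(ctx))
        ctx["trail"].append(step.name)
    return ctx


DISK_PLAYBOOK = [
    Step("check disk usage", check_disk_usage),
    Step("rotate logs", rotate_logs),
]

if __name__ == "__main__":
    result = run_playbook(DISK_PLAYBOOK, {"disk_used_pct": 95})
    print(result["remediation"], result["trail"])
```

Because each step is an ordinary function, the playbook can be reviewed, versioned, and tested like any other code, which is much harder to do with a prose runbook.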
SRE Isn’t Just for Engineering
It’s easy to think about SRE as a role focused on engineering and operations, but I work with our sales and marketing teams as well to bring as much as we can into an SRE culture that puts a developer-driven reliability philosophy first. For example, our Salesforce.com instance is configured via code. Our marketing analytics is piped into the same BI tooling as our product analytics, and has been since before we had a marketing website. We invested the time early to engineer the right framework that enables the business and allows us to scale.
StackPulse offers a complete, well-integrated solution for managing reliability — including automated alert triggers, playbooks, and documentation helpers. Get a free trial to see what we have to offer.