The State of Incidents and Site Reliability: Q&A with Blameless SRE Architect Kurt Andersen

In the latest of an occasional series, today we hear from Kurt Andersen, SRE Architect at Blameless, discussing the evolution of incident management, current trends in site reliability affecting engineering teams, and how Blameless is addressing the needs of SRE and DevOps.

What are the biggest challenges companies are facing regarding incidents?

We continue to see many organizations, and senior leadership in particular, who don’t fully understand why incidents are inevitable. Without viewing incidents as opportunities to learn, it becomes difficult for teams to continuously build both tribal and system-level knowledge.

Even with the best preventative, proactive engineering, today’s highly interdependent, complex landscape means that your product or service will have outages and incidents of varying kinds and severity. This provides fertile ground for teams and organizations to learn and adapt, but only if the culture is willing to embrace the learning.

The other big challenge that groups face when trying to learn from incidents is the difficulty of collecting the relevant data across a wide span of different tools. Effectively capturing all the moving parts across the responding teams makes post-incident analysis far easier.

Communication is critical during and after an incident. At Blameless we introduced CommsFlow to keep stakeholders updated as the reliability of services and applications changes. CommsFlow makes incident communication systematic and unburdens on-call teams, removing distractions so they can focus on fixing the issue at hand.

For us, communication is paramount: it must be timely and relevant, with the ultimate goal of streamlining all communication during and after each incident. Internal teams need to know what’s happening as the incident status changes, and just as importantly, so do customers and partners. With the right tooling, customizable templates, and integrations with messaging tools, updates are sent automatically at workflow transitions or as reminders for on-call teams.
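To make workflow-transition updates concrete, here is a minimal sketch, assuming a hypothetical set of incident statuses and a generic webhook endpoint; it illustrates the general pattern rather than CommsFlow’s actual implementation.

```python
# Minimal sketch: send an automatic stakeholder update whenever an
# incident transitions between workflow states. The statuses, message
# templates, and webhook URL are hypothetical, not CommsFlow internals.
import json
import urllib.request

STATUS_TEMPLATES = {
    "investigating": "We are investigating an issue affecting {service}.",
    "identified": "The cause of the {service} issue has been identified.",
    "resolved": "The {service} issue is resolved. A retrospective will follow.",
}

WEBHOOK_URL = "https://example.com/hooks/status-updates"  # placeholder endpoint

def notify_transition(service: str, new_status: str) -> None:
    """Post a templated update to the stakeholder channel on each transition."""
    message = STATUS_TEMPLATES[new_status].format(service=service)
    payload = json.dumps({"text": message}).encode()
    request = urllib.request.Request(
        WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)

# Example: the incident moves from "investigating" to "identified".
notify_transition("checkout-api", "identified")
```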

What is the number one thing that makes companies vulnerable to incidents?

There is no one thing that makes companies vulnerable to incidents, but rather many, many things. Complex systems, especially cloud-native ones, combined with what Jeff Bezos calls the “divine discontent” of customers, create multiple opportunities for incidents. We often refer to root cause analysis when troubleshooting an issue, but the term is misleading because there is generally no single root cause.

Even if your product or service perfectly fulfills the needs of your users at this moment, those needs will be different with every passing moment. David Woods expresses this as the Law of Stretched Systems: “every system is stretched to operate at its capacity... as soon as there is some improvement, some new technology, we exploit it to achieve a new intensity and a new tempo of activity” (Woods and Hollnagel, Joint Cognitive Systems: Patterns in Cognitive Systems Engineering, 2006, p. 171).

What can companies do to embrace the blameless culture mindset?

The biggest cultural and psychological barrier to learning is blame. To overcome this, it is important to decouple actions and their consequences from people’s existential being. If a team makes a mistake, it does not mean that the team is a mistake. Some companies have implemented “failure awards,” such as Etsy’s three-armed sweater award, which helps shift attitudes about failures so they are seen as learning opportunities.

What should SREs be doing to protect their company?

The keys to reliability include:

  • Effective monitoring and observability, because you need to measure, track, and iterate as the system continually grows and changes
  • Good architectural practices, so you can scale smoothly over time
  • Well-crafted, customer-centric service level objectives (SLOs), to stay focused on the most important parts of your service (see the sketch after this list)
  • Clear, concise processes for incident response, with time taken to feed the learnings back into improving service and system reliability and leveling up all team members.
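To make the SLO bullet concrete, here is a minimal sketch of the error-budget arithmetic behind a simple availability SLO; the target, window, and traffic figures are hypothetical, purely for illustration.

```python
# Hypothetical example: computing the error budget for an availability
# SLO. All numbers are illustrative, not from any particular service.

SLO_TARGET = 0.999            # 99.9% of requests should succeed
WINDOW_DAYS = 30              # rolling 30-day SLO window

total_requests = 48_000_000   # requests served in the window
failed_requests = 31_000      # requests that violated the SLI

# The error budget is the fraction of requests allowed to fail.
allowed_failures = total_requests * (1 - SLO_TARGET)   # 48,000 requests
budget_consumed = failed_requests / allowed_failures   # ~64.6%

print(f"Error budget consumed: {budget_consumed:.1%}")
print(f"Failures remaining before breach: {allowed_failures - failed_requests:,.0f}")
```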

Give us an update on Blameless and its work with SRE and DevOps.

We generally start working with the DevOps teams inside an organization, usually beginning with the teams who are on-call. Most modern engineering teams rotate on-call across a unified group - so it’s not only operations. If a subject matter expert (SME) is required, those individuals are easily added to a Slack or MS Teams channel once you identify the right people. Blameless makes this process highly efficient and seamless: we auto-create the Slack channel for the incident and add all relevant team members so they can quickly start troubleshooting. Troubleshooting demands focused time from engineers, so Blameless orchestrates the entire process, collecting critical data along the path to resolution. The key differentiating value is integrating with the tools the team relies on - spanning alerting, messaging, video conferencing, APM, ticketing, observability, and analytics - because you need to stitch all those tools together with a golden thread of data to make better sense of what’s occurring.
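To make the channel-creation step concrete, here is a minimal sketch using Slack’s official slack_sdk client. The channel naming scheme, token handling, and helper function are assumptions for illustration, not Blameless’s actual implementation.

```python
# Illustrative sketch: auto-create a dedicated Slack channel for an
# incident and pull in responders. The #inc-<id> naming convention and
# the SLACK_BOT_TOKEN variable are hypothetical.
import os
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: str, responder_ids: list[str]) -> str:
    """Create #inc-<id>, invite responders, and post a kickoff message."""
    try:
        channel = client.conversations_create(name=f"inc-{incident_id}")
        channel_id = channel["channel"]["id"]
        client.conversations_invite(channel=channel_id, users=",".join(responder_ids))
        client.chat_postMessage(
            channel=channel_id,
            text=f"Incident {incident_id} declared. Triage starts here.",
        )
        return channel_id
    except SlackApiError as err:
        raise RuntimeError(f"Slack setup failed: {err.response['error']}") from err
```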

Often organizations want to improve the efficiency of their on-call process from beginning to end. Most importantly, they want to:

  1. Get away from manual work and point solutions: on-call engineers don’t want to spend time manually stitching together playbook steps. When multiple tools don’t talk to each other, there’s no single thread or centralized place to manage the incident to completion.
  2. Streamline who is involved in triaging an incident: invite the right team members at the right time. Don’t wake up everyone, or irrelevant team members; it’s simply not necessary and causes toil and burnout.
  3. Learn from incidents: everyone levels up, systems improve, reliability gets better over time, and the entire team can focus on innovation by delivering new releases. It’s critical to close out incidents in the right way and document exactly what happened.
  4. Use data insights: they are critical not only for helping the team learn but for keeping the entire company informed as service status changes. Product teams learn how reliable the service is, especially for critical or new features. Go-to-market teams and e-staff need insights to make top-line decisions. And of course, end customers need to know as they continue to invest, or as the business expands into new markets or add-ons to an existing product, which is critical for growth.

What are some site reliability trends you’re seeing for the rest of 2022?

Every year we take a step back and look at the big picture of the market to spot emerging trends. Often we experience new trends first-hand, as our teams work directly with organizations both small and large across multiple sectors.

Our process: we hold internal meetings with our engineering and SRE teams and with our customer success team members. Additionally, we talk to industry analysts and thought leaders; we like to stay ahead of market developments and requests from our customers.

As for trends in site reliability, they include, but are not limited to, the following:

  1. A growing sense of urgency to be more proactive in dealing with incidents and outages. Given headwinds like the pandemic, businesses are motivated to invest in reliability capabilities with even greater urgency.
  2. Broadening scope for SREs. If SREs today are like mechanics, fixing cars when they crash, they will become more like civil engineers, focusing on designing the roads.
  3. A holistic understanding of users. Organizations will develop a deeper, more holistic understanding of who their users are.
  4. Finding the true potential of SLOs. As more and more orgs start to envision the full potential of the SLO, they’ll also come to value other aspects of SRE that support and inform the SLO.

You say that Blameless is the "backbone for modern engineering teams." Why?

We use the metaphor of a backbone because it’s the centralized place where all team members communicate across runbook steps, carrying out tasks and follow-up actions as they debug the problem. Blameless orchestrates every step by integrating with the day-to-day tools DevOps teams use - PagerDuty, Jira, StatusPage, MS Teams, Slack, Opsgenie, and so on. It’s the golden thread running through all critical steps from beginning to end.

While the on-call team is doing the deep thinking, assessing, and fixing the problem, Blameless runs in the background, continuously updating and communicating at the critical milestone steps.
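As an illustration of the golden-thread idea, here is a minimal sketch that merges events from different tools into one chronological incident timeline; the event shape and source names are hypothetical, not the Blameless data model.

```python
# Illustrative sketch: normalize events from different tools into a
# single incident timeline. Sources and fields are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class TimelineEvent:
    timestamp: datetime
    source: str        # e.g. "pagerduty", "slack", "jira"
    summary: str

def build_timeline(*event_streams: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge per-tool event streams into one chronological timeline."""
    merged = [event for stream in event_streams for event in stream]
    return sorted(merged, key=lambda event: event.timestamp)

# Example: stitch a PagerDuty alert and a Slack message together.
alerts = [TimelineEvent(datetime(2022, 6, 1, 14, 2, tzinfo=timezone.utc),
                        "pagerduty", "High error rate on checkout service")]
chat = [TimelineEvent(datetime(2022, 6, 1, 14, 5, tzinfo=timezone.utc),
                      "slack", "#inc-1234 channel created, responders paged")]

for event in build_timeline(alerts, chat):
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.summary}")
```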

How does Blameless look at reliability?

We really believe reliability starts with an agreed mindset and discipline that extends beyond just the engineering team.

The company as a whole has to recognize that the level of reliability customers expect is directly correlated with their investment and loyalty, and that meeting it requires taking reliability seriously and investing across multiple functions of the business, from product to engineering to support and success.

Operations teams shouldn’t be left with the burden of owning reliability, hamster-wheeling their way through yet another issue or incident. Of course, not all incidents are severe or cause a total outage, but a bad experience is still far from ideal for the end customer. Many inferior experiences can actually be worse than one sev-1 or outage.

We tend to break down the reliability journey like this:

  1. Incident management: the entire process flow and data from beginning to end, with a detailed retrospective report to capture learnings.
  2. SLOs: these are critical to align on. They capture what the team believes are the most critical parts of the service that customers rely on. Focusing on those KPIs (so to speak) streamlines the team’s attention and helps avoid treating everything with equal weight, which, let’s face it, is unscalable and unrealistic.
  3. Culture: an equally important aspect of reliability is a culture, across all teams, focused on maintaining these objectives - that is, adopting a blameless culture mindset. It’s about all teams working together, lifting together, and learning and growing. We believe it’s the only way forward.
  4. Analytics: finally, by analyzing how teams are performing, everyone improves and learns. You need the right data sets to inform where to improve and where to invest going forward.

What do companies get wrong about reliability?

Reliability is not only about incident response; it’s a business imperative. If your service is unreliable, customers will not adopt (or buy) it, and over time that degrades and damages the brand. Markets are too competitive, and customers have high expectations. The cost of switching is low in many cases, so it’s very easy to lose customers to the next best offering.

The other thing that companies tend to get wrong is treating reliability as a quick fix rather than a journey. It constantly evolves and changes as the business grows: new team members are onboarded, and as services or products change, SLOs will need adjusting. It’s an area that companies need to focus on and invest in continuously as the reliability program matures.

Finally, I would add that incidents are a way of learning and improving, so embracing how you do that is critical. Never expect incidents to go away. They may change shape, but they will always take place.


Thanks to Kurt for taking the time to share his thoughts with the OpsMatters community this week. You can learn more about how Blameless enables real-time incident management here.