Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

A Day in the Life: Intelligent Observability at Work with a Super SRE

After we’d fixed Aparna’s network issue, James came to see me at my desk. Masks on, socially distanced and all that, but it was nice to have some face-to-face time. James is cool – that dry British humor and not your classic IT Ops dude. He’s been here forever and mentored me when the CIO, Charlie, hired me as the first SRE here a year or so ago. I lucked out really.

How to Scale for Reliability and Trust

As more people depend on your product, reliability expectations tend to grow. For a service to continue succeeding, it has to be one customers can rely upon. At the same time, as you bring on more customers, the technical demands put on your service increase as well. Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

How to Analyze Contributing Factors Blamelessly

SRE advocates addressing problems blamelessly. When something goes wrong, don’t try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don’t fear blame, leading to more initiative and innovation. Learning everything you can from incidents is a challenge.

It's all Chaos! And it Makes for Resilience at Scale

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.

How to Build an SRE Team with a Growth Mindset

The biggest benefit of SRE isn’t always the processes or tools, but the cultural shift. Building a blameless culture can profoundly change how your organization functions. Your SRE team should be your champions for cultural development. To drive change, SREs need to embody a growth mindset. They need to believe that their own abilities and perspectives can always grow, and encourage this mindset across the organization.

How We Built and Use Runbook Documentation at Blameless

Even if you don’t notice, you are executing runbooks everyday, all the time. When you have an incident in your day-to-day operations, you follow a series of ordered and connected steps to solve it. For instance, if you lose your internet connection, you will follow a series of steps to resolve that issue: This could be different depending on your method, but you have the idea.

SRE as Organizational Transformation: Lessons from Activist Organizers

In the software industry’s recent past, the biggest disruptive wave was Agile methodologies. While Site Reliability Engineering is still early in its adoption, those of us who experienced the disruptive transformation of Agile see the writing on the wall: SRE will impact everyone. Any kind of major transformation like this requires a change in culture, which is a catch-all term for changing people’s principles and behaviors.