SRE

The latest News and Information on Service Reliability Engineering and related technologies.

The Business Case for Observability and Site Reliability Engineering

Unlike traditional IT Ops, the role of the SRE isn’t simply focused on finding and solving technical problems. The big win for today’s SREs is supporting the organization’s strategic innovation initiatives. With the appropriate observability capabilities, it’s possible to quantify the value that software infrastructure contributes to this innovation effort.

Implementing SRE at the largest online retailer of NL and Belgium w/ Bart Enkelaar (bol.com) | EP #5

For the fifth episode of the StackPod, we invited Bart Enkelaar. Bart is a lead SRE at the largest online retailing platform in the Netherlands and Belgium: bol.com. He's been a backend engineer for 13 years and is now responsible for setting up site reliability engineering across more than a hundred DevOps teams. In this episode, Bart and Anthony talk about.

Canary Deployments | The Benefits of an Iterative Approach

At Blameless, we want to embrace all the benefits of the SRE best practices we preach. We’re proud to announce that we’ve started using a new system of feature flagging with canaried and iterative rollouts. This is a system where new releases are broken down and flagged based on the features each part of the release implements. Then, an increasing subset of users are given access to an increasing number of features.

The Importance of Observability for the SRE

The term Site Reliability Engineer (SRE) first appeared in Google in the early 2000s. In Google’s 2016 SRE Book, Benjamin Treynor Sloss wrote that, generally speaking, “an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).” This means that the SRE teams at Google decide how a system should run in production as well as how to make it run that way.

SRE and the Practice of Practice

Part of the trepidation of being on-call is encountering unfamiliar emergency scenarios where we are surprised by suddenly not knowing how to do our jobs. We feel lost and alone, complicated by the world around us, powerless to resolve or even mitigate the problem. On-call need not be a solo affair full of fear and anxiety. There are ways we can employ practice and open collaboration outside of incidents to prepare us better.

The Universal Language: Reliability for Non-Engineering Teams

We talk about reliability a lot from the context of software engineering. We ask questions about service availability, or how important it is for specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application is something that impacts the entire business with significant costs. A mindset of putting reliability first is a business imperative that all teams should share.