SRE Best Practices

Sponsored Post

SRE Best Practices

Site Reliability Engineering (SRE) is a practice that emerged at Google because of its need for highly reliable and scalable systems. SRE unifies operations and development teams and implements DevOps principles to ensure system reliability, scalability, and performance. 

There’s plenty of documentation on tactics for adopting automation and implementing infrastructure as code, but practical ops-focused SRE best practices based on real-world experience are harder to find. This article will explore 6 SRE best practices based on feedback from SREs and technical subject matter experts. Here is a list of topics we will cover.

SRE Best Practice Benefit
Define the role of the SRE Removes role ambiguity and clarifies responsibilities.
Automate toil and make time for strategic tasks Emphasizes automation of simple tasks and enables humans to focus on more complex work.
Monitor using SLIs and SLOs Imrpoves visibility and helps determine if SLAs are met.
Maintain a transparent status page Summarizes infrastructure performance and availability for all stakeholders.
Categorize incident severities Helps quantify incident impact and prioritize incident management tasks.
Conduct post-mortems and share them publicly Encourages transparency and continuous learning.

SRE Best Practices

The six SRE best practices below are based on feedback from experienced SREs and focus on the operational side of site reliability engineering. 

Define the role of the SRE

The Site Reliability Engineer, also called SRE; has several responsibilities:

  • Designing systems to monitor, automate, and achieve the highest uptime with the lowest operational effort
  • Enabling developers to iterate and move fast simultaneously
  • Incident management
  • Performing root cause analysis (RCA)
  • Conducting post-mortems (more on these later in the article)
  • Creating documentation to minimize tribal knowledge

SREs should spend most of their time automating their tasks to avoid having to be constantly working “toil”. Toil is a catchall term for operational tasks that involve repetitive manual configuration or lack long-term strategic value. Without automation, toil requires the engineering team’s time. With automation, engineers can focus on more complex tasks. 

This diagram represents the optimal time allocation for an SRE engineer. 

The two categories of SRE work. (Source)

Automate toil and leave time for strategic tasks

To avoid wasting valuable engineering time, SREs should work on automating every repetitive task, so teams focus less on toil and more on innovation. SREs use scripts, programs, and frameworks to automate and monitor those tasks. 

Within high-performing teams, eliminating toil is a core SRE function. From a tactical perspective, there are many ways to implement this best practice. The key is to limit wasting human time spent working on simple things automation can handle. 

Monitor using SLIs and SLOs

Effective monitoring is a crucial part of SRE. Metrics should be as close to the user as possible since most businesses care more about user experience. Organizations should define their most important metrics. Then, SREs use these metrics to build the three key indicators: SLIs, SLOs, and SLAs.

Key Indicator Goal Stakeholders
SLIs (Service Level Indicators) Collect metrics in a standardized way to gain insights into the system's performance Development and product team
SLOs (Service Level Objectives) Set the uptime objective for the company Development team, product team, and company executives
SLAs (Service Level Agreement) Set the expectations for the general public about the reliability of your services Clients, consumers, and the general public

SLIs: Service Level Indicators

SLIs are used to collect metrics in standardized ways. Here is a breakdown of common SLI types. 

Type of SLI Description
Availability Percentage of requests that resulted in a successful response.
Latency Percentage of requests that returned faster than the minimum threshold.
Quality Percentage of requests that were served in a non-optimal manner due to service affectation.
Freshness Percentage of data that was successfully updated under the minimum threshold.

SLOs: Service Level Objectives

SLOs are the goals the organization must accomplish and are formulated using the service level indicators (SLI) explained in the previous section. These should be published internally in a place easily accessible for technical and non-technical stakeholders.

SLAs: Service Level Agreements

These are contracts with clients/consumers about what to expect from the service, usually legally bonding and with monetary implications if not met.

Maintain a transparent status page

Customers need to understand the system's status at all times. If there's an outage on a system, the customer has to know about it as soon as possible. That helps build trust and prevents them from troubleshooting an issue they cannot control.

Status pages reflect the status of services in real-time. They should be clear and concise and have a color-coded indicator for each service exposed to customers. In case of failure, it should immediately report which services are failing and why. It is always great to accompany it with an email or RSS notification. 

The Squadcast status page at https://status.squadcast.com.

Categorize incident severities

With enough time and complexity, errors happen. When they do, they must be addressed in an organized manner.

Incidents have different severities: generally, those are P0, P1, P2, and P3. Severity determines the action to be taken and response time.

Severity Examples Action Response time
P0 (Critical) The site is unavailable for one or several reasons: DDoS attack, wrong configuration, bad deployment, or third-party incident). It can also be related to a security issue, such as PII exposure. Page (push to on-call, call to action, email, Slack, War Room). Most of the time, several teams are involved, with multiple stakeholders. Engineers do RCA in real-time. Immediate (within 5 minutes)
P1 (Major) The site is partially affected due to one or more services failing or a provider incident. This affection could also be intermittent. Page (push to on-call, email). It usually involves fewer teams than a P0, but it has to be solved relatively fast to prevent a deteriorated user experience. Page (push to on-call, email). It usually involves fewer teams than a P0, but it has to be solved relatively fast to prevent a deteriorated user experience. Fast (within 20-30 minutes)
P2 (Minor) Some of the site’s non-critical functionalities are affected, like recommendations not loading correctly, some images not showing up, or loading too slowly. Slack, email, and notify a single team. See if there is easy remediation and if there is a fix to be applied. If not, sometimes it could be delayed until the next working day. An item should be placed in the backlog and prioritized accordingly. Standard (within a few days)
P3 (Irrelevant/Bug) The incident is not affecting users directly, or users may not even be aware of it, like an elevated error rate, which the client applications retries on. Notification channels may include email or Slack, but do not require an immediate response. SREs should review during working hours. Slow (within a few days or weeks)

Conduct post-mortems and share them publicly 

Shortly after an incident, SREs should do two things:

  • Address the issues: Critical errors are patched or hot-fixed in an improvised way, and it is usually not a permanent solution. If that is the case, those should be placed in the backlog to be revisited by the development teams. SREs should also review issues not fixed on-call during working hours.
  • Draft a post-mortem: Post-mortems are a briefing of what happened during the incident. These help us get all the information from the incident, what happened, why, and how to prevent them in the future. Thus, every post-mortem should have clear documentation and action items placed in a backlog and prioritized according to the severity.

It is also a great idea to share post-mortems with the public since it helps bring transparency and strengthen their trust.