SRE Best Practices
Site Reliability Engineering (SRE) is a practice that emerged at Google out of its need for highly reliable and scalable systems. SRE unifies development and operations teams and applies DevOps principles to ensure system reliability, scalability, and performance.
There's plenty of documentation on tactics for adopting automation and implementing infrastructure as code, but practical ops-focused SRE best practices based on real-world experience are harder to find. This article explores six SRE best practices based on feedback from SREs and technical subject matter experts. Here is a summary of the topics we will cover.
| SRE Best Practice | Benefit |
| --- | --- |
| Define the role of the SRE | Removes role ambiguity and clarifies responsibilities. |
| Automate toil and make time for strategic tasks | Emphasizes automation of simple tasks and enables humans to focus on more complex work. |
| Monitor using SLIs and SLOs | Improves visibility and helps determine if SLAs are met. |
| Maintain a transparent status page | Summarizes infrastructure performance and availability for all stakeholders. |
| Categorize incident severities | Helps quantify incident impact and prioritize incident management tasks. |
| Conduct post-mortems and share them publicly | Encourages transparency and continuous learning. |
SRE Best Practices
The six SRE best practices below are based on feedback from experienced SREs and focus on the operational side of site reliability engineering.
Define the role of the SRE
The site reliability engineer (SRE) has several responsibilities:
- Designing systems to monitor, automate, and achieve the highest uptime with the lowest operational effort
- Enabling developers to iterate and move fast without sacrificing reliability
- Incident management
- Performing root cause analysis (RCA)
- Conducting post-mortems (more on these later in the article)
- Creating documentation to minimize tribal knowledge
SREs should spend most of their time automating tasks so they are not constantly doing "toil". Toil is a catchall term for operational tasks that involve repetitive manual configuration or lack long-term strategic value. Without automation, toil consumes the engineering team's time. With automation, engineers can focus on more complex tasks.
This diagram represents the optimal time allocation for an SRE.
Automate toil and make time for strategic tasks
To avoid wasting valuable engineering time, SREs should automate every repetitive task so teams focus less on toil and more on innovation. SREs use scripts, programs, and frameworks to automate and monitor those tasks.
Within high-performing teams, eliminating toil is a core SRE function. From a tactical perspective, there are many ways to implement this best practice. The key is to avoid wasting human time on simple work that automation can handle.
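As a concrete illustration, here is a minimal sketch of automating one common piece of toil: pruning old log files that would otherwise be cleaned up by hand. The directory path and retention period are illustrative assumptions, not prescriptions.

```python
import time
from pathlib import Path

LOG_DIR = Path("/var/log/myapp")  # hypothetical application log directory
RETENTION_DAYS = 14               # assumed retention policy

def prune_old_logs(log_dir: Path, retention_days: int) -> list[Path]:
    """Delete *.log files older than the retention window; return what was removed."""
    if not log_dir.is_dir():
        return []
    cutoff = time.time() - retention_days * 86_400  # seconds per day
    removed = []
    for path in log_dir.glob("*.log"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed

if __name__ == "__main__":
    for path in prune_old_logs(LOG_DIR, RETENTION_DAYS):
        print(f"pruned {path}")
```

Scheduled via cron or a CI job, a script like this turns a recurring manual chore into a task no human has to think about.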
Monitor using SLIs and SLOs
Effective monitoring is a crucial part of SRE. Metrics should be measured as close to the user as possible, because user experience is what most businesses care about most. Organizations should define their most important metrics. Then, SREs use these metrics to build the three key indicators: SLIs, SLOs, and SLAs.
| Key Indicator | Goal | Stakeholders |
| --- | --- | --- |
| SLIs (Service Level Indicators) | Collect metrics in a standardized way to gain insights into the system's performance | Development and product teams |
| SLOs (Service Level Objectives) | Set the uptime objectives for the company | Development team, product team, and company executives |
| SLAs (Service Level Agreements) | Set expectations for the general public about the reliability of your services | Clients, consumers, and the general public |
SLIs: Service Level Indicators
SLIs are used to collect metrics in standardized ways. Here is a breakdown of common SLI types.
| Type of SLI | Description |
| --- | --- |
| Availability | Percentage of requests that resulted in a successful response. |
| Latency | Percentage of requests that returned faster than a defined threshold. |
| Quality | Percentage of requests that were served in a degraded (non-optimal) manner because the service was impaired. |
| Freshness | Percentage of data that was successfully refreshed within the freshness threshold. |
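To make the availability and latency rows concrete, here is a hedged sketch of computing those two SLIs over a window of request records. The record shape and the 300 ms threshold are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # server-side latency in milliseconds

def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that got a successful (non-5xx) response."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)

def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served faster than the latency threshold."""
    fast = sum(1 for r in requests if r.duration_ms < threshold_ms)
    return fast / len(requests)

window = [Request(200, 87.0), Request(200, 412.0), Request(503, 95.0)]
print(f"availability: {availability_sli(window):.2%}")      # 66.67%
print(f"latency < 300 ms: {latency_sli(window, 300):.2%}")  # 66.67%
```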
SLOs: Service Level Objectives
SLOs are the goals the organization must accomplish and are formulated using the service level indicators (SLIs) explained in the previous section. They should be published internally in a place easily accessible to technical and non-technical stakeholders.
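A worked example helps here: a 99.9% availability SLO over a 30-day window implies an "error budget" of roughly 43 minutes of allowed downtime. The sketch below assumes a simple 30-day window; the SLO value is illustrative.

```python
SLO = 0.999                          # assumed availability objective
MINUTES_PER_30_DAYS = 30 * 24 * 60   # 43,200 minutes

error_budget_minutes = (1 - SLO) * MINUTES_PER_30_DAYS
print(f"error budget: {error_budget_minutes:.1f} minutes per 30 days")
# error budget: 43.2 minutes per 30 days
```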
SLAs: Service Level Agreements
These are contracts with clients/consumers about what to expect from the service. They are usually legally binding and carry monetary implications if they are not met.
Maintain a transparent status page
Customers need to know the system's status at all times. If there is an outage, customers should learn about it as soon as possible. This builds trust and prevents them from troubleshooting an issue they cannot control.
Status pages reflect the status of services in real time. They should be clear and concise, with a color-coded indicator for each customer-facing service. In case of failure, the page should immediately report which services are failing and why. It is also a good idea to pair the status page with email or RSS notifications.
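As one possible implementation detail, a status page is often backed by a machine-readable payload like the sketch below; the service names and state values are illustrative assumptions.

```python
import json
from datetime import datetime, timezone

# Hypothetical per-service states, one entry per customer-facing service.
SERVICES = {
    "api": "operational",
    "checkout": "degraded",
    "search": "operational",
}

def status_payload() -> str:
    """Serialize per-service status plus an overall indicator for the page."""
    overall = "operational" if all(
        state == "operational" for state in SERVICES.values()
    ) else "degraded"
    return json.dumps(
        {
            "updated_at": datetime.now(timezone.utc).isoformat(),
            "overall": overall,
            "services": SERVICES,
        },
        indent=2,
    )

print(status_payload())
```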
Categorize incident severities
With enough time and complexity, errors happen. When they do, they must be addressed in an organized manner.
Incidents have different severities: generally P0, P1, P2, and P3. The severity determines the action to be taken and the response time.
| Severity | Examples | Action | Response time |
| --- | --- | --- | --- |
| P0 (Critical) | The site is unavailable for one or several reasons: a DDoS attack, a misconfiguration, a bad deployment, or a third-party incident. It can also be a security issue, such as PII exposure. | Page (push to on-call, call to action, email, Slack, war room). Most of the time, several teams and multiple stakeholders are involved. Engineers perform RCA in real time. | Immediate (within 5 minutes) |
| P1 (Major) | The site is partially affected due to one or more services failing or a provider incident. The impact may also be intermittent. | Page (push to on-call, email). It usually involves fewer teams than a P0, but it has to be resolved relatively quickly to prevent a degraded user experience. | Fast (within 20-30 minutes) |
| P2 (Minor) | Some of the site's non-critical functionality is affected, such as recommendations not loading correctly, some images not showing up, or pages loading too slowly. | Notify a single team via Slack and email. Check whether there is an easy remediation or a fix to apply; if not, the work can sometimes wait until the next working day. An item should be placed in the backlog and prioritized accordingly. | Standard (within a few days) |
| P3 (Irrelevant/Bug) | The incident does not affect users directly, or users may not even notice it, such as an elevated error rate that client applications retry through. | Notification channels may include email or Slack, but no immediate response is required. SREs should review during working hours. | Slow (within a few days or weeks) |
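The table above translates naturally into a routing policy. Below is a hedged sketch mapping severities to notification channels and response-time targets; the channel names are placeholders, not a specific tool's API.

```python
from datetime import timedelta

# Illustrative policy derived from the severity table above.
SEVERITY_POLICY = {
    "P0": {"channels": ["page", "war_room", "email", "slack"],
           "respond_within": timedelta(minutes=5)},
    "P1": {"channels": ["page", "email"],
           "respond_within": timedelta(minutes=30)},
    "P2": {"channels": ["slack", "email"],
           "respond_within": timedelta(days=3)},
    "P3": {"channels": ["email", "slack"],
           "respond_within": timedelta(weeks=2)},
}

def route_incident(severity: str) -> dict:
    """Look up who to notify and how quickly to respond for a severity level."""
    return SEVERITY_POLICY[severity]

print(route_incident("P1"))
```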
Conduct post-mortems and share them publicly
Shortly after an incident, SREs should do two things:
- Address the issues: Critical errors are often patched or hot-fixed in an improvised way that is not a permanent solution. When that is the case, the issues should be placed in the backlog to be revisited by the development teams. Issues that were not fixed on-call should also be reviewed by SREs during working hours.
- Draft a post-mortem: A post-mortem is a briefing on what happened during the incident. It captures all the relevant information: what happened, why it happened, and how to prevent it in the future. Every post-mortem should be clearly documented, with action items placed in a backlog and prioritized according to severity.
It is also a great idea to share post-mortems publicly, since doing so brings transparency and strengthens customer trust.
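Finally, the fields a post-mortem should capture, per the list above, can be kept consistent with a small schema. The sketch below is one possible shape; the field names and example values are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    severity: str  # P0-P3, used to prioritize the backlog

@dataclass
class PostMortem:
    incident_summary: str                 # what happened
    root_cause: str                       # why it happened
    prevention: str                       # how to prevent a recurrence
    action_items: list[ActionItem] = field(default_factory=list)

pm = PostMortem(
    incident_summary="Checkout unavailable for 22 minutes",
    root_cause="Bad deployment rolled out without a canary stage",
    prevention="Require a canary stage in the deploy pipeline",
    action_items=[ActionItem("Add canary stage to CI/CD", "P1")],
)
print(pm.incident_summary, "->", len(pm.action_items), "action item(s)")
```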