Major outages are bound to occur in even the most well-maintained infrastructure and systems. Being able to quickly classify the severity of an incident allows your on-call team to respond more effectively.
Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media and, because your platform is down, revenue losses are mounting every minute. How do you go about getting your application back on track? This is where understanding incident severity and priority can be invaluable. In this blog, we look at severity levels and how they can improve your incident response process.
Severity and Priority: How Are They Different?
In most cases, the severity of an incident is measured by its impact on the end user. Error information coming directly from the monitoring tool helps in classifying the severity level. Every organization defines the severity levels and procedures that work well for it. To get started with defining severity levels, we must first understand how to categorize incidents.
You should ask two major questions:
- Are major workflows now affected?
- Does it interfere with a user’s ability to complete an essential task?
Identifying the most crucial workflows of your apps or services is one of the first steps in defining severity levels. It helps you decide what counts as an incident. Using "SEV" levels, we can classify incidents according to their severity. The more serious the incident, the lower its SEV number: a SEV-1 is the most severe and requires a rapid response.
Every company must understand its own business and team, and the SEV-level descriptions that work best for it. Later in this blog, there is a table that you can use to define severity levels for your organization.
It may appear as if incident severity and priority are one and the same. Isn't it reasonable to prioritize dealing with a catastrophic event over a minor one? In reality, it's more complicated than that for most businesses.
Once information about the error has been received, the incident commander assigns a level of priority to the incident. For example, issues that need to be fixed first are assigned P1 (priority level 1). Severity describes the impact on the user, while priority is the order in which the on-call engineers will work on the issues affecting the infrastructure.
For example, on an e-commerce platform, customers being unable to check out their shopping carts is a high-severity issue. In this specific case, it is a high-priority incident as well. On the other hand, a typo in the brand logo or a font size that is too large is a high-priority incident without being a high-severity one, because customers can still continue to shop on the website.
Let us consider another example. An event causes your app to crash, which prevents users from doing what they need to do, so it has a high severity rating. However, that incident affects only 0.01 percent of your users. It may therefore not be treated as a high priority if there are other incidents affecting a greater number of users.
It's important to know when the two measurements are aligned and when they are not. When something is given a high priority, it doesn't necessarily follow that it is of high severity, and the reverse is also true.
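To make the distinction concrete, here is a minimal Python sketch that tracks severity and priority as two separate fields and builds the on-call work queue from priority alone. The `Incident` class, the `work_queue` function and the SEV/P values given to the three examples are illustrative assumptions, not a prescribed scheme.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    title: str
    severity: int  # 1 = most severe (SEV-1), 5 = least severe (SEV-5)
    priority: int  # 1 = work on first (P1); assigned by the incident commander


def work_queue(incidents):
    """Order incidents by assigned priority, using severity only as a tie-breaker."""
    return sorted(incidents, key=lambda i: (i.priority, i.severity))


# Hypothetical ratings for the three examples discussed above.
incidents = [
    Incident("Crash affecting 0.01% of users", severity=1, priority=2),
    Incident("Checkout unavailable", severity=1, priority=1),
    Incident("Typo in brand logo", severity=5, priority=1),
]

for incident in work_queue(incidents):
    print(f"P{incident.priority} SEV-{incident.severity}: {incident.title}")
```

With these assumed values, the logo typo is worked on before the crash even though the crash is far more severe, which is exactly the case where the two measurements diverge.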
Defining Severity Levels for Your Organization
Not all situations are the same, and not all companies manage them in the same manner. In addition to the consequences of an event, you'll need to consider the following when establishing severity levels and the procedures and expectations that go with them.
A reliability platform like Squadcast and an e-commerce platform will have different ways of defining severity. As each of these has users with different requirements and tolerance levels, it is critical to first understand what the user expectations are.
How to Determine Severity Levels?
One must take into consideration the following before deciding on severity levels:
High and low traffic periods for your service
At certain times of the week, your customer traffic may be low. If an incident occurs at that time, few of your users will be affected. For example, if the shopping cart of an e-commerce site is not functional for certain hours of the day when the traffic is comparatively low, not many users will be affected.
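As a rough illustration of this idea, the sketch below compares the user exposure of the same two-hour outage at different times of day. The traffic profile and the function name `estimated_user_exposure` are made up for the example, and summing hourly active users is only a crude proxy because the same user can be counted in more than one hour.

```python
def estimated_user_exposure(hourly_active_users, start_hour, duration_hours):
    """Sum the active users over the hours an outage spans (wraps past midnight)."""
    return sum(
        hourly_active_users[(start_hour + offset) % 24]
        for offset in range(duration_hours)
    )


# Hypothetical 24-hour traffic profile: quiet overnight, busiest in the evening.
hourly_active_users = [200] * 6 + [1500] * 12 + [4000] * 4 + [800] * 2

night_outage = estimated_user_exposure(hourly_active_users, start_hour=2, duration_hours=2)
evening_outage = estimated_user_exposure(hourly_active_users, start_hour=19, duration_hours=2)
print(night_outage, evening_outage)
```

An identical failure can therefore warrant different severity levels depending on when it happens, so it helps to have your own traffic data at hand when writing the definitions.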
The architecture of your infrastructure
You may be using a microservice-based architecture that has multiple redundancies and can easily scale up under higher user load. In such a scenario, the failure of one component may not be considered a high-severity incident, as a redundant instance can take over. In contrast, if the authentication service goes down, which sometimes cannot be easily replicated, it automatically becomes a high-severity incident: even if the other components are working fine, your users won't be able to use the product.
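One way to express this reasoning is sketched below in Python. The `Component` fields and the SEV numbers returned by `failure_severity` are assumptions chosen for illustration; your own mapping will depend on how your services are actually deployed.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    healthy_replicas: int   # replicas still able to serve traffic
    on_critical_path: bool  # True if users cannot use the product without it


def failure_severity(component):
    """Return a hypothetical SEV level for a failure in this component."""
    if component.healthy_replicas > 0:
        # Redundant replicas absorb the failure, so users should barely notice.
        return 4
    # Nothing left to fail over to: severity depends on whether users are blocked.
    return 1 if component.on_critical_path else 3


recommendations = Component("recommendations", healthy_replicas=2, on_critical_path=False)
authentication = Component("authentication", healthy_replicas=0, on_critical_path=True)

print(failure_severity(recommendations), failure_severity(authentication))
```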
Using SLOs to determine severity levels
Since each service has its own service-level objective (SLO) that defines the level of service it is expected to deliver, we can use it to determine the severity level. For example, if a service's SLO is based on its transaction rate and the number of successful transactions drops below a certain threshold, we can classify it as a high-severity incident.
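The threshold check described above could look roughly like the following Python sketch. The 99 percent target, the 95 percent floor and the returned labels are placeholder assumptions; substitute the objectives your service actually commits to.

```python
def severity_from_success_rate(successful, total, slo_target=0.99, high_severity_floor=0.95):
    """Classify a window of transactions against a success-rate SLO.

    slo_target and high_severity_floor are illustrative defaults, not recommendations.
    """
    if total == 0:
        return "no data"
    rate = successful / total
    if rate >= slo_target:
        return "within SLO"
    if rate >= high_severity_floor:
        return "SLO breach"
    return "high severity"


print(severity_from_success_rate(995, 1000))
print(severity_from_success_rate(970, 1000))
print(severity_from_success_rate(900, 1000))
```

Keeping two thresholds separates a mild breach that needs attention from a drop large enough to be declared a high-severity incident.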
Levels of Severity
Severity definitions are organization-specific. An incident that is classified as SEV-1 in one organization may have a lower severity rating in another. Some organizations also use just three levels of severity. The general rule is that the more user journeys or workflows an incident affects, the higher its severity level.
Some organizations may also categorize severity levels on the basis of the SLIs (service-level indicators) or SLOs (service-level objectives) being affected. The table below lists one of many possible ways to define severity levels.
| Severity | Description |
|---|---|
| SEV-1 | Incidents are usually considered SEV-1 when large-scale failures in your infrastructure negatively affect most users. Critical services are disrupted or unavailable. Database read/write errors, security breaches and similar issues might fall under this level. If third-party services (such as Google SSO) are down and users are unable to sign in, that is also often considered a SEV-1 issue. |
| SEV-2 | A SEV-2 incident is usually declared when user experience is severely affected. This can include unacceptably high levels of latency or a significant breach of SLAs/SLOs. These kinds of incidents have the potential to cause major revenue loss for your organization. Any incident that affects more than 70 percent of the users can be classified as SEV-2. |
| SEV-3 | An occurrence that has just a minimal impact on the infrastructure but nonetheless creates high load or latency issues for your users. This can include unacceptably long website load times, timeouts for shopping carts and other similar issues. |
| SEV-4 | An issue that affects customer experience but doesn't have a major impact on the service's operation. This can include inconsistent page load times, display problems in different browsers and similar issues. |
| SEV-5 | Low-level mistakes, such as formatting or display issues that do not impair usability, are classified as SEV-5. This can include typos in product descriptions, incorrect colors being displayed in brand logos and other issues of that nature. |
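If you want your tooling to share these definitions, the table can be encoded along the lines of the Python sketch below. The `classify` function and its four input signals are hypothetical simplifications of the descriptions above; only the 70 percent figure comes from the table itself.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # critical services disrupted or unavailable
    SEV2 = 2  # user experience severely affected, or SLA/SLO significantly breached
    SEV3 = 3  # high load or latency for users, minimal infrastructure impact
    SEV4 = 4  # degraded customer experience without major operational impact
    SEV5 = 5  # cosmetic issues that do not impair usability


def classify(critical_service_down=False, users_affected=0.0,
             latency_degraded=False, experience_degraded=False):
    """Map a few hypothetical incident signals onto the levels in the table."""
    if critical_service_down:
        return Severity.SEV1
    if users_affected > 0.70:
        return Severity.SEV2
    if latency_degraded:
        return Severity.SEV3
    if experience_degraded:
        return Severity.SEV4
    return Severity.SEV5


print(classify(users_affected=0.8).name)
```

Using an `IntEnum` keeps the convention that a lower number means a more severe incident, so levels can be compared directly.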
It is essential to properly classify incident severity levels to get a head start on solving infrastructure issues. Working with previously defined severity levels helps on-call teams quickly triage major issues. As we have seen in this blog, each organization will have its own specific way of deciding upon the severity and priority of incidents.
As the nature and scale of your infrastructure grow and the needs of your user base evolve over time, you may want to revisit and modify the definitions of severity levels. Continuous learning is an essential part of good incident response. We hope this blog is helpful for you in setting the path for better incident response in your organization.