Building a successful monitoring process for your application is essential for high availability. In the first of this three-part blog series, Safeer discusses the four key SRE Golden Signals for metrics-driven measurement, and the role they play in the overall context of monitoring.
Monitoring is the cornerstone of operating any software system or application effectively. The more visibility you have into the software and hardware systems, the better you are at serving your customers. It tells you whether you are on the right track and, if not, by how much you are missing the mark.
This is the first article in a three-part blog series covering the SRE Golden Signals.
So what should we expect from a monitoring system? Most of the monitoring concepts that apply to information systems apply to other projects and systems as well. Any monitoring system should be able to collect information about the system under monitoring, analyze and/or process it, and then share the derived data in a way that makes sense for the operators and consumers of the system.
The meaningful information that we are trying to gather from the system is called signals. The focus should always be to gather signals relevant to the system. But just as in radio communication, from which this terminology is drawn, noise will interfere with signals. Noise is the unwanted and often irrelevant information that is gathered as a side effect of monitoring.
Traditional monitoring was built around active and passive checks and the use of near-real-time metrics. The good old Nagios and RRDtool worked this way. Monitoring gradually matured to favor metrics-based monitoring, and that gave rise to popular platforms like Prometheus and Grafana.
Centralized log analysis and deriving metrics from logs then became mainstream, with the ELK stack at the forefront of this change. But the focus is now shifting to traces, and the term monitoring is being replaced by observability. Beyond this, we also have all the APM (Application Performance Monitoring) and synthetic monitoring vendors offering various levels of observability and control.
All these platforms provide you with the tools to monitor anything, but they don’t tell you what to monitor. So how do we choose the relevant metrics from all this clutter and confusion? The crowded landscape of monitoring and observability makes the job harder, not to mention the efforts needed to identify the right metrics and separate noise from the signal. When things get complicated, one way to find a solution is to reason from first principles. We need to deconstruct the problem and identify the fundamentals and build on that. In this specific context, that would be to identify what is the absolute minimum that we need to monitor and then build a strategy on that. So on that note, let’s understand the popular strategy used to choose the right metrics.
SRE Golden Signals
The SRE Golden Signals were first introduced in the Google SRE book, which defines them as the basic minimum set of metrics required to monitor any service. This model is about thinking of metrics from first principles, and it serves as a foundation for building monitoring around applications. The strategy is simple: for any system, monitor at least these four metrics: Latency, Traffic, Errors, and Saturation.
Latency is the time taken to serve a request. While the definition seems simple enough, latency can be measured from the perspective of either the client or the server application, and the two give different answers. For an application that serves a web request, the latency it can measure is the time delta between the moment the application receives the first byte of the request and the moment the last byte of the response leaves the application. This includes the time the application took to process the request and build the response and everything in between, which could include disk seek latencies, downstream database queries, time spent in the CPU queue, etc.
Things get a little more complicated when measuring latency from the client perspective, because now the network between the client and server also influences the latency. The client could be of two types. The first is another upstream service within your infrastructure. The second, and more complex, is real users sitting somewhere on the internet, with no way of ensuring an always-stable network between them and the server. For the first kind, you are in control and can measure the latencies from the upstream application. For internet users, employ synthetic monitoring or Real User Monitoring (RUM) to get an approximation of latencies. These measurements get even more complicated when there is an array of firewalls, load balancers, and reverse proxies between the client and the server.
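As a rough illustration of the server-side view, the sketch below times a request handler from inside the application. It is a minimal sketch, not a definitive implementation: `measure_latency`, `handle_request`, and the in-memory `observed_latencies` list are hypothetical names, and timing inside the handler only approximates the first-byte-to-last-byte definition because it excludes socket reads and writes.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real service would export to its metrics backend.
observed_latencies = []


@contextmanager
def measure_latency(sink, clock=time.perf_counter):
    """Record the elapsed seconds between entering and leaving the block."""
    start = clock()
    try:
        yield
    finally:
        # Runs even when the handler raises, so failed requests are timed too.
        sink.append(clock() - start)


def handle_request(payload):
    """Stand-in for real work such as parsing, database queries, rendering."""
    time.sleep(0.01)
    return {"status": 200, "body": payload.upper()}


with measure_latency(observed_latencies):
    response = handle_request("hello")
```

Each request appends one duration in seconds to `observed_latencies`. Client-side latency would have to be measured separately, at the caller.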
There are certain things to keep in mind when measuring latencies. The first is to separate good latency from bad latency, that is, the latency of successful requests from the latency of failed requests. As the SRE book points out, the latency of an HTTP 500 error should be tracked as bad latency and should not be allowed to pollute the HTTP 200 latencies, which could cause an error in judgment when planning to improve your request latencies.
Another important matter is the choice of metric type for latency. Average or rate are not good choices for latency metrics, because a large latency outlier can get averaged out and leave you blind to it. These outliers, otherwise called the “tail”, can be caught if latency is measured in buckets of requests. Pick a reasonable number of latency buckets and count the number of requests per bucket. This allows plotting the buckets as a histogram and surfacing the outliers as percentiles or other quantiles.
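One way to keep the two kinds of latency apart is to label every observation with its outcome. The sketch below assumes, for illustration only, that a 5xx status marks a failed request. The helper names and sample values are hypothetical.

```python
from collections import defaultdict


def outcome_for(status_code):
    """Label a request as 'success' or 'error' from its HTTP status code."""
    # Assumption: only 5xx counts as an error here; adjust for your service.
    return "error" if status_code >= 500 else "success"


latencies_by_outcome = defaultdict(list)


def record(status_code, seconds):
    """Store the latency under the outcome it belongs to."""
    latencies_by_outcome[outcome_for(status_code)].append(seconds)


def mean(values):
    return sum(values) / len(values)


# Illustrative samples: one error fails fast, the other fails slowly.
for status, seconds in [(200, 0.12), (200, 0.15), (500, 0.01), (500, 2.50)]:
    record(status, seconds)
```

With these samples the successful requests average 0.135 seconds. Mixing in the two failures would move the combined average to 0.695 seconds and say little about either group.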
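The bucket approach can be sketched in a few lines. The bucket boundaries and sample data below are assumptions chosen for illustration, and the percentile is only estimated as the upper bound of the bucket it falls into, so its precision depends on how the buckets are chosen.

```python
import bisect

# Hypothetical upper bounds in seconds; the last bucket catches everything else.
BUCKET_BOUNDS = [0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]


def bucketize(latencies, bounds=BUCKET_BOUNDS):
    """Count how many latencies fall into each bucket (value <= upper bound)."""
    counts = [0] * len(bounds)
    for value in latencies:
        # bisect_left returns the index of the first bound >= value.
        counts[bisect.bisect_left(bounds, value)] += 1
    return counts


def percentile_upper_bound(counts, q, bounds=BUCKET_BOUNDS):
    """Return the upper bound of the bucket that contains the q-th percentile."""
    target = q * sum(counts) / 100
    running = 0
    for bound, count in zip(bounds, counts):
        running += count
        if running >= target:
            return bound
    return bounds[-1]


# 95 fast requests, 4 slower ones, and a single 4-second outlier.
latencies = [0.05] * 95 + [0.3] * 4 + [4.0]
counts = bucketize(latencies)
average = sum(latencies) / len(latencies)
```

Here the average stays just under 0.1 seconds, while `percentile_upper_bound(counts, 99)` reports the 0.5-second bucket and the 100th percentile lands in the overflow bucket, which is where the outlier shows up.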
Traffic refers to the demand placed on your system by its clients. The exact metric varies based on what the system is serving, and there could be more than one traffic metric for a system. For most web applications this could be the number of requests served in a specific time frame. For a streaming service like YouTube, it can be the amount of video content served. For a database, it would be the number of queries served, and for a cache, it could be the number of cache misses and cache hits.
A traffic metric could be further broken down based on the nature of requests. For a web request, this could be based on the HTTP code, HTTP method, or even the type of content served. For a video streaming service, content downloads could be categorized by resolution. For YouTube, the number and size of video uploads are traffic metrics as well. Traffic can also be categorized based on geographies or other common characteristics. One way to measure traffic is to record it as a monotonically increasing value, usually of the metric type “counter”, and then calculate the rate of this metric over a defined interval, say 5 minutes.
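The counter-and-rate idea can be sketched as follows. The sample values are made up, `rate_per_second` is a hypothetical helper, and the sketch assumes the counter never resets inside the window, which real metrics systems have to handle.

```python
def rate_per_second(samples, window_seconds):
    """Average per-second increase of a counter over the trailing window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs in time order.
    """
    if len(samples) < 2:
        return 0.0
    end_time, end_value = samples[-1]
    # Keep only samples that fall inside the trailing window.
    in_window = [(t, v) for t, v in samples if t >= end_time - window_seconds]
    start_time, start_value = in_window[0]
    elapsed = end_time - start_time
    if elapsed <= 0:
        return 0.0
    return (end_value - start_value) / elapsed


# A request counter scraped once a minute for five minutes.
samples = [(0, 1000), (60, 1600), (120, 2200), (180, 2800), (240, 3400), (300, 4000)]
five_minute_rate = rate_per_second(samples, window_seconds=300)
```

The counter grew by 3000 requests in 300 seconds, so `five_minute_rate` is 10.0 requests per second.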
Errors are measured by counting the number of errors from the application and then calculating the rate of errors in a time interval. Error rate per second is a common metric used by most web applications. For example, errors could be 5xx server-side errors, 4xx client-side errors, or 2xx responses with an application-level error such as wrong content or no data found. Like traffic, this uses a counter-type metric with a rate calculated over a defined interval.
An important decision to make here is what to consider an error. It might look like errors are always obvious, like 5xx errors or database access errors. But there is another kind of error that is defined by your business logic or system design. For example, serving wrong content for a perfectly valid customer request would still be an HTTP 200 response, but as per your business logic and the contract with the customer, this is an error. Consider the case of a downstream service that ultimately returns a response to an upstream server, but only after the latency threshold defined by the upstream has elapsed. The upstream would consider this an error, as it should. The downstream, however, may not be aware that it breached an SLO with its upstream (the SLO is subject to change and may not be part of the downstream application design) and would consider this a successful request, unless the necessary contract is added to the code itself.
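A minimal sketch of counting errors by category and turning the counts into per-second rates is shown below. The category names, the `app_error` flag, and the request data are assumptions made for the example.

```python
from collections import Counter


def classify(status_code, app_error=False):
    """Return an error category, or None for a successful request."""
    if status_code >= 500:
        return "5xx"
    if status_code >= 400:
        return "4xx"
    if app_error:
        # A 2xx response that the application itself flags as wrong.
        return "app"
    return None


def error_rates(requests, interval_seconds):
    """Errors per second, by category, for the requests seen in one interval."""
    counts = Counter()
    for status_code, app_error in requests:
        category = classify(status_code, app_error)
        if category is not None:
            counts[category] += 1
    return {category: n / interval_seconds for category, n in counts.items()}


# One minute of traffic: 90 good requests and 10 errors of three kinds.
requests = [(200, False)] * 90 + [(500, False)] * 6 + [(404, False)] * 3 + [(200, True)]
rates = error_rates(requests, interval_seconds=60)
```

Whether 4xx responses belong in the error signal at all is a judgment call for each service. The sketch simply keeps the categories separate so that decision can be made later.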
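One way to put such a contract into code is to have the service compare its own elapsed time against the latency budget agreed with its caller. The sketch below is hypothetical: `DeadlineExceeded` and `call_with_deadline` are illustrative names, and the check happens only after the handler finishes, so it reclassifies a late response as an error rather than cancelling the work.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when a response arrives after the agreed latency budget."""


def call_with_deadline(handler, request, deadline_seconds, clock=time.monotonic):
    """Run the handler and treat a late response as an error."""
    start = clock()
    response = handler(request)
    elapsed = clock() - start
    if elapsed > deadline_seconds:
        # The handler "succeeded", but the caller has already given up.
        raise DeadlineExceeded(
            f"took {elapsed:.3f}s, budget {deadline_seconds:.3f}s"
        )
    return response


def fake_clock(times):
    """Test helper: return a clock that yields the given readings in order."""
    readings = iter(times)
    return lambda: next(readings)
```

The injectable `clock` parameter exists only to make the behavior easy to exercise without real delays. Because the budget lives in configuration or code rather than in the handler, it can change along with the SLO.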
Saturation is a sign of how used or “full” the system is. 100% utilization of a resource might sound ideal in theory, but a system that is nearing full utilization of its resources can suffer performance degradation. The tail latencies we discussed earlier could be the side effect of a resource constraint at the application or system level. Saturation can happen to any sort of resource that the application needs. It could be system resources like memory, CPU, or IO. Open file counts hitting the maximum limit set by the operating system and disk or network queues filling up are also common examples of saturation. At the application level, there could be request queues that are filling up, the number of database connections hitting the maximum, or thread contention for a shared resource in memory.
Saturation is usually measured with a “gauge” metric type, which can go up or down, but usually within a defined upper and lower bound. While not a saturation metric, the 99th percentile request latency (or other metrics on outliers) of your service could act as an early warning signal. Saturation can have a ripple effect in a multi-tiered system, where your upstream waits on the downstream service response indefinitely or eventually times out, while additional requests queue up behind it, resulting in resource starvation.
While the Golden Signals covered in this blog are metrics-driven and a good starting point for detecting that something is going wrong, they are not the only things to consider. There are various other metrics that are not necessary to track on a daily basis but are important to investigate when an incident takes place. We will be covering these in Part 2 of this blog series.
Irrespective of your strategy, understanding why a system exists and which services and business use cases it serves is vital. This will lead you to identify the critical paths in your business logic and help you model the metrics collection system based on that.
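A gauge for a bounded resource can be sketched as a value that moves between zero and a fixed capacity. The `PoolGauge` class, the pool size, and the 80% warning threshold below are all assumptions for illustration. A real connection pool would expose these numbers itself.

```python
class PoolGauge:
    """Tracks how many slots of a fixed-size resource are in use."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.in_use = 0

    def acquire(self):
        """Take a slot; return False when the pool is already saturated."""
        if self.in_use >= self.capacity:
            return False
        self.in_use += 1
        return True

    def release(self):
        """Give a slot back; the gauge never drops below zero."""
        self.in_use = max(0, self.in_use - 1)

    def utilization(self):
        """Current saturation as a fraction between 0.0 and 1.0."""
        return self.in_use / self.capacity


WARN_THRESHOLD = 0.8  # hypothetical alerting threshold


def is_near_saturation(gauge, threshold=WARN_THRESHOLD):
    return gauge.utilization() >= threshold


pool = PoolGauge(capacity=10)
for _ in range(8):
    pool.acquire()
```

With 8 of 10 slots taken, `is_near_saturation(pool)` is true. Alerting before the gauge reaches 100% leaves room to react before requests start queuing.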