Our favorite(top?) SRE talks
Over the years there have been a bunch of great talks on site reliability and incident response. Below are a few we thought stood out(in no specific order) and is defintely worth a peek.
Over the years there have been a bunch of great talks on site reliability and incident response. Below are a few we thought stood out(in no specific order) and is defintely worth a peek.
For folks who’ve made post mortems more meaningful at your company, it is important that you spread that learning around. A lot of companies have teams that do postmortems really well and a lot of engineering managers(EMs) want to spread it organically, but writing and following postmortems is the kind of practice that a lot of devs really just don’t think about or care about and it can get extremely hard to force this practice, especially without support from upper management.
In the following example, we will be creating a histogram in Grafana. Our datasource is Prometheus’s cumulative histogram. I have captured the metrics using micrometer’s distribution summary.
One of the most popular cloud disaster recovery models in the industry today is the “pilot light” model where critical applications and data are in already place so that it can be quickly retrieved if needed. A simple question one must ask before adopting this model is what thought has been given to whether the AWS/GCP/Azure APIs will work and if the requisite capacity will be available in the alternate region.
This tip is for those who are using Prometheus federation to monitor multiple clusters. How should alertmanager be configured for multiple clusters? Let us say that if there’s an issue for Cluster A it only needs to send an alert for cluster A? In such cases, every alert should be routed to proper team based on labels (if there is problem with application A on cluster B - team responsible should be notified). In the above case, two alerts are triggered by the same rule.
In order to have a pipeline with great conversion rates, one must integrate a number of design and copy updates into your application funnel for trust-building and user empowerment. These are also called service evidence, a term comes from The Design of Everyday Things by Don Norman.
One of the first things incident managers do when they get an alert page from Zenduty is to check the “Context” tab of the incident. Incident context is extremely critical to get a first responder’s view of what happened and what could possibly have caused it. Context tells you what happened before an incident. In the case of 40–50% of all incidents, Zenduty’s incident context can tell you within 5–10 seconds, what could be the cause of an incident.
Recently, one of our customers, a 20-member NOC team of a large B2C company, had set up Zabbix to monitor a network of over 1000+ servers, routers, and switches. The NOC team wanted to set up alerting, on-call scheduling, and an escalation matrix whenever a critical network component encountered any downtime. The NOC team used Slack as the primary communication channel and Zoom for real-time communication. For NOC teams like these running a very large operation, setting up alerting can be very tricky.