Operations | Monitoring | ITSM | DevOps | Cloud

Blameless

How SLIs Help You Understand Users' Needs

In our article on SLOs, we discussed the need for service level indicators to be relevant to the users’ experience. By consolidating a number of internal metrics into one indicator that reflects the typical use of the service, we can ensure that meeting our SLO means keeping users happy. A good way to think about this is by looking at the user’s experience or journey.

Top Practices for Runbook Automation

Runbooks, also known as playbooks, are documents that walk you through a certain task with specific steps. For example, a runbook for spinning up a new server might ask some questions about the purpose of the server and its estimated load, then lead you to the appropriate instructions and settings. Runbooks ease the cognitive load of these common tasks by clearly outlining the process for each.

SRE: A Human Approach to Systems

In the world of technology, the stakes have never been higher. The move to the cloud and microservices to maximize agility has given way to digital disruptors and unprecedented competitive threats. As distributed systems become increasingly complex, the scale of ‘unknown unknowns’ increases. On top of this, customer expectations are sky-high. The cost of downtime is catastrophic, with customers willing to churn if their needs are not promptly met.

Best Practices for Effective Incident Management

Incident management is a set of processes used by operations teams to respond to latency or downtime, and return a service to its normal state. Incident management practices have long been well-defined through frameworks such as ITIL, but as software systems become more complex, teams increasingly need to adapt their incident management processes accordingly.