Incident Management

The latest news and information on Incident Management, On-Call, Incident Response, and related technologies.

Whose fault was it anyway? On blameless post-mortems

No one wants to be on the receiving end of the blame game—especially in the wake of a major incident. Sure, you know you were the one who made the final change that caused the incident. And hopefully, it was a small one that didn’t cause any SEV-1s. Still, the weight of knowing you caused something bad should be enough, right? Unfortunately, sometimes fingers get pointed, your name gets called, and suddenly, everyone knows that you’re the person who created more work for everyone.

Choosing the Right Metrics for Noiseless K8s Alerting

Watch Ankur Rawal and Dheeraj Reddy discuss how to choose the right metrics for noiseless K8s alerting. They share insights and suggestions drawn from the mistakes hundreds of companies have made while implementing Prometheus Alertmanager in their production systems, and show how much bad monitoring could be costing you. This talk was delivered at PromCon 2023 in Berlin.

What Is the Role of an Incident Commander?

For most businesses, managing major incidents can be intimidating. With a swarm of information coming from different directions, keeping things organized and maintaining clear, effective communication is tough. It only gets worse when there's no defined process to follow. This disorganization confuses everyone, delays responses, and increases the incident escalation rate. Enter the incident commander (IC).

Incident response and awareness acceleration: What we can learn from responders of Queenstown floods

I was visiting Queenstown, New Zealand, last week amid the severe floods, which escalated quickly. As an incident responder myself, I was amazed at the operations and at how fast responders on the ground acted in evacuating people and clearing the grounds. Over 100 people were evacuated in the middle of the night with zero casualties. A commendable job. Here are some observations I made and what we can learn as incident responders ourselves.

A Journey through the Blameless Resource Library

From the very beginning of Blameless, we had two vital missions. First, to address what we saw as a mounting crisis of reliability by offering a comprehensive, easy-to-use reliability platform. Second, to educate the companies facing this crisis on the fundamentals of incident management, cutting-edge best practices, and the cultural values that sustain learning and growth.