The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Going beyond MTTx: measuring what "good" incident management looks like

Traditional MTTx metrics have long been the go-to measure for incident management effectiveness, but they often fail to provide a full picture or drive meaningful improvements. We analyzed data from over 100,000 incidents to develop new industry benchmark metrics that better define what "good" incident management looks like.

Rethinking WhatsApp Alerts - A Data-Driven Approach

WhatsApp has become a major alerting channel for incident response teams. It's popular and, for many, a great alternative to SMS. In our 2024 recap, we mentioned that Spike sent over 25,000 alerts on WhatsApp. It is now the second most-used alert channel for responders on Spike (up from fourth in 2023). But... I will be the first one to admit – the WhatsApp alerts experience needed work to help responders react to incidents quicker!

PagerDuty Setup: From Beginner to Pro in 10 Steps

This comprehensive guide walks you through the complete PagerDuty setup process, organized into 10 steps. We've structured the guide to match your team's growth journey—starting with essential configurations for small teams, advancing to robust solutions for growing teams, and wrapping up with enterprise-grade features for large organizations. By the end, you'll have a fully operational incident management system set up on PagerDuty tailored to your specific needs.

Finding the Right Tools for Digital Transformation

Given the current climate in the federal government, it’s critical that public sector IT leaders find innovative solutions to do more with less. That’s a real challenge for these leaders, who must balance current alert backlogs against their agency’s limited IT budget and resources. Every day, there are more than a thousand alerts to track down, response times are slowing, and some incident managers are burning out.

Feature Spotlight - Task Lists

When an incident occurs, teams often perform a known set of steps in a specific order to help identify and triage the incident. For Base and Advanced plan users, the Incidents menu includes a Task Lists section where teams can build out priority lists for different incident types or use cases, such as a list of failover tasks or the tasks required to perform a deployment rollback. With task lists, Incident Commanders can be sure that resolvers know exactly what needs to be done to quickly resolve incidents.

Scientific Incident Management with Dan Slimmon

Dan Slimmon is an incident management veteran who has worked at Etsy and HashiCorp and now leads consulting and training on pragmatic, non-bureaucratic incident response. In this episode, Dan shares his philosophy on "scientific incident response," the importance of hypothesis-driven troubleshooting, and why incidents should be seen as normal in complex systems.

Opsgenie is shutting down. Here's what that means, and how incident.io can help

Atlassian recently announced they’ll be shutting down Opsgenie, their popular on-call alerting tool. After June 4, 2025, no new Opsgenie accounts will be created, and by April 5, 2027, the service will shut down completely. Users don’t seem happy about it. If you’re currently using Opsgenie, this news is significant. A key part of your incident response process is disappearing, and Atlassian suggests moving to their other products, like Jira Service Management or Compass.

A seven-step framework for running incident debriefs

Ever wrapped up an incident, thought 'Phew, glad that’s over,' only to feel your stomach drop when you see the dreaded "Incident Debrief" on your calendar? We've all been there. Incident debriefs don't need to feel like sitting through your least favorite school subject. They can (and should!) actually be engaging and useful. At incident.io, we've found a simple, repeatable, and blameless framework.

How we responded to a 2+ hour partial outage in Grafana Cloud

On Tuesday, Feb. 18, 2025, we experienced an outage that lasted approximately 150 minutes and impacted roughly 25% of our Grafana Cloud services. To our customers: we are very sorry and more than a little embarrassed that we stepped outside our own processes and advice to cause this. You rely on us to help monitor and troubleshoot your environments, and this type of incident obviously makes it harder for you to do that.