Operations | Monitoring | ITSM | DevOps | Cloud

How We Built Internet's Largest Incident Response Glossary for the Wider Community

Today, I’m excited to share the Internet’s Largest Incident Response Glossary. It’s a collection of over 500 terms covering on-call, alerting, monitoring, and system reliability. It took us over 2 weeks from ideation to completion of this project and in this post, I would like to share how we approached this beast!

Designing smarter on-call schedules for faster, calmer incident response

When an incident wakes your team early in the morning, the last thing you want is confusion about who’s responding or how help will arrive. An effective on-call schedule doesn’t just get the right person online. It helps them stay calm, confident, and capable of solving problems quickly. Done right, your on-call setup becomes a powerful lever for reducing Mean Time to Acknowledge (MTTA), Mean Time to Resolve (MTTR), and the overall stress that incidents place on your team.

Top 5 Incident Response Platforms for 2025

An incident response platform helps organizations manage, track, and resolve IT incidents quickly and efficiently. With the right platform, teams can minimize downtime, reduce the impact of incidents, and improve overall response times. ‍ In this article, we’ll explore the top 5 incident response platforms for 2025, helping you choose the best solution for your needs. ‍

Step-by-step guide for incident response automation (+ tools & tips)

Every minute matters when you're dealing with a security incident. The longer a breach goes undetected and unresolved, the more damage it can cause to your systems, data, and reputation. But traditional incident response is plagued with challenges: alert fatigue, manual processes, skill shortages, and the sheer complexity of modern IT environments. Security teams are drowning in alerts while struggling to respond quickly enough to the threats that matter.

The timeline to fully automated incident response

We speak to engineering teams every day, and everybody knows AI is the future. Some tell us they’re massively accelerated by Claude, or that they’re rebuilding their product, team and ways of working. Cursor and Lovable have announced they’re building the last piece of software. Should we give in to the vibes? Embrace exponentials, and forget that the code even exists? The reality is that things will still go wrong. They always do, at least from time to time.

Postmortem Template to Optimize Your Incident Response

A postmortem template is a structured tool for documenting incidents, understanding their causes, and learning how to prevent them in the future. This article explains the essential elements of an effective postmortem and how ilert can streamline this process, making your incident response more efficient. It also offers a downloadable version of a postmortem template that you can use if you haven't yet utilized an incident management platform in your organization.

Incident Response Management: A Category of Its Own

In recent weeks, I’ve spoken with several Opsgenie customers who are evaluating a migration to ilert after Atlassian’s decision to phase out Opsgenie and fold its functionality into other products. Atlassian is giving Opsgenie users “two options: move to Jira Service Management for robust end-to-end incident management, or move to Compass for alerting and on-call management.” This has raised a broader question in our industry: ‍

Zendesk outage: A case for proactive monitoring and faster incident response

On March 20, 2025, starting at 15:43 AM UTC, Zendesk users globally encountered 503 “Service Unavailable” errors and 5xx server-side issues, disrupting access to critical support tools and communication channels. While immediate mitigations stabilized core services, intermittent issues continued for over 24 hours, underscoring the complexity of multi-pod infrastructure failures.

Incident response and on-call management in one app: Introducing Grafana Cloud IRM

At Grafana Labs, we’re always searching for ways to develop products that give our users the best tooling to help in their day-to-day understanding of their systems. We built OnCall and Incident in Grafana Cloud, our fully managed observability platform, to make it easier to respond to and fix incidents — all on top of the Grafana dashboards you know and love.