Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How Catalog changes the game for long-term maintenance

Every incident platform needs to know who owns what. Which team owns which service. Which backlog to send follow-ups to. Which escalation path to page when something breaks. The problem is that most platforms encode this ownership logic separately in every configuration: alert routing, workflows, ITSM ticket syncing, and more. Each one maintains its own copy of the same information, in its own format.

Product Update - March 2026

IncidentHub's latest product updates focus on improving the public status page, adding integrations with ticketing systems, private status page ingestion, and making the notifications more useful to the end user. Some of these improvements are driven by user feedback. Feedback is what makes the product better, and I am personally grateful to all our customers who have shared their feedback with us.

How agentic AI for ITOps overcomes observability tool gaps

As enterprise ITOps teams monitor increasingly complex, cloud-based, containerized systems, traditional observability practices are struggling to keep up. As IT infrastructure complexity increases, the typical response is to layer on more monitoring, logging, and instrumentation.

Event Intelligence for Agentic IT Operations

Modern IT teams are experimenting with AI agents. But individual agents, working in isolation are not enough. To truly achieve Agentic IT Operations, organisations need a platform — one that coordinates, governs, and contextualises AI-driven actions across the entire IT landscape. That’s where Interlink Software comes in.

The Incident You Never Had: Deterministic Simulations w/ Will Wilson (Antithesis CEO)

Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis, the autonomous testing platform that just raised $105M in a Series A led by Jane Street, Will has spent years building the infrastructure to catch failure modes before they ever reach production. His starting point is uncomfortable: the testing practices most teams rely on are structurally incapable of finding the bugs that cause real incidents.

Beyond the pager: what to do when Opsgenie sunsets

OpsGenie is going away in 2027, forcing a migration decision for thousands of teams. But this isn't just a tooling swap — it's a rare chance to upgrade how you respond to incidents. Because the real pain in incident response isn’t paging. It’s everything that happens after the alert: coordination, clarity, communication, ownership, and follow-through. Most teams solve this through heroics and tool-juggling across chat, tickets, and docs. That approach doesn't scale.

incident.io product showcase: Post-mortems

A full walkthrough of our completely rebuilt post-mortems experience. We cover AI-generated first drafts from your incident data, accuracy review, inline rewriting, a collaborative editor with live incident context, meeting notes with Scribe, and management tooling including dashboards, exports, and analytics. Post-mortems are included in incident.io Response. AI features and Scribe are available on Pro and Enterprise plans.

Announcing the 2026 State of AI-First Operations Report

For years, our annual State of Digital Operations report has been the industry benchmark for understanding how organizations manage incidents, build resilience, and evolve their operational practices. Each year, we survey hundreds of business and operations leaders worldwide to capture the challenges, priorities, and emerging practices shaping digital operations.

Incident Response Reimagined: Accelerating Resolution with AI Agents

Learn how PagerDuty is leveraging Agentic AI to transform the incident lifecycle from reactive firefighting to proactive prevention. Manuel Reis, Software Developer at PagerDuty, demonstrates how new tools like the SRE Agent and Scribe Agent assist engineers during high-pressure outages by autonomously triaging alerts, querying logs in tools like Grafana, and transcribing context directly into incident channels.