
Who's on call? How Claude helped us calculate this 2,500x faster

Schedules are a core part of any on-call system. In ours, they define who to page and when. But people use them in lots of other ways too: checking their next shift, asking for cover while at the gym, keeping a Slack user group up to date, or updating a Linear triage responsibility. For many of our customers, schedules are one of the main ways they interact with our product, and because they’re such a foundational part of On-call, it’s very important they work well.
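To make the "who to page and when" question concrete, here's a minimal sketch of resolving the current on-call from a rotation. Everything in it is an assumption for illustration: the Rotation type, a single layer, and fixed-length shifts. A real schedule model (including ours) is considerably richer, with overrides, multiple layers, and custom handover times.

```go
package main

import (
	"fmt"
	"time"
)

// Rotation is a deliberately simplified schedule: users take
// fixed-length shifts in order, starting at StartsAt. Layers,
// overrides, and custom handovers are ignored in this sketch.
type Rotation struct {
	Users       []string
	StartsAt    time.Time
	ShiftLength time.Duration
}

// OnCallAt returns whoever holds the shift covering t.
func (r Rotation) OnCallAt(t time.Time) string {
	elapsed := t.Sub(r.StartsAt)
	shift := int(elapsed / r.ShiftLength)
	idx := shift % len(r.Users)
	if idx < 0 { // t is before StartsAt: wrap backwards
		idx += len(r.Users)
	}
	return r.Users[idx]
}

func main() {
	r := Rotation{
		Users:       []string{"alice", "bob", "carol"},
		StartsAt:    time.Date(2025, 1, 6, 9, 0, 0, 0, time.UTC),
		ShiftLength: 7 * 24 * time.Hour, // weekly handover
	}
	fmt.Println("on call now:", r.OnCallAt(time.Now().UTC()))
}
```

Even this toy version shows why the question is computational rather than a simple lookup: who's on call is derived from the rotation's parameters at query time, which is exactly the kind of logic that gets expensive when schedules grow more complex.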

What does using AI for post-mortems actually mean?

Everyone is using AI to help with post-mortems now. The pitch is obvious: post-mortems are time-consuming, the blank page is brutal, and AI is very good at producing structured, confident-sounding documents quickly. We're not here to push back on that. We've built AI into our own post-mortem experience, pulling your Slack thread, timeline, PRs, and custom fields together and giving your team a meaningful starting point in seconds. We think that's genuinely valuable, and the teams using it agree.

How it feels to run an incident with AI SRE

We've been building the broader incident.io platform for five years now, and one thing we've learned is that UX matters more here than almost anywhere else. When an incident fires, there's no room for poorly designed interfaces or fumbling through features you haven't touched in a while. The product has to be ergonomic: easy to pick up, easy to navigate, with the right things at your fingertips at exactly the right moment. We've put a lot of effort into getting this right.

Why post-mortem action items die

You can run the best debrief of your life. Honest timeline, blameless tone, real insights. People leave the room nodding. And then nothing happens. This is the last-mile problem of post-mortems - and it's an easy trap to fall into. When you've just been through a stressful incident, getting the system back up is the priority. Once it's over, the post-mortem itself can feel like the finish line. You've documented what happened, been honest about it, identified what went wrong. It feels like the work is done.

How to migrate your paging tool without breaking your team

Most engineering teams don’t migrate their on-call and paging systems unless absolutely necessary. No matter how painful their current solution, it's one of those changes that people put off for as long as possible because the cost is real. The disruption, the retraining, the risk of missing a critical page during the transition. It's not something you do on a whim.

How Catalog changes the game for long-term maintenance

Every incident platform needs to know who owns what. Which team owns which service. Which backlog to send follow-ups to. Which escalation path to page when something breaks. The problem is that most platforms encode this ownership logic separately in every configuration: alert routing, workflows, ITSM ticket syncing, and more. Each one maintains its own copy of the same information, in its own format.
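A sketch may help make this concrete. The types and lookup below are hypothetical, assumed purely for illustration: the point is that one catalog record carries the ownership attributes, and each consumer (alert routing, workflows, ticket syncing) looks them up instead of keeping its own copy.

```go
package main

import "fmt"

// CatalogEntry is a simplified, hypothetical catalog record: one
// place that maps a service to its owning team, escalation path,
// and follow-up backlog.
type CatalogEntry struct {
	Service        string
	OwningTeam     string
	EscalationPath string
	Backlog        string
}

// catalog acts as the single source of truth. Alert routing,
// workflows, and ticket syncing all read from here rather than
// encoding their own copy of the mapping.
var catalog = map[string]CatalogEntry{
	"payments-api": {
		Service:        "payments-api",
		OwningTeam:     "Payments",
		EscalationPath: "payments-oncall",
		Backlog:        "PAY",
	},
}

// routeAlert is one consumer: it resolves ownership at page time
// instead of hardcoding it in the alert-routing config.
func routeAlert(service string) (string, error) {
	entry, ok := catalog[service]
	if !ok {
		return "", fmt.Errorf("no catalog entry for %q", service)
	}
	return entry.EscalationPath, nil
}

func main() {
	path, err := routeAlert("payments-api")
	if err != nil {
		panic(err)
	}
	fmt.Println("paging escalation path:", path)
}
```

When ownership changes, you update the one catalog entry and every consumer picks it up automatically, which is what makes long-term maintenance tractable.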

The post-mortem problem

Post-mortems are one of the most consistently underperforming rituals in software engineering. Most teams do them. Most teams know theirs aren't working. And most teams reach for the same diagnosis: the templates are too long, nobody has time, and nobody reads them anyway. These aren't wrong observations. But they're symptoms, not causes. The actual problem is that somewhere along the way, the post-mortem stopped being a piece of communication and became a compliance artifact.

Keeping it boring: the incident.io technology stack

At incident.io we run a deliberately simple technology stack. Keeping things boring has allowed us to scale from a few hundred customers to several thousand, while having only two platform engineers. In this post I'll walk through the stack, explain some of the choices we've made, and touch on the challenges we're facing as we grow.

Secure access at the speed of incident response

Picture this: it's 2am, your pager goes off, and you're staring at a production database that's on fire. You know exactly what's wrong. You know exactly how to fix it. But you can't touch anything because you're waiting on someone to approve your access request. Meanwhile, your customers are down, your SLAs are bleeding out, and you're refreshing Slack hoping someone in security is awake to click "approve." This is the incident response tax that too many teams pay.

Everything you need to know about ITIL 5, AI and incident management

ITIL 5 launched in January 2026, and for the first time in the framework's 40-year history, AI governance is front and center. If you're running incident management, on-call rotations, or building operational tooling, this matters: the gap between AI adoption and AI governance is about to become a compliance and operational risk issue. I’m not usually a big ITIL fan, but this guidance has some genuinely useful framing and questions.