The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Response Team @ incident.io

When an incident hits, every second counts. The response team at incident.io builds the tools that make sure engineers aren't flying blind when it matters most. Sam, Tech Lead of the response team, takes us inside what it's really like to build the core of incident.io: the high technical bar, the art of prioritisation, and why there's no shortage of meaningful work to do. If you're an engineer who wants to work on something that genuinely makes other engineers' lives better, this one's for you.

Platform Engineering 101: What It Is, How It Differs from SRE and DevOps, & Why It Matters for Incident Response

Platform engineering has emerged as a response to the growing complexity of modern software delivery. As organizations adopt Kubernetes, microservices, CI/CD pipelines, and infrastructure as code, they are creating dedicated teams responsible for building and operating the internal platforms that power developer workflows.

Forwarding Microsoft SCOM Alerts to the Service Desk

Modern IT operations rely heavily on monitoring solutions like System Center Operations Manager (SCOM) to detect issues across servers, applications, and services. While SCOM excels at generating alerts, organizations often struggle to ensure these alerts translate into actionable incidents in their IT Service Management (ITSM) platforms. Without proper integration, critical alerts may be missed, tickets may be created manually, and incident resolution can be delayed.

AI Engineering at incident.io

Working on AI in incident management means there's no playbook. No million blogs. Just building at the forefront of what's possible with AI models. In this video, Martha, Product Engineer on our AI team, talks about what it's really like working with AI that helps engineers respond to incidents faster. She covers the shift from traditional engineering, learning the personalities of different AI models, and why you need to embrace constant change when new models drop all the time.

YouTube Outage (Feb 17, 2026). What Happened?

On February 17, 2026, YouTube went down for users worldwide. Starting around 8:00 PM ET, the platform's homepage, Shorts feed, sign-in system, smart TV apps, YouTube Music, and YouTube Kids all stopped working. Over 21,000 reports were logged on IsDown alone. The error message was the same everywhere: "Something went wrong." For consumer users, it was an inconvenience. For businesses that depend on YouTube — content teams, advertisers, media companies, live streamers — it was a blind spot.

The post-mortem problem

Post-mortems are required, time-consuming, and widely disliked, but they're also one of the biggest opportunities to improve reliability. In this webinar, we talked about how to run post-mortems that actually lead to learning and improvement. We covered why most post-mortems fall flat and how to structure them effectively, and walked through a real example to show what good looks like in practice. The goal: fewer wasted hours, better outcomes, and post-mortems that actually matter.

AI Is Changing Healthcare Faster Than Most Systems Are Ready For

Healthcare is shifting fast, and artificial intelligence is no longer a future concept sitting in research labs or pilot programs. It’s already embedded in clinical workflows, operational systems, and patient interactions, often in ways that feel subtle, uneven, and sometimes uncomfortable.