The latest news and information on incident management, on-call, incident response, and related technologies.

When AI Thinks and Humans Act: The Future of Operational Resilience

Artificial Intelligence has become the sharpest tool in the digital arsenal – detecting anomalies, predicting failures, and uncovering risks before they unfold. Yet even the smartest system can’t roll up its sleeves and fix what’s broken. AI can see the problem. But only people can solve it. That’s the critical gap in today’s automation revolution: turning AI’s insight into human action.

Reliability lessons from the 2025 AWS DynamoDB outage

On October 19th and 20th, 2025, the AWS region US-EAST-1 suffered a massive outage. What started as a 3-hour Amazon DynamoDB outage caused by a DNS issue led to an Amazon EC2 outage that lasted an additional 12 hours before normal service was restored. Over the course of the outage, there were over 17 million outage reports as companies like Snapchat, Roblox, Amazon, Reddit, Venmo, and more were impacted.

Unlock Faster Incident Resolution with PagerDuty + Logz.io

Join us live as we demo how PagerDuty and Logz.io work together to supercharge your Root Cause Analysis. See how real-time observability and enriched incident context can help your team detect, triage, and resolve issues in minutes—not hours. Don’t miss this chance to see the integration in action, ask questions, and learn how to keep your teams in sync while driving continuous improvement. Perfect for anyone looking to level up their incident response!

Top 10 Hospital Messaging Systems (2025): Comparing Communication Tools for Modern Care Teams

Secure and seamless communication is at the heart of effective patient care. Whether coordinating handoffs, requesting consults, activating code teams, or managing after-hours coverage, clinicians rely on messaging systems that are reliable, fast, and built to protect patient data.

Triaging an Incident with a Critical Data Pipeline at Rivian

Rivian makes electric vehicles to advance its mission to keep the world adventurous forever. As software-defined vehicles, Rivian’s R1T and R1S are connected to the cloud from day one, and telemetry data is at the heart of enabling mobile notifications, remote diagnostics, fleet management, and more. With so many critical pipelines in the cloud, observability is a top priority for the data platform.

Work Where Your Teams Already Are with PagerDuty's AI Agents for Slack

Modern operations happen in Slack, where teams spend their days collaborating, troubleshooting, and resolving incidents. And while many incident management tools offer Slack-friendly experiences, they lack the end-to-end capabilities that teams need. During critical moments, other tools may require users to switch between Slack and their own interfaces, creating friction.