
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How to Set Up SMS Alerting w/ OnPage

In this quick tutorial, learn how to set up SMS alerting in OnPage to ensure your team never misses a critical notification. We’ll walk you through the process step by step. This setup ensures reliable message delivery using redundancy rules, so important alerts reach the right person at the right time. Let us know if you have any other questions!

Why SIGNL4 Is the Right Alarm Management Software to Maximize Machine Availability

A plant runs at its best when equipment stays online, processes remain stable, tolerances are met, raw materials are delivered in time, and scrap stays low. That’s how operations teams hit production targets, meet customer SLAs, stay on schedule, keep costs under control, and maintain consistent quality. But does everything always run according to plan? Of course not.

Code Is Cheap, Reliability Isn't: Owning Production in the AI era w/ Swizec Teller

In this episode, Swizec Teller, author of the bestselling Scaling Fast, makes a bold claim: code is cheap, reliability is not. As AI coding tools accelerate feature development, the real competitive advantage shifts to running systems reliably in production. We explore the hidden complexity of SRE work, the addictive nature of agentic coding, and why ownership, not automation, remains at the core of modern software engineering.

Amazon Web Services outage - February 10, 2026

On February 10, 2026, Amazon Web Services (AWS) experienced an outage that triggered widespread reports of CloudFront failures and DNS resolution issues. While AWS later acknowledged the incident, StatusGator detected the disruption earlier using Early Warning Signals, giving customers valuable lead time before the provider confirmed anything publicly.

4 on-call burnout signs (and how to address them)

Being on-call can sometimes feel overwhelming. If that feeling goes unnoticed for too long, it often turns into burnout. Early signs usually show up in how people respond to incidents or how they feel about the schedule. This guide walks through four such signs to watch for before on-call burnout sets in.

Claude outage - February 10, 2026

On February 10, 2026, Claude users around the world began reporting service failures affecting chat sessions, API integrations, and Claude Code workflows. The first verified outage report reached StatusGator at 19:33 UTC. StatusGator issued an Early Warning Signal at 20:24 UTC. Claude did not post an official “Investigating” update until 22:11 UTC. This incident clearly demonstrates the gap between real user impact and official status page updates.
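To make that gap concrete, here is a minimal sketch that computes the intervals between the three timestamps quoted above. The helper name `minutes_between` and the variable names are illustrative only; nothing here reflects how StatusGator itself measures detection time.

```python
from datetime import datetime, timezone


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Timestamps quoted in the summary above (all UTC, February 10, 2026).
first_report = datetime(2026, 2, 10, 19, 33, tzinfo=timezone.utc)
early_warning = datetime(2026, 2, 10, 20, 24, tzinfo=timezone.utc)
official_update = datetime(2026, 2, 10, 22, 11, tzinfo=timezone.utc)

# Time from the first verified report to the Early Warning Signal.
report_to_warning = minutes_between(first_report, early_warning)
# Lead time the Early Warning Signal had over the official update.
warning_to_official = minutes_between(early_warning, official_update)
# Total time from the first verified report to the official update.
report_to_official = minutes_between(first_report, official_update)

print(report_to_warning, warning_to_official, report_to_official)
```

By these figures, the Early Warning Signal came 51 minutes after the first verified report and 107 minutes before the official update, which followed the first report by 158 minutes.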

5 Offbeat on-call rotations that work

Most teams choose standard on-call patterns like weekly or daily rotations. But sometimes a less conventional rotation can solve a specific problem or just fit better with how your team works. This guide walks you through five offbeat on-call rotations. For each, we look at why it might work for you and the challenges involved. This helps you see the full picture before you decide to try them out. Let’s dive in!

Follow-the-sun and other on-call models

Most teams run on-call using rotation-based schedules where responsibility shifts every few days or weeks. But some situations call for different models that change who responds based on time zones, expertise, or the type of incident. This guide walks you through six on-call models that work outside the standard rotation patterns.
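As a rough illustration of the time-zone-based idea, the sketch below picks a responding region from the current UTC hour. The region names, the eight-hour hand-off windows, and the function `on_call_region` are all hypothetical and are not taken from the guide.

```python
from datetime import datetime, timezone

# Hypothetical hand-off table: (start hour, end hour, region), in UTC.
# End hours are exclusive, so the three windows cover the full day.
SHIFTS = [
    (0, 8, "apac"),
    (8, 16, "emea"),
    (16, 24, "americas"),
]


def on_call_region(now: datetime) -> str:
    """Return the region whose window contains the given time.

    Pass a timezone-aware datetime; it is converted to UTC first.
    """
    hour = now.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError(f"shift table does not cover hour {hour}")


print(on_call_region(datetime.now(timezone.utc)))
```

A real schedule would also need to handle overrides, holidays, and escalation, which this sketch leaves out.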

Turning Data Into Decisions with the xMatters Incident AI Agent

When an incident hits, the gap between awareness and action can make all the difference. Responders know the pain: endless tool-switching, chasing updates, and fragmented data. It’s not a lack of capability that slows response; it’s the lack of context and connection. That’s why we built the xMatters Incident AI Agent, a purpose-built, conversational assistant that brings intelligence and automation directly into the heart of incident response.