The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Google Workspace outage on November 12: How StatusGator detected it first

On November 12, 2025, users around the world faced difficulty accessing Google Workspace products including Google Drive, Google Docs, Google Sheets, and Google Slides. While the outage did not impact every user, it was widespread and disruptive. StatusGator detected the incident early using real user data and issued an Early Warning Signal long before Google officially acknowledged the issue.

Jira Service Management (JSM) Review for Incident Management (2025)

Atlassian is shutting down OpsGenie. New sales stopped on June 4, 2025, and the platform will be completely offline by April 5, 2027. As an OpsGenie user, you now face a critical decision: migrate to Jira Service Management (JSM), Atlassian’s recommended path, or choose a different solution. If you’re not sure JSM is the right fit for your team’s incident management needs, this review will help you decide. I signed up for JSM and put it through real-world testing.

Bloom filters: the niche trick behind a 16× faster API

This post is a deep dive into how we improved the P95 latency of an API endpoint from 5s to 0.3s using a niche little computer science trick called a Bloom filter. We’ll cover why the endpoint was slow, the options we considered to make it fast, how we decided between them, and how it all works under the hood.
Sponsored Post

Cascading Failures Aren't Inevitable: Lessons from the AWS DNS Outage

AWS outages grab headlines because they affect millions, but the root cause often comes down to something invisible: DNS failures and cascading service dependencies. The complexity of modern cloud systems, combined with the advanced technology powering platforms like AWS, makes these outages particularly challenging to diagnose and resolve. The recent AWS outage proves one thing: you can't prevent every DNS issue, but you can create resilient architectures and prevent a single failure from taking down your entire service if you test for it.

Weaving AI into the fabric of the company | incident.io

At incident.io, we’ve spent the past year shifting how we work to incorporate AI into both how we build and what we build. The result? AI has become a fundamental pillar of our company. This is the story of how we built reliable AI for reliability itself, reshaping how teams manage and resolve incidents. From early experiments to a company-wide culture of building with AI, this is how we’re redefining incident response for the future.

Replacing AT&T Email-to-Text with OnPage's Critical Alerting

When AT&T officially shut down its email-to-text and text-to-email service on June 17, 2025, a quiet but essential part of many organizations’ communication workflows disappeared overnight. Messages that used to be sent to AT&T email-to-text gateway addresses simply stopped delivering. For teams who relied on those alerts to reach the on-call clinician, engineer, technician, or service lead, this created an unexpected and urgent gap. This wasn’t just a convenience feature going away.

How Can I Use Categories in SIGNL4 to Quickly Identify Alert Types?

When teams manage a high volume of alerts, it’s easy for things to start blending together. A system outage, a temperature warning, a network slowdown – without a way to quickly identify what’s what, it takes longer to triage and prioritize. Especially on mobile, scrolling through a list of similar-looking alerts can slow your response and add confusion during incidents.