The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

On-Call Incident Response When Outages Are the New Normal

Aug 3, 2026 By Falit Jain In Pagerly

If your engineering team feels like the outage alerts have gotten louder in 2026, the data agrees with you. Strong on-call incident response has quietly become the difference between a five-minute blip and a headline. In the week of July 20 to 26, 2026, ThousandEyes tracked 610 global network outage events, up 4 percent from the 587 the week before, with United States outages rising 10 percent to 457 (Network World).

Third-Party Outages: On-Call Lessons From Q2 2026

Aug 2, 2026 By Falit Jain In Pagerly

Cloudflare just published its Q2 2026 Internet Disruption Summary, and the biggest lesson for on-call engineers is uncomfortable: most of the outages that ruin your week are not caused by your own code. Third-party outages, upstream provider failures, cable cuts, DNS misconfigurations, and government shutdowns all produce the exact same symptom your users care about, which is that your product stops working.

Cloud Outage Response: Lessons From July 2026

Aug 2, 2026 By Falit Jain In Pagerly

Cloud outage response got a brutal stress test in July 2026. In the span of nine days, three separate cloud infrastructure failures took large chunks of the internet offline: AWS CloudFront on July 16, Microsoft Azure West US on July 23, and AWS us-west-2 on July 24. None of them was caused by a dramatic data center fire or a nation-state attack. The causes were routing faults, configuration translation bugs, and a piece of networking hardware on the path between a region and a metro area.

Incident Response Lessons From a 3 GW Grid Drop

Aug 1, 2026 By Falit Jain In Pagerly

When a transmission line faulted in Ashburn, Virginia on July 22, 2026, more than 3 GW of data center load vanished from the PJM grid in seconds. That is roughly three percent of total grid demand at the moment it happened, and the grid took about ten minutes to stabilize instead of the milliseconds a routine disturbance normally requires. For anyone who owns a pager, this is more than an energy story.

Incident Response When the Outage Isn't Yours

Aug 1, 2026 By Falit Jain In Pagerly

Most of the outages that will page your team this quarter did not start in your code. They started in the physical world: a storm, a severed fiber cable, a data center losing power, or a government flipping a national switch. That is the uncomfortable takeaway from Cloudflare's Q2 2026 Internet Disruption Summary, published on July 29, and it has real consequences for how on-call teams practice incident response.

Cloud Outage Preparedness: On-Call Lessons for 2026

Jul 31, 2026 By Falit Jain In Pagerly

Cloud outage preparedness stopped being a nice-to-have this month. In a span of roughly 48 hours, Microsoft Azure lost a big chunk of its West US footprint and Amazon Web Services dropped connectivity between its us-west-2 region in Oregon and the Seattle metro. The AWS event alone rippled outward and knocked DoorDash, Reddit, Hulu, Apple Pay, Snapchat, Fortnite, and the PlayStation Network offline for millions of users, according to incident trackers. Neither outage was caused by anything exotic.

Cloud Outage Incident Response: Lessons From 2026

Jul 31, 2026 By Falit Jain In Pagerly

Cloud outage incident response stopped being a hypothetical exercise this summer. In a single stretch of July 2026, three of the biggest cloud providers stumbled in quick succession, and the ripple effects reached apps that millions of people use every day. If your team runs anything on a hyperscaler, the events of the last few weeks are a direct message: the question is no longer whether your provider will have a bad day, but whether your on-call rotation is ready when it does.

When Status Pages Lie: The Incident Detection Gap

Jul 30, 2026 By Falit Jain In Pagerly

On July 28, 2026, roughly 30,000 people flooded Downdetector with reports that Reddit was broken. Feeds would not load, logins failed, and the mobile app hung. Reddit's own status page, meanwhile, showed a calm wall of green: all systems operational. That contradiction is the whole story, and it is not unique to Reddit. It is one of the most common and most damaging failure modes in modern on-call, and it has a name: the incident detection gap.

T-Mobile SOS Outage: Incident Response Lessons

Jul 29, 2026 By Falit Jain In Pagerly

When more than 140,000 people reach for their phones at once and see nothing but the letters SOS, the topic of incident response stops being an abstract engineering concern and becomes something everyone feels. That is exactly what happened on the evening of July 27 into the morning of July 28, 2026, when a nationwide T-Mobile outage knocked huge numbers of devices into SOS-only mode, cutting people off from regular calls, texts, and data.

Dashboards aren't (quite) dead

Jul 29, 2026 By Data In Incident.io

Historically, non-technical stakeholders would’ve had most of their data questions answered either through pre-built dashboards or by asking their Data team (or equivalent). Self-serve analytics tools went a step further by offering safe, governed datasets, built by Data teams, that let non-technical users dig into data without having to worry about how it joins together, how metrics like “revenue” are defined, and so on.
