
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Your AI agents are lost: give them a graph

Jul 21, 2026 By Rootly In Rootly

The biggest limitation facing enterprise AI agents may not be the model. It may be the context surrounding it. Anthony Alcaraz, Senior AI/ML Portfolio Growth Manager at AWS and co-author of O'Reilly's *Agentic GraphRAG*, joins Humans of Reliability to explain why reliable agents need more than a vector database and a large context window. They need structured knowledge they can navigate, memory they can prune, constraints they can follow, and feedback loops that help them improve.

View Video

Don't add a read replica until you've read this

Jul 21, 2026 By Engineering In Incident.io

As the size and complexity of its relational database workload grow, every company eventually offloads work to a read replica. Doing so brings many benefits, but at the cost of increased complexity. This article covers how we dealt with that, what we learned, and some useful techniques. incident.io is an incident management product relied on by thousands of customers to support them through anything from a minor blip to a full outage.

Read Post

Custom shifts for one-off requirements or complex schedules

Jul 21, 2026 By Mandi Walls In PagerDuty

While most on-call schedules are built to represent regular rotations, often on a weekly basis, not all of your on-call needs require the same coverage every week. We’ve added Custom Shifts, a feature of our new Shift-Based Schedules, for maximum flexibility. With Custom Shifts, your team can cover ad hoc needs for special events, major deploys, Failure Fridays, gamedays, or whatever comes up that needs some extra coverage.

Read Post

Spike's MCP is here

Jul 21, 2026 By Spike - incident response platform In Spike

Connect Spike to Claude, Cursor, or any MCP-compatible tool and manage incidents through natural language. Ask your AI what's critical. Tell it to acknowledge an outage. Have it build an on-call schedule. Spike is now wherever you work.

View Video

H1 2026 Cloud and SaaS Reliability Report

Jul 20, 2026 By Hrishikesh Barua In IncidentHub

The first half of 2026 reinforced a key idea about Cloud and SaaS reliability: dependency risk. IncidentHub tracked 30,246 outages across 1,082 providers between January and June 2026. May was the busiest month, with 6,070 incidents. Cloud providers led in the total number of outages (4,723), followed closely by developer tools (4,589).

Read Post

The July 2026 AWS CloudFront Outage: VPC Origins, Cascade Impact, and What Broke

Jul 17, 2026 By Hrishikesh Barua In IncidentHub

On July 16, 2026, AWS experienced a disruption in its CloudFront service, which affected a large number of websites and applications. The outage was caused by a configuration loading failure in CloudFront's VPC Origins feature. This was AWS's most widely felt outage since the one on October 20th last year, which caused widespread damage.

Read Post

Trust, Resilience & AI: A Customer Panel with TD Bank & New York Life

Jul 17, 2026 By PagerDuty Inc. In PagerDuty

What does it really take to be "the calm in the storm" during a major incident? In this candid panel from PagerDuty on Tour, Chris Conklin (Technology Executive AIOPs, TD Bank) and Sam Brinley (CVP Enterprise Cloud Solution Architect & Engineer at New York Life) sit down with PagerDuty to talk through two decades of evolution in IT operations – from the "Wild West" of early network management to today's push into AI and agentic operations.

View Video

What is MTTR, and how can agentic ITOps reduce it?

Jul 16, 2026 By BigPanda In BigPanda

Mean time to resolution (MTTR) measures the average time it takes to restore regular operation for an application, service, or infrastructure component. It’s a key performance indicator (KPI) for IT incident management. To tie MTTR directly to customer satisfaction, you first need to understand how it affects service and application reliability and availability. From there, you can make informed decisions, operate efficiently, and provide a seamless customer experience.

Read Post

On call? Don't miss the next World Cup match.

Jul 16, 2026 By Derdack SIGNL4 In SIGNL4

Plans change, and your on-call schedule should be able to change with them. With SIGNL4, you can quickly arrange shift coverage from your smartphone, so your team stays fully staffed while everyone knows exactly who's on duty. Whether it's a World Cup match, a family event, or any other last-minute plan, SIGNL4 helps you manage stand-ins and shift handovers without phone calls, spreadsheets, or confusion.

View Video

Software Quality Beyond Code: The Three Dimensions of Reliable IT Operations

Jul 15, 2026 By SIGNL4 In SIGNL4

Many teams invest heavily in code quality, architecture, and testing, yet still struggle with outages, slow response times, and unclear ownership. The reason is simple: software quality is about more than technology alone.

Read Post