How AI Agents Are Redefining the SRE Role

Even the best site reliability engineers (SREs) spend too much time doing reactive work—triaging incidents, gathering context, escalating to the right teams, and documenting what happened. That work is essential, but it’s not where an SRE’s highest value lies. These engineers are hired to build and maintain resilient systems, not play air-traffic control with every alert that hits their queue.

Announcing a forthcoming integration with PagerDuty + Azure AI SRE Agent for faster incident response

The energy at Microsoft Ignite this year was electric. AI was everywhere, and the possibilities are limitless. As developers and operations teams explored what AI can do, one thing became clear: the future isn’t about switching between tools. It’s about intelligent agents working together to help humans solve problems faster. At PagerDuty, we’re building on that excitement.

From Reactive Response to Systemic Resilience: The System That Gets Smarter With Every Incident

Most operations teams are stuck in a reactive loop: resolving incidents as they happen, then moving on to fight the next fire. This approach keeps things running in the short term, but it prevents responders from documenting their learnings in a way that improves overall system resilience. There are practical reasons for this.

Five key takeaways from EDUCAUSE 2025: Adopting AI while navigating change

Having just returned from the 2025 EDUCAUSE Annual Conference in Nashville, I want to share some insights on the future of campus IT from the higher education technology leaders in attendance. Every year, this conference provides an opportunity for technology providers and higher ed professionals to connect and explore the latest innovations in higher education technology. Two themes emerged as critical priorities.

Why Agentic AI Adoption Is Accelerating in Europe and What Comes Next

Across Europe, the cautious optimism business leaders held towards AI agents has evolved into more widespread enthusiasm. What was once a curiosity is now core to how many European organizations operate, respond, and innovate. According to PagerDuty’s latest agentic AI survey, three-quarters or more of organizations in France, Germany, and the UK are deploying multiple AI agents. This growing confidence reflects a broader trend.

How to Choose an AI SRE Solution

The AI SRE landscape has exploded over the past year, with vendors racing to add artificial intelligence capabilities to their platforms. For engineering leaders evaluating these solutions, the sheer number of options can feel overwhelming. Some vendors are building AI-native solutions from scratch, while others are retrofitting AI onto existing workflows. Cloud providers are embedding agents into their ecosystems, and observability platforms are adding intelligence layers to their telemetry data.

Work Where Your Teams Already Are with PagerDuty's AI Agents for Slack

Modern operations happen in Slack, where teams spend their days collaborating, troubleshooting, and resolving incidents. And while many incident management tools offer Slack-friendly experiences, they lack the end-to-end capabilities that teams need. During critical moments, other tools may require users to switch between Slack and their own interfaces, creating friction.

We Built an SRE Agent With Memory And It's Transforming Incident Response

If you feel like your incidents are multiplying while your stack gets more complex by the week, you’re not alone. Event volumes keep climbing, signals live in a dozen tools, and human responders are stretched thin. That’s exactly why we built the PagerDuty SRE Agent—a vendor‑agnostic AI teammate that improves with every response to make the next one faster, smarter, and more reliable.

Too Late to Learn: Why Security Post-Mortems Fail and How AI Can Help

An effective post-mortem can turn a security breach into a blueprint for lasting resilience. But too often, in the stress of an incident, documenting what happened takes a back seat to containment and recovery. The resulting analysis relies heavily on memory, scattered notes, and competing narratives. Valuable context gets lost, timelines blur, and lessons that could strengthen defenses never become institutional knowledge.

Your Next Incident Has Already Started. You Just Haven't Noticed Yet.

The best way to minimize the impact of an incident is to catch it early, before small issues snowball into major disruptions. That requires maintaining healthy systems and ensuring sufficient resources are available when problems arise. But developers and IT operations pros working in large enterprises face a challenge: Complex systems operate in an inherently degraded state. In his essay “How Complex Systems Fail,” Dr.