The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Introducing Runner Replicas: Scalable, Reliable Automation for Modern Ops

When you’re responsible for the reliability of complex systems, the execution layer of your automation is not something you want to think about—it should just work. Whether you’re deploying code, patching servers, or responding to an incident at 3 a.m., your automation engine should be as resilient and scalable as the infrastructure it’s operating on.

Service Intelligence Is the Future of Proactive Incident Management

This is the third post in our series on the future of incident management, which builds upon The Future of Incident Management: Your Blueprint for Operational Excellence and How Native Process Automation and Auto-Remediation Drive Operational Excellence. Organizations are facing increasing complexity across their IT landscapes.

What Does a Customer Support Technician Do?

A customer support technician is a technical professional who helps customers solve issues with hardware, software, and IT systems. They’re often the first point of contact when something breaks, whether that’s a computer glitch, a network outage, or a software error. The role is all about troubleshooting, guiding users through solutions, and making sure technology runs the way it’s supposed to.

My Criteria for Automated Incident Response Tools

Managing incidents manually isn’t realistic when their number keeps growing. That’s where automated incident response tools come in. They handle routine tasks so you can focus on actual problem-solving. In this blog, I’ve put together a list of the 9 best automated incident response tools for you. I looked at each one based on four key areas of the incident response process. This will help you see how they handle everything from start to finish.

The Next Wave of Automation Makes More Room for Humans

When a system goes down, the impact isn’t just technical. It’s the people at the center of it who adapt, improvise, apply their judgment, and keep the business moving forward. I’ve worked in operations for more than 25 years, and one thing I’ve learned is that in any system, it’s the humans who are the truly resilient part.

Demo Roundups! Breaking the MTTR Bottleneck: Automating Diagnostics for Modern Incident Response

Discover how PagerDuty Automation eliminates the manual triage bottleneck that's slowing down your incident response. In this demo, you'll see how automating diagnostics can compress resolution times from hours to minutes by instantly analyzing your environment, correlating events across systems, and identifying root causes with transparent AI reasoning.

Introducing the BigPanda observability and monitoring tool rationalization framework

When enterprises run dozens of monitoring and observability tools, performance gaps almost always emerge. By applying the BigPanda Observability Scorecard, our customers consistently see their tool portfolio fall into three groups: In some cases, removing bottom-tier tools can reduce portfolio complexity by a double-digit percentage while cutting operational noise by as much as 35-40%. This simplification reduces costs while creating a leaner, more reliable monitoring environment that strengthens service availability and operational efficiency.

How to analyze observability and monitoring tools for actionability

Choosing the right observability tools is critical to ensure your teams get actionable insights. In this video, we explore how to evaluate observability platforms based on their ability to detect anomalies, link causes, and trigger effective responses.

From plan to practice to prevail: my conversation with Chris Johnson, host of the MSSP 1337 podcast

In cybersecurity, prevention often gets most of the attention. But no matter how strong your defenses are, incidents will happen. And how you respond in that moment of truth defines resilience. That’s why I really connected with a framework Chris Johnson shared with me on the MSSP 1337 podcast, the 3 P’s – plan, practice, prevail.