Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Squadcast Strengthens Its Leadership in IT Alerting and Incident Management in the G2 Spring Report

2025 has already started out to be a remarkable year for Squadcast—with our key wins in the G2 Spring Reports, our acquisition by SolarWinds, and a series of impactful product releases and improvements. Our mission has always been clear: to deliver a unified platform that seamlessly integrates On-Call Management and Incident Response, empowering teams to boost service reliability and productivity—all without the burden of context switching.

Opsgenie Is Sunsetting: What to Look for in an Alternative

Atlassian is retiring Opsgenie, and if you're one of the teams relying on it to manage on-call and incidents, you're facing a tough question: Do you make the forced migration to Jira Service Management or Compass, scramble for a lookalike tool — or use this moment to upgrade your entire approach to incident response? If you’re facing that decision, we get it. Changing tools midstream isn’t ideal (to say the least). But it’s also a rare opportunity to take a meaningful step forward.

Metrics That Matter: Measuring Developer Productivity in the AI Era

In this episode, Ryan McDonald is joined by Mark Quigley, Head of Platform Engineering at Ninety.io, for a conversation that cuts through the noise around developer productivity metrics and AI. Mark dives deep into how teams can measure what matters—without falling into the trap of turning every measure into a target. He shares how tools like Developer NPS, DORA metrics, and balanced scorecards can help teams optimize for both output and well-being—but only when framed with the right intent.

The timeline to fully automated incident response

We speak to engineering teams every day, and everybody knows AI is the future. Some tell us they’re massively accelerated by Claude, or that they’re rebuilding their product, team and ways of working. Cursor and Lovable have announced they’re building the last piece of software. Should we give in to the vibes? Embrace exponentials, and forget that the code even exists? The reality is that things will still go wrong. They always do, at least from time to time.

Infrastructure Monitoring: A Comprehensive Guide to Integrating Effective Alerting

Imagine you’re the IT guardian of a busy company. Every day, you rely on infrastructure monitoring tools to keep an eye on your servers, networks, and applications. These tools are your early warning system – they spot glitches before they become full-blown problems. But what happens when an alert is missed or delayed? That’s where effective alerting comes in.

Mastering incident routing: a critical component in incident management

Imagine this: a high-priority alert is triggered, but it’s routed to the wrong team, or delayed by manual triage. By the time the right person is notified, the issue has escalated, and users are starting to notice. Technical failures don’t always cause these kinds of incidents. More often, they stem from something simpler: poor alert routing.

How to Fine Tune Your IncidentHub Alerts

IncidentHub can send outage alerts to many external systems. You can choose from Slack, Webhook, Email, Discord, PagerDuty, and more. Alerts are effective only when they are relevant and actionable. In this article, we will explore how to fine-tune your IncidentHub alerts to receive only the relevant ones for your third-party services.

OpsGenie vs. PagerDuty: Which Incident Management Tool Should You Choose in 2025

If you’re comparing OpsGenie vs. PagerDuty, there’s something important you need to know right away: OpsGenie is shutting down. OpsGenie has been a trusted ally for incident teams for over a decade. In our Ode to OpsGenie, we celebrated its legacy—from simplifying on-call rotations to reducing alert noise effectively. Atlassian announced that OpsGenie sales will stop on June 4, 2025, with a complete shutdown by April 5, 2027.

Incident management vs. problem management: A practical guide for SREs

In Site Reliability Engineering (SRE), distinguishing incident management from problem management is crucial. While both processes aim to maintain system reliability, they fulfill distinct roles: incident management focuses on quickly resolving immediate disruptions, whereas problem management identifies and rectifies root causes to prevent recurrence. Effectively combining these processes helps minimize downtime, enhances system resilience, and fosters a proactive operational approach.

Do You Still Need an ITSM Platform in 2025?

The world of IT has undergone a seismic shift over the past two decades. What was once a landscape dominated by physical servers, on-premise data centers, and monolithic applications has transformed into a dynamic ecosystem of cloud-native architectures, microservices, and distributed systems. Yet, many enterprises still rely on traditional IT Service Management (ITSM) tools that were designed for a bygone era.