Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How Can I Use Categories in SIGNL4 to Quickly Identify Alert Types?

When teams manage a high volume of alerts, it’s easy for things to start blending together. A system outage, a temperature warning, a network slowdown – without a way to quickly identify what’s what, it takes longer to triage and prioritize. Especially on mobile, scrolling through a list of similar-looking alerts can slow your response and add confusion during incidents.

BigPanda Acquires Velocity: Accelerating the Future of Agentic IT Operations

Today marks an exciting milestone for BigPanda and for the future of IT Operations. We’re thrilled to announce that BigPanda has acquired Velocity, an AI SRE company whose technology and team share our passion for transforming how enterprises keep the digital world running. Velocity brings deep expertise in Site Reliability Engineering (SRE) and major incident response, developed alongside some of the world’s most sophisticated technology organizations.

Why Agentic AI Adoption Is Accelerating in Europe and What Comes Next

Across Europe, the cautious optimism business leaders held towards AI agents has evolved into more widespread enthusiasm. What was once a curiosity is now core to how many European organizations operate, respond, and innovate. According to PagerDuty’s latest agentic AI survey, three-quarters or more of organizations in France, Germany, and the UK are deploying multiple AI agents. This growing confidence reflects a broader trend.

How to Choose an AI SRE Solution

The AI SRE landscape has exploded over the past year, with vendors racing to add artificial intelligence capabilities to their platforms. For engineering leaders evaluating these solutions, the sheer number of options can feel overwhelming. Some vendors are building AI-native solutions from scratch, while others are retrofitting AI onto existing workflows. Cloud providers are embedding agents into their ecosystems, and observability platforms are adding intelligence layers to their telemetry data.

Detecting an AWS Outage and DR Lessons

A few weeks ago, on 20th October 2025, AWS suffered a widespread outage in its US-EAST-1 region that affected a large number of customers globally. More than 1,000 apps and websites were impacted including major banks and popular games, streaming and social platforms such as WhatsApp, Snapchat, Fortnite and Pokémon Go.

Jira Service Management (JSM) Review for On-Call Management (2025)

OpsGenie is shutting down. And Atlassian recommends migrating to Jira Service Management (JSM). But if you’re not sure JSM is the right fit for your team’s on-call management needs, this review will help you decide. I signed up for JSM and put it through real-world testing. I created on-call schedules, rotations, and overrides. Then, I reviewed JSM’s on-call management across 4 key criteria. For each criterion, I shared what I liked and what I didn’t.

How Rootly works with Slack | An end-to-end demo.

Rootly is the AI-native on-call and incident management platform that helps you resolve incidents faster, improve system resilience, and streamline on-call operations. It’s your always-on SRE copilot that automates root cause analysis and identifies patterns that drive continuous improvement—trusted by thousands of companies like LinkedIn, NVIDIA, Replit, Elastic, Canva, Clay, Tripadvisor, and Grammarly.
Sponsored Post

Preparing for cloud failures: Monitoring strategies for distributed hybrid infrastructure

When AWS experienced its recent outage, the ripple effect was immediate. Critical workloads slowed, dashboards went blank, and many teams realized multi-cloud isn't automatically resilient. Cloud-level failures are inevitable due to the interdependent components and complex IT architecture. The recent AWS disruption reminded many teams that the cloud isn't a magic uptime guarantee. Even the most mature providers can-and do-experience large-scale service interruptions.

Event Flows: Deep dive into feature

Managing alert routing in complex environments is hard. When events occur, alerts must reach the right people at the right time, but traditional alert sources struggle with sophisticated, context-aware routing. Event Flows is ilert’s node-based workflow system at the heart of our alerting infrastructure. It enables intelligent event processing, time- and context-based routing, and safe automation, so teams reduce alert fatigue and accelerate incident response. ‍

Service Observability, Service Operations and Service Orchestration: Unifying Visibility and Action Across the Enterprise

For large enterprises, the health and resilience of Business Services define customer experience and business reputation. Yet as technology estates grow in complexity, fragmented toolsets and siloed teams make it difficult to maintain service availability and prevent incidents before they impact the business and ultimately, customers.