Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Frontline Reliability: Protecting User Journeys with SLOs with Shery Brauner (Razor, ex-Zalando)

What does it really take to move from firefighting incidents to building reliability at scale? In this episode of Humans of Reliability, Shery Brauner (Razor, ex-Zalando) shares her unique journey from frontend and backend engineering to leading site reliability practices. She explains why protecting the user journey is the key to effective incident management, how SLOs cut through noisy alerts, and why observability must come first.

Incident post-mortems: the complete, blameless guide

Most companies run post-mortems like autopsies. They dissect the corpse, assign blame, and file it away. The body count keeps rising. Here's what actually works: post-mortems as learning machines. Systems thinking over finger-pointing. Patterns over pain. What you'll get: A copy-paste template, real metrics that matter, and the mindset shift that turns outages into intelligence. Who this is for: SRE leads tired of repeating incidents. Engineering managers who want learning over theater.

Part Two - Event Intelligence vs. AIOps: Key Differences, When to Use Each and Why

The IT environments of large enterprises have become so complex that operational teams have turned to two solution categories in particular to help them improve visibility and gain faster incident response, automate and enable more effective decision-making.

Improving the Developer Experience by Monitoring Third-Party Outages

The role of third-party SaaS and cloud services in the modern software development stack needs no explanation. Primarily due to the ease of setting up and hooking them together, they make the software development lifecycle (SDLC) much easier than it was 10 years ago. No more managing the overhead of installing, configuring, maintaining, backing up, and scaling of source code repos, virtual machines, and CI/CD systems. Some services don't have any in-house options, e.g. payment gateways.

Quick Start Guide: Setting Up SIGNL4 in Minutes

Getting started with SIGNL4 is fast, easy, and doesn’t require any complex setup. This quick guide walks you through the essential steps – from signing up to sending your first alert and adding team members. In just a few minutes, you’ll have a fully functional, mobile-enabled alerting system ready to keep your team informed and responsive.

How to Build a Strategic Roadmap for Site Reliability Engineering Implementation

Getting your site reliability engineering solutions in place can seriously boost how your systems perform. But implementing site reliability engineering (SRE) isn't a simple flip of a switch-it's a process. If you want to keep your systems running smoothly, with minimal downtime and top-notch performance, you need a solid, strategic plan. This roadmap should guide you step-by-step, from setting clear goals to constantly improving your processes.

It's Time to Connect Your Islands of Automation With AI Agents

Automation has transformed incident response within individual teams. Diagnostic scripts, runbooks, and alert systems help engineers troubleshoot and resolve issues more efficiently. Translating those gains across the organization remains a challenge. Most automations are built in silos and not designed to work together. The result: disconnected workflows, inconsistent outcomes, and too much manual effort, leaving teams with less time for the strategic work that drives innovation and resilience.

ilert AI Voice Agent: Deep dive

‍ The ilert AI Voice Agent is designed to transform how on-call engineers handle urgent calls. Instead of waking engineers at 3 a.m. with minimal context, the AI Voice Agent collects essential details first and routes calls intelligently based on relevant, up-to-date information. ‍ The agent works hand in hand with ilert’s Call Flow Builder – a visual tool that lets users design custom call flows by connecting configurable nodes.