Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How observability and AIOps work better together

If you’re juggling complex, cloud-based, containerized systems and aiming to meet high customer expectations, your old monitoring processes probably don’t cut it anymore. Increasing infrastructure complexity means you need to instrument more, log more, and monitor more. That leads to even more complexity. The answer is better observability, right? Yes and no. Observability and monitoring are critical, but they are only part of what you need for service awareness and availability.

Captains Log: A first look at our architecture for Signals

Welcome to the first Signals Captain’s Log! My name is Robert, and I’m a recovering on-call engineer and the CEO of FireHydrant. When we started our journey of building Signals, a viable replacement for PagerDuty, OpsGenie, etc, we decided very early that we would tell everyone what makes Signals unique, and what better way than to tell you how we’re building it (without revealing too much 😉). Let’s jump in.

The New SEC Rules and You

The Securities and Exchanges Commission published new rules for SEC registrants around disclosing incident details and response policies. Compliance with these new rules should be top of mind for any company – even if your org hasn’t hit the milestone of registering with the SEC, you should be prepared to be compliant when you take that step. ‍

What you need to know about the The Digital Operational Resilience Act (DORA)

The European Commission has introduced the Digital Operational Resilience Act (DORA) to bolster the digital infrastructure of the financial sector within the European Union (EU). As part of the EU's wider digital finance strategy, DORA's objective is to create a comprehensive framework governing digital operational resilience. Financial institutions must ensure full compliance with DORA by January 2025.

Mastering Root Cause Analysis: A Guide for Site Reliability Engineers

Site Reliability Engineers (SREs) play a vital role in ensuring the stability and performance of web services and are key in incident management. One of the core skills SREs need is the ability to conduct effective Root Cause Analysis (RCA) when issues arise. This guide is about how to improve your RCA skills for more effective post-incident analysis.Let's dive in.🔖 What is Prometheus Alertmanager? Read here!

The Unplanned Show, Episode 19: Cloud Security response with Ashley Ward

As organizations move to the cloud, where is there overlap between security and IT and engineering? In this session, Dormain will sit down with Orca Security's Principal Technical Evangelist, Ashley Ward, to learn about how working practices have to evolve with the speed of change in the cloud.

What is IT incident management - and how can AIOps optimize it?

Imagine you’re in the middle of a critical project, and suddenly, your system crashes. Or perhaps it’s the middle of the night, and your server goes down, affecting countless users. Some IT incidents are inevitable, but the way you manage them makes all the difference in minimizing their impact. You know that proper incident management is critical – and that incidents can become costly.

How we manage incidents at Datadog

Incidents put systems and organizations to the test. They pose particular challenges at scale: in complex distributed environments overseen by many different teams, managing incidents requires extensive structure and planning. But incidents, by definition, break structures and foil plans. As a result, they demand carefully orchestrated yet highly flexible forms of response. This post will provide a look into how we manage incidents at Datadog. We’ll cover our entire process.