Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

The new principles of incident alerting: it's time to evolve

In the ever-evolving world of software engineering, the landscape is constantly shifting. New technologies emerge, best practices evolve, and how we build and run software continues to change. However, when it comes to incident alerting, it often feels like we're stuck in the past.

Incident management vs problem management

The video explores the difference between incident management and problem management in modern organizations. It describes a common scenario where operations teams focus on immediate fixes, like rebooting systems, without addressing the root causes. Once the immediate issue is resolved, these teams pass the incident report to the developers, who are then responsible for digging deeper to prevent future occurrences.

Generative AI for IT Operations: Your Questions Answered

IT leaders are thrilled about the potential of Generative AI for IT Operations. But they also want to know how it works, why it works, and what it will do for them before taking the leap and adopting this new technology. Allow me to share my perspective on the hype and the truth behind Generative AI. I’m the Field CTO for BigPanda, Operational Intelligence and Automation driven by AIOps.

Observability Pillars: Exploring Logs, Metrics and Traces

The ability to measure the internal states of a system by examining its outputs is called Observability. A system becomes 'observable' when it is possible to estimate the current state using only information from outputs, namely sensor data. You can use the data from Observability to identify and troubleshoot problems, optimize performance, and improve security. In the next few sections, we'll take a closer look at the three pillars of Observability: Metrics, Logs, and Traces.