Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Improve IT Operations with Response Analytics

Your IT team just finished resolving a complex incident, customer service finished their last call about the issue, and your business is back to being fully operational. Now that the storm has passed, you should be planning a postmortem to determine the cause of the incident and lessons learned. Postmortems require specific data that can highlight where your team is succeeding and where they can improve.

What's behind BigPanda's customers' success?

As the Regional VP of Customer Success for the West and Central Region at BigPanda, Chris LaPierre gets a unique opportunity to see first-hand how BigPanda customers use their AIOps platform. Charged with ensuring every BigPanda customer derives high value and return on investment from the solution, BigPanda’s customer success teams make certain customers leverage the AIOps platform to increase their bottom line.

Outage Alert: Top 5 Outages of Q1 2022

By now it’s no secret that system outages and website downtime are more widespread and frequent than ever. In fact, the frequency of outages jumped 9% in just the first week of 2022. This can be attributed to a rapid increase in traffic and reliance on tech infrastructures – resulting in connectivity, server, and other technical issues that are alternately unforeseen and unavoidable.

Managing Burnout | Tips To Minimize The Impact

Burnout is real. Today, the source of burnout can be anything from pandemic fatigue to the onslaught of political divisiveness to simply the pace of life worldwide. Whatever the culprit, we’re living in a stressful time. People working in cloud native environments definitely feel burnt out. Silicon Valley investor Marc Andreessen famously said, “Software is eating the world,” and that seems to be quite true. High demand is fueling churn. System and cloud operators feel pressure.

Accelerate incident investigations with Log Anomaly Detection

Modern DevOps teams that run dynamic, ephemeral environments (e.g., serverless) often struggle to keep up with the ever-increasing volume of logs, making it even more difficult to ensure that engineers can effectively troubleshoot incidents. During an incident, the trial-and-error process of finding and confirming which logs are relevant to your investigation can be time-consuming and laborious. This results in employee frustration, degraded performance for customers, and lost revenue.

The Pros and Cons of Embedded SREs

To embed or not to embed: That is the question. At least, that’s one of the questions that companies have to answer as they decide how to implement Site Reliability Engineering. They can either embed SREs into existing teams, or they can build a new, separate SRE team. Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

Product update: ensure consistent data across all your retros with two new features

FireHydrant captures your incident, from declaration through remediation, and gives you a framework to run your retrospectives. But retrospectives are only as effective as their inputs. Now we're delivering a better way to learn from and analyze retrospectives by guaranteeing consistent, structured, and sufficient data from your team.

OnCallogy Sessions

Being on call is challenging. It means signing up to operate complex services in a totally interruptible manner, at all hours of the day or night, with limited context. It’s therefore critical to have proper on-call onboarding procedures, offer continuous training sessions, and continuously improve documentation. We also need to make sure people feel safe by providing ways to reduce their stress, and make room for questions to surface all sorts of uncertainties around our operations.

Conflict Management and the Major Incident Management Process

Major incidents are, by their very nature, stressful and intense. The ITIL 4 definition of a major incident is: High-stress situations can cause conflict that, left unchecked, could delay the fix effort. Since we already have a definitive guide on incident management, this blog post will focus specifically on the major incident management process.