Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

The Unplanned Show, Episode 4: Sriram Subramanian on Responsible Generative AI

Generative AI is a rapidly evolving ecosystem that is drawing a lot of attention. In this episode, Dormain Drewitz asks Sriram Subramanian about the main challenges of implementing generative AI responsibly, including content that is harmful, inaccurate, or in violation of privacy or security standards. Sriram discusses Microsoft’s 6 tenets of responsible generative AI, as well as the notion of shared responsibility between the platform providers and foundational LLMs on one side and the developers and data engineers building on top of them on the other. Sriram also answers questions about where to get started safely with generative AI and shares his framework for identifying opportunities to add value.

Improve Visibility and Capture More Data with Triage Incidents

As new incidents emerge, there are often many unknowns about the size, severity, and cause of the problem. Sometimes it’s not clear if the problem is an incident at all. That’s where introducing a triage stage to your incident management process can help. In this post, we’ll look at the benefits of adding a triage layer to your incident management, and how Rootly’s Triage feature allows you to seamlessly transition from triage to real incident (or false alarm).

Unleash the true power of AIOps with BigPanda New Generative AI

IT response teams find themselves battling against an overwhelming onslaught of incidents. Frustratingly long response times, challenges with prioritization, and the relentless pursuit of root cause are formidable adversaries that test even the most skilled teams. I remember customers’ electrifying anticipation with AI and automation a decade ago. They hoped AI could be used to instantly decode the business impact of incidents and automation to respond to incidents without human intervention.

PagerDuty Extends Operations Cloud Leadership into AIOps and Automation

Forrester Names PagerDuty a Leader in first-ever Process-Centric AIOps Wave

From helping pioneer the DevOps movement to establishing best practices around service ownership to being the standard in incident response, PagerDuty has a long history of leadership. PagerDuty is honored to add to this list and now be recognized as a leader in the AIOps and Automation space by Forrester.

The differences between reactive vs proactive incident response

Most commonly, businesses take a reactive approach to incident management. After all, the concept of incident response seems inherently reactive. However, it is possible—and often necessary—to take more proactive measures. This entails identifying potential problems and taking steps to remediate them before they become incidents.

Effective incident escalations

In the ever-evolving digital landscape, every organization must confront its fair share of incidents. Regardless of the sector or size, one common thread weaves through them all: the need for effective incident management. A crucial part of this management is incident escalation, a topic on which we've had many discussions with various companies.

5 Takeaways from Gartner's Latest AIOps Analysis

If you’re still unpacking the latest terminology from Gartner’s 2023 AIOps market update, you aren’t alone. Subject matter experts from Moogsoft recently joined thought leaders from TIAA and Windward Consulting for a debrief in the panel webinar Accelerating Your AIOps Journey. Almost half of technology leaders looking to improve productivity and fuel greater collaboration are struggling to explain AIOps use cases, benefits, and value to other business leaders.

Incident severity: why you need it and how to ensure it's set

Defined severity levels quickly get responders and stakeholders on the same page on the impact of the incident, and they set expectations for the level of response effort — both of which help you fix the problem faster. But sometimes, for whatever reason, a severity level just doesn’t get set. Maybe there’s confusion around what severity level to use. Or maybe you have a low barrier to declaration and your responders just need a little nudge.

Sponsored Post

Improve MTBF and MTTR for your Application Platforms by using MESH Observability

When businesses look at how best to understand the performance levels of their platforms, some of the best incident management metrics to look at are Mean Time Between Failures (MTBF) and Mean Time To Resolution (MTTR). These two measurements give a good indication of the health and speed of the system, as well as the ability of the platform either to handle detected anomalies itself or to flag them so that others can resolve them.
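
As a rough illustration of how these two metrics are commonly computed, here is a minimal sketch. It assumes the usual definitions (MTBF as total uptime divided by the number of failures, MTTR as total time to resolve divided by the number of incidents) and that incidents do not overlap and fall inside the observation window. The function name `mtbf_and_mttr` and the sample incident data are hypothetical and are not taken from the post or from any vendor's product.

```python
from datetime import datetime, timedelta


def mtbf_and_mttr(incidents, window_start, window_end):
    """Return (MTBF, MTTR) as timedeltas.

    incidents: list of (started, resolved) datetime pairs.
    Assumption: incidents do not overlap and lie inside the window.
    """
    if not incidents:
        raise ValueError("at least one incident is needed to compute MTBF/MTTR")

    # Total time spent in an incident (downtime) across the window.
    downtime = sum((resolved - started for started, resolved in incidents), timedelta())
    # Everything else in the window counts as uptime.
    uptime = (window_end - window_start) - downtime

    count = len(incidents)
    return uptime / count, downtime / count


# Hypothetical data: a 30-day window with a 2-hour and a 4-hour incident.
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 12, 0)),
    (datetime(2024, 1, 20, 8, 0), datetime(2024, 1, 20, 12, 0)),
]

mtbf, mttr = mtbf_and_mttr(incidents, window_start, window_end)
print(mtbf.total_seconds() / 3600)  # MTBF in hours: 357.0
print(mttr.total_seconds() / 3600)  # MTTR in hours: 3.0
```

Teams differ on details such as whether MTTR ends at mitigation or at full resolution, so the incident timestamps fed into a calculation like this should follow whatever definition the organization has agreed on.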