Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

The Unplanned Show, Ep. 29: Major Incident Management with Davis and Chris

Not all incidents are created equal. How do you handle major incidents so that they don't spiral into a chaotic mess, incinerating productivity across too many teams? How do you prevent major incidents and learn from the ones you've had? "Major Incident Management" has been a practice for a long time, but as companies depend even more on digital services and revenue channels, while trying to do more with the same or less, something has to change.

Navigating the Evolving Landscape: A Deep Dive into REST API Versioning Strategies

In the ever-evolving landscape of APIs, ensuring seamless interactions and managing changes becomes crucial. While innovation and adaptability are essential, maintaining backward compatibility is equally important to avoid disruption for existing users. This is where REST API versioning comes into play. Versioning allows you to introduce new features or changes to your API in a controlled manner, while simultaneously keeping older versions running smoothly.

Negotiating Priorities Around Incident Investigations

There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions with multiple stakeholders. The problems I’ll explore in this blog—from the SRE perspective—are about time pressures (when to ship the investigation) and the type of report people expect.

Combating IT Alert Fatigue

With the growing complexity of IT systems, managing alerts and notifications without succumbing to the crippling effects of alert fatigue has never been more challenging. Alert Fatigue occurs when the volume of notifications makes it impossible to discern signal from noise, desensitizing the recipient to warnings, some of which end up representing critical issues.

Welcome to StatusCast Demo Video!

Welcome to the StatusCast Demo Video – Your Ultimate Guide to Seamless Status Communication! 🚀 About StatusCast: StatusCast is a cutting-edge platform designed to revolutionize the way you communicate service status and incidents to your users. Whether you're a tech company, SaaS provider, or any organization relying on online services, StatusCast ensures that you keep your users informed, engaged, and satisfied.

Finally: alerting and on-call scheduling for how you actually work

TL;DR You deserve a better alerting and on-call tool. So we built Signals. In our early days, we often used the tagline, “You just got paged. Now what?” It encapsulated how FireHydrant solved for all of the messy bits that come after your alert is fired, from incident declaration all the way through to retrospective. At the time, we saw alerting and on-call scheduling as a solved problem.

Integrating Prometheus AlertManager with PagerDuty in Calico

In the fast-paced world of Kubernetes, guaranteeing optimal performance and reliability of underlying infrastructure is crucial, such as container and Kubernetes networking. One key aspect of achieving this is by effectively managing alerts and notifications. This blog post emphasizes the significance of configuring alerts in a Kubernetes environment, particularly for Calico Enterprise and Cloud, which provides Kubernetes workload networking, security, and observability.