The latest news and information on incident management, on-call, incident response, and related technologies.

Beyond Connectivity: The Expanding Role of APIs in DevOps and Incident Management

In today’s hyperconnected world, APIs are no longer just tools for integrating software; they are the driving force behind modern DevOps and incident management strategies. As organizations prioritize speed, scalability, and resilience, APIs have grown from enablers of connectivity into essential components for streamlining workflows, improving collaboration, and accelerating incident resolution.

Honeybadger and ilert: smart incident response

We're thrilled to announce a native integration with ilert, combining Honeybadger's full-stack application monitoring with ilert's real-time alert routing and on-call management platform. ilert handles alert routing, escalations, and on-call scheduling, ensuring critical issues always reach the right person at the right time.

Survey: 88% of Execs Expect an Incident as Large as the July Global IT Outage Within the Next Year

By Debbie O’Brien, Chief Communications Officer and Vice President of Global Social Impact at PagerDuty

In today’s digitally connected world, IT outages can be inconvenient at best and extremely challenging at worst.

New ServiceNow Integration (Beta) Powers More Efficient ITSM

Today, we’re excited to announce the release of our new ServiceNow integration in beta — designed to give engineers even more control to manage and automate incidents in FireHydrant while seamlessly keeping the rest of the organization aligned in ServiceNow.

Home Call Survival Guide

Whether it’s your first or hundredth home call shift, preparing yourself both physically and mentally is crucial. These shifts can be unpredictable, demanding, and emotionally taxing, making it essential to prioritize your well-being while maintaining your readiness to provide the best possible patient care. By adopting effective time management, organization, and healthy strategies, you can confidently navigate the unique challenges of home call shifts.

Update December 2024 - Intelligent event filters and enhanced manual alarm distribution

In our December update, we have significantly revamped and improved manual alerting. If you need to carefully evaluate incidents before distributing them manually to the respective teams or want to send critical operational updates to relevant personnel, you’ll love the new features we’ve introduced! Additionally, we’ve added intelligent filtering options for events that arrive automatically.

Reducing noise: configuring alert processing with Terraform

As alert volumes grow, staying focused on the most critical alerts becomes more and more of a challenge. Reducing alert noise, meaning preventing both the creation of too many alerts and excessive user notifications of any kind, is needed to ensure an efficient alert response. A detailed explanation of this topic is given in this blog post; a flexible and automated setup for your relevant resources can be achieved with Terraform using the ilert Terraform provider.
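
To make the idea of noise reduction concrete, here is a minimal Python sketch of one common approach: suppressing repeated events that share a key within a time window. It is not the ilert Terraform provider's configuration and makes no claim about how ilert processes alerts; the function name `suppress_duplicates`, the event keys, and the five-minute window are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def suppress_duplicates(events, window=timedelta(minutes=5)):
    """Keep an event only if no event with the same key was kept within `window`.

    `events` is an iterable of (timestamp, key) tuples (hypothetical shape).
    """
    last_kept = {}  # key -> timestamp of the most recent event we kept
    kept = []
    for timestamp, key in sorted(events):
        previous = last_kept.get(key)
        if previous is None or timestamp - previous >= window:
            kept.append((timestamp, key))
            last_kept[key] = timestamp
    return kept


# Hypothetical incoming events: three for the same check, one for another.
events = [
    (datetime(2024, 12, 1, 9, 0), "db-cpu"),
    (datetime(2024, 12, 1, 9, 1), "db-cpu"),   # repeat within 5 minutes
    (datetime(2024, 12, 1, 9, 2), "api-5xx"),
    (datetime(2024, 12, 1, 9, 7), "db-cpu"),   # outside the window again
]

kept = suppress_duplicates(events)
for timestamp, key in kept:
    print(timestamp.strftime("%H:%M"), key)
```

With this sample data, the 09:01 repeat is dropped and the other three events would go on to create alerts. A real setup would more likely express such rules declaratively, which is where a Terraform-managed configuration comes in.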

What is MTTR and How Does It Impact Your Bottom Line?

Mean time to repair (MTTR), sometimes referred to as mean time to resolution, is a popular metric among DevOps and site reliability engineering (SRE) teams. MTTR indicates the overall availability and disaster recovery capability of your IT assets or application workloads. The acronym MTTR can cause some confusion since it has different meanings across different industries. Sometimes, MTTR refers to mean time to respond: the amount of time needed to react to a problem.
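
The difference between the two readings of MTTR is easier to see with numbers. The following Python sketch computes both from a small, made-up incident log. It assumes that repair time runs from the moment an incident opens until it is resolved, and that response time runs from open until acknowledgement; other teams draw these boundaries differently. The `Incident` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    opened: datetime
    acknowledged: datetime
    resolved: datetime


def mean_duration(durations):
    """Average a collection of timedelta values."""
    durations = list(durations)
    if not durations:
        raise ValueError("at least one incident is required")
    return sum(durations, timedelta()) / len(durations)


def mean_time_to_respond(incidents):
    # Assumption: "respond" means open -> acknowledged.
    return mean_duration(i.acknowledged - i.opened for i in incidents)


def mean_time_to_repair(incidents):
    # Assumption: "repair" means open -> resolved.
    return mean_duration(i.resolved - i.opened for i in incidents)


# Made-up sample data, for illustration only.
incidents = [
    Incident(datetime(2024, 12, 1, 9, 0), datetime(2024, 12, 1, 9, 5), datetime(2024, 12, 1, 9, 30)),
    Incident(datetime(2024, 12, 3, 14, 0), datetime(2024, 12, 3, 14, 10), datetime(2024, 12, 3, 15, 30)),
    Incident(datetime(2024, 12, 7, 22, 15), datetime(2024, 12, 7, 22, 30), datetime(2024, 12, 7, 23, 15)),
]

print("Mean time to respond:", mean_time_to_respond(incidents))  # 0:10:00
print("Mean time to repair: ", mean_time_to_repair(incidents))   # 1:00:00
```

The same three incidents yield ten minutes under one definition and an hour under the other, which is why it helps to state which MTTR a report refers to.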