The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.
Today, we’re announcing major new updates to Honeycomb’s PagerDuty integration. These updates put more of the information you need into PagerDuty notifications and allow for greater configurability. These enhancements are available to all users who leverage Honeycomb Triggers and Burn Alerts to send notifications via PagerDuty.
What’s just as important as resolving an impacted service? Providing detailed yet digestible updates to your communities and stakeholders. A recent update to StatusCast adds three new status types that can be assigned to your components. Detailed communication is an essential component of incident response and management, and additional status types provide your users with a more granular view of incident activity.
How are you tracking the long-term operation and health indicators for your micro and macro services? Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are prized (but sometimes “aspirational”) metrics for DevOps teams and ITOps analysts. Today we’ll see how we can leverage SignalFlow to put together some SLO error budget tracking (or easily spin up the same with Terraform)!
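For readers new to the idea, the arithmetic behind an error budget can be sketched in a few lines. This is plain Python rather than SignalFlow or Terraform, and it is not taken from the post itself: the function name `error_budget`, its parameters, and the sample numbers are all hypothetical, chosen only to illustrate how an SLO target turns into a budget of allowed failures.

```python
def error_budget(slo_target: float, total_events: int, bad_events: int) -> dict:
    """Summarize SLI and error-budget usage for one reporting window.

    Hypothetical helper for illustration only. slo_target is a fraction
    such as 0.99, meaning 99% of events are expected to be good.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events < 0 or bad_events < 0 or bad_events > total_events:
        raise ValueError("event counts are inconsistent")

    # The budget is the number of bad events the SLO tolerates in the window.
    allowed_bad = (1.0 - slo_target) * total_events

    # SLI: fraction of good events; an empty window counts as fully healthy.
    sli = (total_events - bad_events) / total_events if total_events else 1.0

    # Fraction of the budget already spent (values above 1.0 mean overspent).
    consumed = bad_events / allowed_bad if allowed_bad else 0.0

    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
        "slo_met": sli >= slo_target,
    }


if __name__ == "__main__":
    # Made-up sample window: 10,000 requests, 25 failures, 99% target.
    report = error_budget(slo_target=0.99, total_events=10_000, bad_events=25)
    for key, value in report.items():
        print(f"{key}: {value}")
```

With a 99% target over 10,000 requests, roughly 100 failures are tolerated, so 25 failures would use about a quarter of the budget. A monitoring system would compute the same ratio continuously over a rolling window instead of a single batch.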
World-class incident responders are a strategic asset in today’s world, where the frequency and sophistication of cybersecurity attacks continue to increase every year, as do the associated financial damages. As such, more and more organizations are looking to grow their cyber incident response expertise, both with in-house staff and by engaging third-party experts.
We've all done it: "that'll be simple, I'll just write a quick script and..." In the case of calculating on-call pay, we really have done it before: our team have built the on-call pay scripts for several companies, and each attempt was a painful, error-prone process. While we believe everyone on-call should be paid for their inconvenience, relying on someone's side-project or back-of-napkin maths to calculate pay leads to mistakes, frustration, and wasted time.
We get it – incidents happen. What differentiates resilient teams from others is how they learn from them, using each one as an opportunity to find the biggest improvements in how they work. Incident timelines are one of the simplest and most effective tools available to you when it comes to learning from an incident. To get the biggest improvements after an incident, it’s vital that you ensure they’re accurate and useful.
Companies can take big strides toward “preventing preventable” incidents by minding what they measure. What’s in a name? In “Measuring what matters,” one of the panels at our RESOLVE ’22 event, the three words in the title reflect a plan successful IT Ops teams have embraced to reduce the complexity of their reporting systems, giving companies a faster path to making more effective use of all the IT resources at their disposal.