FireHydrant

firehydrant

Announcing Our Series A

It was Friday, right around quitting time, and my plans for the evening involved a great cocktail, hanging out with friends, and maybe continuing to binge The Office. Sadly, there was a problem. Our alerting system detected an enormous and immediate spike in errors. The error description was along the lines of “table ‘servers’ does not exist” and thousands of customers couldn’t use a large cloud provider’s services.

Failover Conf Wrapup

Failover Conf was held online on April 21, 2020. The folks at Gremlin came up with the idea of a virtual conference about reliability after many in-person conferences started being postponed or canceled due to COVID-19. The conference was a lot of fun to attend. I’ll be sharing some of my thoughts on the event and the talks I was able to catch. The videos for the talks haven’t been posted yet, but I’ll update this post with links once they are.

Advice for On-call Teams During COVID-19

I’ve offered up some tips for folks who are on call during the COVID-19 crisis, but I thought it would be helpful to get some more ideas from people with different perspectives. So I reached out to some people I trust to see what they had to say. They all have different viewpoints, but some themes emerge, like managing alerts, having empathy, and practicing self-care. The participants, in alphabetical order: Aaron Aldrich is a Developer Advocate at LaunchDarkly, with a focus on DevOps.

Q&A with Alex Hidalgo on SLOs

Alex Hidalgo is a Site Reliability Engineer at Squarespace, and he’s currently writing a book called Implementing Service Level Objectives for O’Reilly Media. The first three chapters of the book are available now through O’Reilly’s early access program. I had a chance to read those chapters and ask Alex some questions about service level objectives and reliability. Thanks, Alex, for sharing your knowledge.

Announcing Ticketing

Incidents come up quickly, and it can be challenging to keep track of critical tasks, both in the moment and after an incident is resolved: what was done by whom during the incident, and what still needs to be completed. In an effort to continue simplifying your incident response process, today we are happy to announce an overhaul of ticketing and task tracking on FireHydrant, along with a major overhaul of our JIRA integration.

Make the most of FireHydrant's Service Catalogs with these 4 tips

Outages are inevitable. It is how we respond that can make or break our company. In this post, we will talk about how Service Catalogs can impact your incident response process and make it more effective. When a company has just a handful of services, it can be relatively easy to figure out who to call when something breaks. But when companies are at the stage of having dozens of services to manage, figuring out who to page or reach out to can be a challenge.

3 Ways to Help CS and Engineering Work Better Together

As Engineering teams start spending more time and effort on incident response, they usually focus on improving processes within their own team. We think there are additional benefits that can come from a holistic approach to improving incident response across your organization. In this post, we will explore how you can enable Engineering and Customer Success teams to work together more effectively when an incident occurs.

Severity Matrix Updates

We’re on a mission to make responding to incidents a bit less chaotic. One of the best features we offer (we’re definitely not biased, no way) is a simple way to define how a severity gets determined when you open an incident. We call it the severity matrix, and today it has a new look. Previously, we had a preset list of conditions and impact that allowed you to pick a severity that matched them.

Announcing Runbooks

Since the beginning, we’ve wanted to make it faster, easier, and even a joy to respond to incidents. We’ve had the typical components of incident response for a while, but orchestrating them was a manual task for our users. Today we’re bringing together all the features already available in our incident response tool in our newest release: Runbooks.