FireHydrant

3 Ways to Help CS and Engineering Work Better Together

As Engineering teams spend more time and effort on incident response, they usually focus on improving processes within their own team. We think a holistic approach to improving incident response across your organization brings additional benefits. In this post, we will explore how you can enable Engineering and Customer Success teams to work together more effectively when an incident occurs.

Severity Matrix Updates

We’re on a mission to make responding to incidents a bit less chaotic. One of the best features we offer (we’re definitely not biased, no way) is a simple way to define how a severity gets determined when you open an incident. We call it the severity matrix, and today it has a new look. Previously, we had a preset list of conditions and impact that allowed you to pick a severity that matched them.

Announcing Runbooks

Since the beginning, we’ve wanted to make it faster, easier, and even a joy to respond to incidents. We’ve had the typical components of incident response for a while, but orchestrating them was a manual task for our users. Today we’re bringing together all the features already available in our incident response tool in our newest release: Runbooks.

A single person on-call "rotation" is a critical vulnerability

One of the most common complaints we hear from operations and site reliability engineers is about the quality of life impacts and the resulting stress imposed by their on-call responsibilities. Most of us are already aware that a proper on-call rotation is critical to our engineering organization’s health in terms of both immediate incident response and long-term sustainable growth.

Open Source can be a silver bullet, but your application might be a werewolf

I was reminiscing with an old co-worker about an incident that happened at a past job. You know the one: you install a library that makes some task of yours simple, only to discover that the library makes things worse. This incident in particular involved the way that images were served out of our Ruby on Rails application, and the library that made it possible to “easily resize before serving” them.

Announcing our AWS CloudTrail Integration

One of the most common reasons for system failures is changes to the underlying infrastructure. AWS CloudTrail does a great job of recording when actions are taken, but a lot of organizations don’t take advantage of it. FireHydrant now includes this data, giving you visibility into changes to your infrastructure while you’re investigating an incident.

Dynamic Kubernetes Informers

In the past I’ve written about how to use informers in Kubernetes for particular resources, but what if you need to be able to receive events for any Kubernetes resource dynamically? Well, there’s a client-go package for that too. At FireHydrant, we recently updated our Kubernetes integration to watch changes for any resource you configure and I wanted to write down how we made it at a high level.

Announcing our Statuspage.io integration

Ever go to a status page and it says everything is operational when it definitely isn’t? You refresh maddeningly thinking it might be you. You ponder if the bill for the internet has been paid. Then, as a last resort, you check Twitter only to discover hundreds of people are experiencing the same problem. This is common, and because of it, we’re happy to release our integration with Statuspage.io!

3 Defensive Programming Techniques for Rails

Incidents happen all the time because of bad code deploys. You write some code that passes code review, it is then automatically shipped to production after a test suite passes, and BAM, an outage happens. There are ways to prevent this fairly common occurrence entirely. Using some simple ideas, we can defend ourselves from the hidden mistakes that code reviews and chaos engineering sometimes won’t catch.

Announcing Flare: Make opening incidents stress free

We’re launching a new feature today that allows anyone in your organization to kick off your incident response process with an appropriate severity level attached from Slack. Often people are afraid to open an incident or even share that they’re aware of something going wrong with your applications. When everything is important, nothing is important; users frequently overestimate the impact of an incident and assign an inappropriately high severity level.