
Latest Posts

Exploring distributed vs centralized incident command models

Recently in our Better Incidents Slack channel, there’s been some chatter about how people structure dedicated incident commanders at their companies: distributed or centralized. The way I see it, there are two models. The first is the temporary, distributed role: a hat that an on-call engineer or an engineering manager puts on during an incident. The second is the centralized, full-time role, where someone is the designated incident commander (or one of a few) for all incidents.

Custom fields: make FireHydrant your personalized incident management platform

Today we're releasing custom fields, a new feature that lets you tailor FireHydrant to your organization's specific needs and capture essential incident details. Custom fields help you track critical states, involved parties, resolution specifics, affected services, messages, and more (almost anything you want), all aligned with your unique workflows. Whatever the size of your team or the maturity of your processes, custom fields adapt to fit.

210% ROI: unlocking the economic value of FireHydrant for incident management

In the fast-paced high-tech industry, efficient incident management is a critical factor in maintaining brand reputation, employee morale, and most importantly, your bottom line. Good practices can result in reduced downtime, increased learning opportunities from incidents, and an enhanced reputation among both the engineering community and customers. But quantifying the true cost of incidents has always been a challenge — until now.

Align platform and product engineering teams over incidents

I firmly believe in never letting a good incident go to waste. Incidents expose weak spots and create opportunities for medium- and long-term investments. By analyzing incidents and understanding their root causes, organizations can identify areas that require additional resources or enhancements. When incidents are used to align your platform and product engineering teams, they open up opportunities to enhance the performance and security of your product.

Incident severity: why you need it and how to ensure it's set

Defined severity levels quickly get responders and stakeholders on the same page on the impact of the incident, and they set expectations for the level of response effort — both of which help you fix the problem faster. But sometimes, for whatever reason, a severity level just doesn’t get set. Maybe there’s confusion around what severity level to use. Or maybe you have a low barrier to declaration and your responders just need a little nudge.

Upgraded role-based access control brings more visibility (and control) to incident management at your organization

We’ve long believed that incidents (and technical team cultures) improve when more people are empowered to declare, follow, and contribute to their resolution. But not everyone in an organization needs to be able to manage the processes, rules, and settings a company enforces for their incident programs.

FireHydrant Private Incidents & Runbooks: more control for you, more security for your customers

Ensuring the privacy and security of sensitive information is crucial no matter your company's size or industry. So when an incident comes up that involves sensitive information — Personally Identifiable Information (PII), financial data, accidental data breaches, or legal matters requiring privileged communication — your response process might need a higher level of security and discretion.

The "people problem" of incident management

Managing incidents is already tricky enough, and you want to get to mitigation as quickly as possible. But sometimes it feels like organizing everything surrounding an incident is harder than solving the actual technical problem, and you end up delayed or sidetracked during mitigation efforts. We call that scenario the “people problem” of incident management.

New related incidents functionality brings order to the chaos of highly complex incidents

We’ve all been there. You’re working through some rather frustrating blockers during an incident only to discover that you don’t own the dependency at fault. Or, you’ve been pounding away at an issue when a fellow engineer reaches out and asks if your service is affected by some particularly gnarly database failure they’re seeing. But then what? Do you merge efforts and work in parallel or head for a coffee break while the issue gets attacked upstream?

Using PostgreSQL advisory locks to avoid race conditions

The first moments of incident response can be among the most crucial, which can also make them among the most stressful. There are many ways to ensure incidents are kicked off smoothly, but a recent focus of ours was ensuring they could be kicked off quickly. After all, the faster you're able to start mitigating your incident, the more successful you'll be!
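As a rough sketch of the technique named in the title (not necessarily FireHydrant's actual implementation), a PostgreSQL advisory lock can serialize the "create incident" step so that two concurrent declarations for the same alert don't race each other. The key string and table below are hypothetical:

```sql
-- Derive a bigint lock key from an application-level identifier.
-- pg_try_advisory_lock returns immediately with true/false rather
-- than blocking, so a losing session can bail out gracefully.
SELECT pg_try_advisory_lock(hashtext('incident:alert-1234'));

-- If the call returned true, this session holds the lock: it is now
-- safe to check for an existing incident row and insert one if absent,
-- without a second session doing the same thing concurrently.

-- Session-level advisory locks must be released explicitly.
SELECT pg_advisory_unlock(hashtext('incident:alert-1234'));
```

Using the transaction-scoped variant, `pg_advisory_xact_lock`, avoids the explicit unlock: the lock is released automatically at commit or rollback, which is handy when the guarded work already runs inside a single transaction.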