
FireHydrant

LaunchDarkly Improves Incident Response with FireHydrant

Headquartered in Oakland, California, LaunchDarkly is a feature management platform that empowers all teams to safely deliver and control software through feature flags. By separating code deployments from feature releases, LaunchDarkly enables teams to deploy faster, reduce risk, and iterate continuously. Over 1,000 organizations use LaunchDarkly to build, operate, and learn from their software.

Sticking to Your SLAs with FireHydrant Runbooks

In today’s world, systems are becoming increasingly complex. Because of this complexity, it’s no longer a matter of “if” our systems will fail but “when”. To manage expectations for when our systems do fail, we need look no further than our Service Level Agreement.

Oncall and COVID-19 Survey Results

One of my concerns as COVID-19 took hold in the US was how it would affect tech teams that are oncall. It can be extremely challenging to be oncall during a “normal” time, and this has been anything but normal. So I decided to create a survey to learn more about people’s experiences. The survey was conducted from April 8 to April 27, 2020, via a Google Form. It was anonymous and had 141 respondents.

Announcing Our Series A

It was Friday at about quitting time, and my plans for the evening involved a great cocktail, hanging out with friends, and maybe continuing to binge The Office. Sadly, there was a problem. Our alerting system detected an enormous and immediate spike in errors. The error description was along the lines of “table ‘servers’ does not exist” and thousands of customers couldn’t use a large cloud provider’s services.

Failover Conf Wrapup

Failover Conf was held on April 21, 2020, online. The folks at Gremlin came up with the idea of a virtual conference about reliability after many in-person conferences started being postponed or canceled due to COVID-19. The conference was a lot of fun to attend. I’ll be sharing some of my thoughts on the event and the talks I was able to catch. The videos for the talks haven’t been posted yet, but I’ll update this post with links to them when they are.

Advice for On-call Teams During COVID-19

I’ve offered up some tips for folks who are oncall during the COVID-19 crisis, but I thought it would be helpful to get some more ideas from people with different perspectives. So I reached out to some people I trust to see what they had to say. They all have different viewpoints, but some themes emerge, like managing alerts, having empathy, and practicing self-care. The participants, in alphabetical order: Aaron Aldrich is a Developer Advocate at LaunchDarkly, with a focus on DevOps.

Q&A with Alex Hidalgo on SLOs

Alex Hidalgo is a Site Reliability Engineer at Squarespace, and he’s currently writing a book called Implementing Service Level Objectives for O’Reilly Media. The first three chapters of the book are available now through O’Reilly’s early access program. I had a chance to read those chapters and ask Alex some questions about service level objectives and reliability. Thanks, Alex, for sharing your knowledge.