Working for VictorOps, and now Splunk, has allowed me to experience the on-call process from several distinct angles. And, working in a customer-facing role, I’ve witnessed the full spectrum of DevOps maturity – from downright DevOps mastery to the kinds of nightmare scenarios that haunt the dreams of on-call professionals.
Alert noise is a very common on-call complaint, and it leads to fatigue and burnout. This article is an attempt to help folks address the problem.
It sucks to be on-call when processes are not well defined and streamlined. Especially around the holidays. You really don't want to hear your phone repeatedly going off right when you're sitting down for Christmas dinner with your loved ones or getting to unwrap the good presents (the ones with the sparkly wrapping paper :P). Your on-call team’s stress levels reflect the health of your system, the cleanliness of your code and the culture of your organization.
At Sensu Summit 2019, Adam Westman, Sr. Engineering Manager at Target, introduced us to GoAlert, their open-source project for on-call scheduling and notification. In this post, I’ll recap his talk, sharing the journey that led them to build GoAlert, the problems they’ve solved, and how they use GoAlert with Sensu Go to simplify monitoring and reduce alert fatigue.
Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that can lead to serious consequences for the services and the teams if not avoided.
Hi there! I’m happy to announce a new major update of our In-Slack Incident Management App, Amixr.IO. It’s already live for all users and available in all subscription plans.
Alerting has come a long way from the days of paging an on-call administrator in the middle of the night, to multiple on-call teams that run and manage incident response around the clock. This is because, as organizations grow and scale, responding to incidents also gets more complex, and you often need more than one team involved to successfully resolve an incident.
Modern software production stops for no one, and everyone is needed to keep it rolling. Every dev is on-call. Great speed and friction produce a lot of heat, and when everything is on fire all the time, even the best devs and engineers struggle to keep the train speeding onwards without getting burned. What makes maintaining modern production so hard? And what is the difference between being good and being bad at dev-on-call? Let’s dive in and see.
In a world of highly-integrated systems, microservices, cloud infrastructure and constant development, DevOps and IT teams are tasked with finding better ways to keep up with their own processes. By actively testing throughout the development lifecycle and preparing for incident response, you’ll build more resilient services up front while simultaneously being prepared when things go south.