Incident Management

The latest news and information on Incident Management, On-Call, Incident Response, and related technologies.

Beyond the Headlines: The Unsung Art of Software Outage Management

Today, the entire world is feeling the pain of a major software outage. While we know a lot about these occurrences—our entire business is built on helping companies manage incidents and outages effectively—we’re not here to share our opinion on it. Instead, we’d like to help those unfamiliar with the incident lifecycle understand what happens when an outage like this occurs, who is responsible for what, and what companies ultimately do to get things working again.

Learning Moment: Effective Customer Communication During Incidents - Enhance Visibility & Response with Uptime.com

The recent global outage caused by an operating system update reminded me of how vulnerable we are today and, most importantly, how close we always are to global-scale incidents, given millions of interconnected dependencies. When the base of the house collapses, everything built on top is impacted. Those of us in IT Operations, Monitoring, Observability (insert the current acronym), etc., know this risk firsthand; we face it every day.

Time, timezones, and scheduling

Our On-call product has been in the wild for a few months now, and in this post I want to talk about building a time-sensitive system and what we did to handle some of the challenges. I’ll cover what our scheduler is responsible for, the basics of working with time, and how we tested our system.

What is ServiceOps?

Service operations (ServiceOps) is a technology-enabled approach that unifies IT operations and IT service (ITSM) teams and facilitates frictionless collaboration for more effective incident management. ServiceOps combines people, processes, and technology to improve visibility, workflows, and collaboration between otherwise siloed departments. Organizations of all sizes and industries worldwide have adopted ServiceOps.

The Impact of On-Call on Mental Health

Lately, I have been thinking about the mental health effects of working in the cybersecurity industry. In my research, I came across an episode of the Afternoon Cyber Tea podcast that sparked my interest. In it, host Ann Johnson and Dr. Ryan Louie, MD, PhD, dissect the parallels between those who work in cybersecurity and those who work in healthcare, and uncover how these types of jobs affect mental health.

Automating SLO Management: Boost Efficiency, Accuracy, and Reliability

82% of organizations plan to increase their use of Service Level Objectives (SLOs), with 95% reporting that SLO adoption drives better business decisions, according to the Nobl9 2023 State of SLOs report. The traditional manual management of SLOs often results in inefficiencies and human errors, hindering productivity. Automating SLO management transforms these processes, enhancing accuracy and operational efficiency.

The complexity of phone networks

Arguably the most important part of an on-call product is knowing that you will be notified when things break, wherever you are. When it comes to SMS and phone call notifications, we have to leave the familiar realm of the internet and JSON responses, and deal with systems that provide limited observability and insight into what’s gone wrong.