
The latest news and information on incident management, on-call, incident response, and related technologies.

Incident severity and priority 101

Severity and priority can be challenging for a company to nail down. When an incident is declared, it's essential to have a system that defines its impact and how urgently it should be handled. Incident severity and priority are the two knobs teams can use to define scope and urgency and, ultimately, to choose the appropriate response process. But how should we define them, and what are the differences?

Sponsored Post

What Is a DevOps Toolchain and How Does It Work?

Picture yourself trying to resolve a code error when you notice an additional issue outside your realm of expertise that's making matters worse. Your instinct is to get in touch with the right contact as quickly as possible to resolve the issue so that there's no further impact on the system's uptime. But what if you can't get in touch with them immediately, or don't know who to contact? Instead of trying to solve the problem without support, a DevOps toolchain could have mitigated this chain reaction from the start.

Major IT Outage 2021 Recap

In 2021 we saw that no one is immune from major IT outages, not even giants like Google, Facebook, and Amazon Web Services (AWS). The following is a recap of some of the major IT outages with widespread impact in 2021. AWS’s historic outage occurred on December 7, 2021 and lasted roughly six and a half hours. Given Amazon's breadth and reach, the outage brought its warehouse and delivery operations to a stop, among other effects.

Slack outage

Slack, a popular enterprise communications platform, faced a roughly 5-hour system outage yesterday, February 22, 2022, from 9:25 AM to 2:24 PM EST. Affected Slack services included messaging, search, link previews, apps/integrations/APIs, posts/files, workspace/org administration, login/SSO, notifications, connections, and calls. AlertOps was NOT affected by this outage.

Cloud Incident Management Guide

Companies looking to grow in the digital age can support that goal by adopting the cloud. When pursued with the right intent and implementation strategy, cloud adoption acts as a powerful force multiplier, giving businesses a modern IT foundation and helping them grow and innovate at an accelerated pace. Organizations that adopt a cloud-first strategy must also safeguard themselves from critical, service-disrupting incidents.

PagerDuty Receives Financial Services Competency From AWS

We are excited to announce that PagerDuty is now an approved AWS Financial Services Competency Partner. We’re looking forward to expanding our global reach and helping financial services organizations accelerate their cloud migration and digital acceleration journeys. This will allow us to further streamline and automate financial services companies’ digital operations while helping them reduce risk and manage compliance requirements.

Episode 3: Mooving to... Stability: The Role of Catastrophic Failure in Software Design

In this episode of Mooving to… Stability: The Role of Catastrophic Failure in Software Design, we had the opportunity to chat with Jeff Atwood (yes, that Jeff Atwood) of Coding Horror, Stack Overflow, and Discourse (Chief Happiness Officer). Jeff started out writing 911 software in Boulder, Colorado for a small company, which was a crash course in writing software that has real consequences. With this unique and deep perspective, B.J.