4 New Ways to Improve Incident Management with Event Orchestration

In an era where efficiency and smart technology integration are key, 71% of technical leaders report their companies are expanding their investments in artificial intelligence (AI) and machine learning (ML) this year. With the sheer volume of data coming into the enterprise and the need for timely response, monitoring every incoming alert around the clock is impractical, and human vigilance alone is too imprecise.

Myth vs. Reality: Lessons in Reliability from the July 19 Outage

It was 3 a.m. at Newark Liberty International Airport. I was groggy, waiting in line to get my boarding pass, only to be met with a blue screen on the check-in kiosk. Needing some coffee, I learned the vendor was only accepting cash. There was clearly a big outage, and I quickly checked our systems at PagerDuty. Major outages happen multiple times per year, so frequently that we have an internal dashboard (colloquially referred to as “the internets are broken”).

Modernize your Operations Center and Build Operational Resilience with the Latest Features from PagerDuty

Global IT disruptions and outages are becoming the new normal, testing the operational resilience of businesses everywhere. How well prepared your team is to handle major incidents determines how fast the business can return to normal. Operations Centers are relied on to manage these disruptions and ensure quick recovery. They’re the point of entry for incoming data that holds important signals of impending failures, which affect customers, the business, and the bottom line.

Managing Vendor Incidents: Customer Impact That Isn't Your Fault

One of the first key tenets of cloud computing was that “you own your own availability”, the idea being that the public cloud providers were making infrastructure available to you, and your organization had to decide what to use and how to use it in order to meet your organization’s goals. The cloud providers have no knowledge of your applications or their KPIs.

PagerDuty Executive Spotlight Series: Vodafone

Vodafone is a Global 500 telecommunications company in Europe and Africa servicing over 320 million mobile customers across 21 markets. In this PagerDuty Executive Spotlight, we sat down with Ahmed Elsayed, UK CIO & Digital Engineering Director at Vodafone, to discuss his experience unifying a global engineering team to streamline the development and deployment of digital products and services to ensure an exceptional customer experience.

Balancing Centralization and Autonomy: The Key to Automation at Scale

The recent global outage reminds us that identifying issues and their impact radius is just the first part of a lengthy remediation process. Incidents are inevitable; how we prepare for and learn from them is what sets teams up to respond more effectively next time. As we saw from the remediation steps taken by enterprises around the world, implementing a known fix across a large number of environments, potentially managed by many distributed teams, can be a gargantuan challenge.

Are you Prepared for Your Next Major Outage?

Software is not perfect. And ultimately, it’s not a matter of if you will have an outage, but of when. With the increasing complexity and frequency of IT incidents, is your organization prepared to respond and recover when each second counts? Here at PagerDuty, we’ve compiled a list of best practices to keep your systems up and running.

Reducing Coordination Costs in Incident Response

Incidents can happen anywhere at any time. They can be small, well-defined, and easily contained. They can be large, messy, and complex, like the major outage we saw recently. Or they can be somewhere in between. When incidents occur, mobilizing and coordinating responders is crucial to restoring service, protecting the customer experience, and mitigating business risks.