Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Learn how PagerDuty can help address critical work across all departments

PagerDuty’s Operations Cloud helps organizations with critical work across the entire business, from IT teams to customer service to human resources, marketing, sales, and more. With PagerDuty, organizations can prioritize accurately, respond efficiently, and reduce operational overhead. In this blog post, we’ll share examples of how PagerDuty can be used for critical work in all departments, not just IT, using our new Solution Guides for Business.

SRE and the Practice of Practice

Part of the trepidation of being on-call is encountering unfamiliar emergency scenarios where we are surprised by suddenly not knowing how to do our jobs. We feel lost and alone, complicated by the world around us, powerless to resolve or even mitigate the problem. On-call need not be a solo affair full of fear and anxiety. There are ways we can employ practice and open collaboration outside of incidents to prepare us better.

What the Ideal Incident Lifecycle Should Be

Today’s organizations are managing increasingly complex IT ecosystems and are pressured to deliver on innovation, all while trying to maintain service performance and reliability to keep up with the always-on digital economy. With IT complexity growing exponentially, incidents have become a common, if not day-to-day, struggle for many businesses. Incident management is the process or method that modern organizations use to prepare for and respond to service disruptions.

The Universal Language: Reliability for Non-Engineering Teams

We talk about reliability a lot in the context of software engineering. We ask questions about service availability, or how important it is for specific users. But when organizations face outages, it becomes immediately obvious that the reliability of an online service or application impacts the entire business, with significant costs. A mindset of putting reliability first is a business imperative that all teams should share.

Building an SRE Team with Specialization

As organizations progress in their reliability journey, they may build a dedicated team of site reliability engineers. This team can be structured in two major ways: a distributed model, where SREs are embedded in each project team, providing guidance and support for that team; and a centralized model, where one team provides infrastructure and processes for the entire organization.

The Human Side of Being On-call: 5 Lessons for Managing Stress, Anxiety, and Life While Being On-call

Within DevOps, we talk a lot about the on-call process—but what about the human side of being on-call? For example, what are effective ways of managing stress and anxiety during a shift? How can one manage life situations that make being on-call difficult—such as being responsible for watching the kids during an on-call rotation? And how can an empathic team culture help prevent burnout and turnover?

Fairwinds: Kubernetes Guardrails and Governance to Enable Developers and Reduce Risk

Customers of both PagerDuty and Fairwinds Insights can generate and customize PagerDuty incidents for critical issues in their Kubernetes clusters. This capability includes over 100 checks that have been built into Fairwinds Insights for things like container vulnerabilities, insecure workload configurations, runtime security events, and resource usage, as well as custom user-defined policies for compliance and internal requirements.

Stakeholder Notifications

With the AlertOps ServiceNow integration, you can automatically send updates to stakeholders. Set each update to use the notification channel you choose (email, voice, SMS, mobile app, or chat). Set triggers to send alerts on any condition, such as SLA breaches, status changes, or any custom field change. You can also send updates automatically at time points that you set. AlertOps logs all activities in ServiceNow so you can track everything in one place.
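
To make the idea of condition-based triggers concrete, here is a minimal sketch of how such rules could be evaluated. It is not the AlertOps or ServiceNow API: the `Ticket` fields, the `Trigger` class, the `TRIGGERS` list, and the `due_updates` function are all hypothetical names chosen for illustration, and the channel assignments are arbitrary.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical ticket snapshot. Field names are illustrative only and do not
# reflect the AlertOps or ServiceNow data model.
@dataclass
class Ticket:
    status: str
    minutes_open: int
    sla_minutes: int
    custom_fields: dict


# A trigger pairs a condition on (previous, current) ticket state with the
# notification channel to use when the condition holds.
@dataclass
class Trigger:
    name: str
    channel: str
    condition: Callable[[Ticket, Ticket], bool]


# Assumed example rules covering the three conditions named in the text.
TRIGGERS = [
    Trigger(
        "SLA breach",
        "sms",
        # Fires only when the ticket crosses its SLA limit between snapshots.
        lambda old, new: old.minutes_open <= old.sla_minutes
        and new.minutes_open > new.sla_minutes,
    ),
    Trigger("Status change", "email", lambda old, new: old.status != new.status),
    Trigger(
        "Custom field change",
        "chat",
        lambda old, new: old.custom_fields != new.custom_fields,
    ),
]


def due_updates(old: Ticket, new: Ticket) -> list[tuple[str, str]]:
    """Return (trigger name, channel) for every trigger whose condition holds."""
    return [(t.name, t.channel) for t in TRIGGERS if t.condition(old, new)]


if __name__ == "__main__":
    before = Ticket("open", 50, 60, {"priority": "P2"})
    after = Ticket("in_progress", 65, 60, {"priority": "P2"})
    for name, channel in due_updates(before, after):
        print(f"send stakeholder update via {channel}: {name}")
```

Comparing two snapshots of the same ticket keeps each rule a pure function, which makes it easy to add further conditions without touching the evaluation loop.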