Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

blameless

Using Automation and SLOs to Create Margin in your Systems

With the challenges we’re facing during this time, it can be difficult to keep up with the growing demand for our services. You need to use every tool in your toolbelt to conserve your team’s cognitive resources. Two ways to do this are automating toil out of your processes and prioritizing with SLOs.

onpage

NHS on Its Final Leg of Pager Replacement

If you’ve been following the U.K. healthcare landscape, you know that the country has long been considering replacing pagers. This may soon materialize, partly accelerated by the challenges that doctors are facing during the COVID-19 pandemic. The pager replacement initiative not only signifies a pivotal shift away from aging infrastructure, but also indicates how poorly pagers have held up in today’s unprecedented times.

derdack

Ready to move on and pick up speed again

We are going through an incredibly difficult time of uncertainty, lockdowns, cutbacks, and even fear. Taking this time to optimize and rethink the way we do business is essential in ensuring we get back on track and return even stronger than before. Most of us have been working from home for months now and, in some cases, there is no end in sight. How are you and your operations holding up? Are you able to work, maintain, and control your infrastructure?

pagerduty

Postmortems and More With J. Paul Reed

PagerDuty sat down with J. Paul Reed, a Senior Applied Resilience Engineer at Netflix, for an Ask Me Anything (AMA) to discuss best practices around postmortems. Reed is a prominent speaker and advocate of DevOps and operations complexity, and has over 15 years of experience in release engineering. His background in tech, along with his previous work at companies like Mozilla and VMware, gives him a unique perspective on the inner workings of innovative organizations.

blameless

How to Classify Incidents

Incident classification is a standardized way of organizing incidents into established categories. Incidents can include outages caused by errors in code, hardware failures, resource deficits, or anything else that disrupts normal operations. Each new incident should be assigned a category based on the areas of the service affected and a rank based on its severity. Each of these classifications should have an established response procedure associated with it.
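
As a rough illustration of the idea, here is a minimal Python sketch in which each classification pairs an affected service area with a severity rank and maps to a response procedure. The severity levels, category names, and procedure texts are all hypothetical placeholders, not taken from the article; a real team would substitute its own.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity ranking; a lower number means a more severe incident."""
    SEV1 = 1  # e.g. a full outage
    SEV2 = 2  # e.g. degraded service
    SEV3 = 3  # e.g. a minor issue


@dataclass(frozen=True)
class Classification:
    """A category (area of the service affected) plus a severity rank."""
    category: str
    severity: Severity


# Hypothetical lookup table: every classification has an established procedure.
RESPONSE_PROCEDURES = {
    Classification("database", Severity.SEV1): "Page the on-call database engineer",
    Classification("database", Severity.SEV3): "File a ticket for the next business day",
    Classification("frontend", Severity.SEV2): "Notify the web team channel",
}

# Fallback for incidents that do not match an established classification.
DEFAULT_PROCEDURE = "Triage manually and assign a classification"


def response_for(category: str, severity: Severity) -> str:
    """Return the response procedure for a classified incident."""
    return RESPONSE_PROCEDURES.get(Classification(category, severity), DEFAULT_PROCEDURE)
```

Calling `response_for("database", Severity.SEV1)` looks up the procedure for a severe database incident, while an unrecognized combination falls back to manual triage. Keeping the mapping in one table makes it easy to see which classifications still lack a procedure.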