
Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

Five Healthcare IT Trends to Watch in 2021

Healthcare information technology (healthcare IT) trends focus heavily on process improvements and clinical efficiencies. Providers can use advanced, emerging technologies to deliver quality care and overcome the challenges of today’s global health crisis. Trendspotting allows healthcare organizations to stay prepared for disruption and ensures they continue to innovate every year.

New Ops Guide: Best Practices for On-Call Teams

The always-on, always-available expectations of digital services have increased the pressure on technical teams to be ready to respond around the clock. For teams new to this concept, introducing on-call can be stressful and complex. On-call management is part of PagerDuty’s main platform and key to our business, but the non-technical aspects are also important for teams to consider.

Streamlining IT Operations with BigPanda and ServiceNow

Does the following sound familiar? You have a complex, hybrid and dynamic IT stack – with your cloud infrastructure changing by the minute and your container infrastructure changing by the second. Your monitoring and observability tools provide excellent visibility into your infrastructure, your applications and your services, but the dynamic environment in which they operate causes them to generate large volumes of heterogeneous machine data, with thousands of alerts a minute.

How to Improve Your Building Management System

A building management system (BMS) lets your business monitor and control mechanical and electrical equipment across one or more buildings. Heating, ventilation, and air conditioning (HVAC), security, and other systems linked to a BMS usually represent 70% of a building’s energy usage. Proper configuration of your BMS is therefore key: a poorly configured system can negatively impact your building’s efficiency, maintenance, security, and safety.

4 Tips on Preparing for a [Great] Failure

The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning. Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns.

What are MTTR, MTBF, MTTF, and MTTA? A guide to Incident Management metrics

In today’s fast-moving digital world, it has become critical for businesses to measure and track their service delivery performance, especially the incident management metrics that monitor system uptime, downtime due to outages, and how quickly and efficiently issues are resolved. Even a slight glitch in the system can disrupt business processes and cost millions of dollars.

Using BigPanda and ServiceNow to prevent and resolve outages

BigPanda augments ServiceNow and helps IT Ops teams work more efficiently in modern IT stacks, reducing MTTR by 40% or more. By using BigPanda and ServiceNow together, IT Ops teams get real-time service mapping for dynamic infrastructures, can reduce and automate ServiceNow ticketing, and can surface the root-cause changes affecting their continuous delivery.

Customer Devotion: How We're Bringing OneDuty to Life

It’s been almost a year since the world changed overnight and industries across the world quickly adapted to living, working, and learning fully virtually. While the world seemed to stop in an instant, many businesses saw an increase in demand and new challenges. PagerDuty was no different.

Communication Tool Down? Here are 3 Ways to Handle It

On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: failure is inevitable.