Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

New IDC Study Highlights PagerDuty's Multi-Million Dollar ROI

Dependence on digital business skyrocketed in the last year, with customers expecting seamless, always-on access to applications and digital services from any device, anywhere. This trend has placed developer and IT teams under more pressure than ever before, not only to deliver these digital experiences but also to keep them up and running at all times.

What is IT Infrastructure Management (IM)?

Effective IT Infrastructure Management or “IM” is crucial. If your business prioritizes infrastructure management, it is well-equipped to keep its software applications and networks running at peak levels. Plus, your business can avoid downtime, outages, and other costly, time-intensive IT problems that put your operations and stakeholders at risk. How you manage your infrastructure can have far-reaching effects on all aspects of your business.

Actionable Insights - Faster Incident Resolution with Datadog and Moogsoft Observability Cloud

Context is king, they say, and anything you can do to improve context both makes decisions and assessments more reliable and speeds up the decision process. A new, bi-directional integration between Moogsoft Observability Cloud and Datadog does just that. Many SRE teams rely on Datadog to provide comprehensive information about their application stacks.

What is the Difference between SLAs and OLAs?

In traditional IT environments, services to customers are delivered and supported by the organization. A Service Level Agreement (SLA) is created with details such as what the availability of the service will be, how reliable the service will be, and what penalties can be charged in case of downtime. Internal teams such as the network administration team, the development team, and the IT service desk then draw up Operational Level Agreements (OLAs) to support the SLA.

"I'm Just Doing my Job," An SRE Myth

"Sorry, but I'm just doing my job." I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?

Stay Alert to Security With Xray and PagerDuty

When it comes to securing your software development against open source vulnerabilities, the earlier action occurs — by the right person — the safer you and your enterprise will be. Many IT departments rely on the PagerDuty incident response platform to improve visibility and agility across the organization.

Incident Communication Is a Key Part of Resolving Network Issues

You’ve just received a notification—a major network issue has occurred. Hoping it’s a false positive, you complete an initial triage. Dang it! It’s the real thing. If you’re like me, your mind likely turns to one thing: fixing the issue as fast as you can. But hold on! Before you turn completely to fixing it, there’s another important aspect to any incident that you can’t forget, and that’s incident communication.

Carrefour Bank Uses PagerDuty and Rundeck to Automatically Self-Heal Incidents

With the mission of transforming the customer experience for financial services, Carrefour Bank offers a wide portfolio of financial products created to meet and satisfy different customer needs. Learn how Carrefour Bank leverages PagerDuty and Rundeck to automatically self-heal incidents to keep customers happy and resolution times down.

PagerDuty's Ops Guides Get a Fresh New Look

The Community and Advocacy Team here at PagerDuty recently spruced up our library of ops guides, and we’re excited to share them with you. If you’re not familiar with the ops guides, they are an open-sourced collection of long-form documents that cover a variety of topics related to real-time operations and incident management. We’ve given them some spiffy new headers, cleaned up some sneaky errors, and added a new section titled “Next Steps.”