PagerDuty

5 Best Practices for Resolving Errors Quickly

I love writing software, but I hate dealing with bugs. They take you away from what you want to be doing and often lead you into a rabbit hole. At Sentry—an open-source error tracking platform that provides complete app logic, deep context, and visibility across the entire stack in real time—we have a few tips that we’ve honed over time to make error resolution painless (ok, less painful), including an official integration with PagerDuty.

Incidents as we Imagine Them Versus How They Actually Are with John Allspaw

There is a tendency to imagine (or remember!) incidents as unfolding far more neatly and in a more orderly fashion than they actually do. Events can leave some engineers scratching their heads about what is happening, while their teammates are instead confused about how it's happening.

Real-Time Operations Maturity: How Businesses Can Thrive in the Digital Era

It’s rare to find a business today that doesn’t rely on digital technologies and services. Retail is one example: Whether customers are buying online or in store, completing a transaction requires a website or point-of-sale system. The entire supply chain relies on IT services to deliver goods on time and to the right locations. And, as at any company today, every department, from development and marketing to HR and business services, has a critical tech stack.

Using Real-Time Operations to Save Lives

Voices wield power. Staying silent is not an option. We must speak up and honor those who do. October is National Domestic Violence Awareness Month, when communities come together to support victims and survivors of domestic abuse across the world. Earlier this month, SisterDuty, one of PagerDuty’s Employee Resource Groups (ERGs), led a campaign to build toiletry kits and raise funds to benefit Casa de las Madres, which offers shelter and support to those at risk of abuse.

Monitoring that Monitors the Monitors of the Monitors

One way to break the cycle of alert fatigue is by improving the quality of the signals you monitor. That can mean ingesting and processing monitoring data at a higher resolution, using smarter statistical methods to aggregate and correlate data across multiple services, or routing alerts through an escalation and incident management system.
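
As a rough illustration of the "smarter statistical methods" and escalation routing mentioned above, here is a minimal Python sketch. It is not taken from the article or from any PagerDuty API: the function names, the z-score threshold of 3.0, and the list-based escalation policy are all hypothetical choices made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Return True when `latest` deviates from the baseline in `history`
    by more than `z_threshold` sample standard deviations.

    Both the z-score approach and the default threshold are assumptions
    for this sketch, not a recommendation from the source article.
    """
    if len(history) < 2:
        return False  # not enough data points to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


def route_alert(history, latest, escalation_policy):
    """Return the first responder in a (hypothetical) escalation policy
    when the latest sample is anomalous, otherwise None (no page sent)."""
    if not escalation_policy or not is_anomalous(history, latest):
        return None
    return escalation_policy[0]


# Example: errors per minute for one service.
baseline = [10, 11, 9, 10, 10]
policy = ["primary-oncall", "secondary-oncall"]
quiet = route_alert(baseline, 12, policy)   # within normal variation
noisy = route_alert(baseline, 50, policy)   # far outside the baseline
```

The point of the sketch is that a small fluctuation never reaches a human, while a large deviation is handed to the first person on the escalation list; a real system would also handle acknowledgement timeouts and escalation to later responders.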

This IS NOT Fine: Putting Out (Code) Fires

So the dumpster is on fire. Again. The site’s down. Your boss’s face is an ever-deepening purple. And you begin debating whether you should join the #incident channel or call an ambulance to deal with his impending stroke. Firefighters have clear procedures and a strong hierarchy. The first truck at a scene immediately begins assessing the situation.