Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

A Beginner's Guide To Service Discovery in Prometheus

Service discovery (SD) is a mechanism by which the Prometheus monitoring tool can discover monitorable targets automatically. Instead of listing down each and every target to be scraped in the Prometheus configuration, service discovery acts as a source of targets that Prometheus can query at runtime. Service discovery becomes crucial when there are dynamically changing hosts, especially in microservices architectures and environments like Kubernetes.

Top 5 outages detected by StatusGator in October 2024

StatusGator’s Early Warning Signals alerted customers to several notable service outages in October 2024. With advanced warning, our users could take proactive measures, minimizing the impact of downtime on their businesses. Here’s a summary of how our detection gave customers an edge over service disruptions, often notifying hours or minutes before the provider even acknowledged the issue.

Incident Response Automation: How It Works & Why It Speeds Up Resolutions

The speed at which you respond to incidents can make or break user satisfaction, team morale, and business continuity. Whether it’s a server crash, a security breach, or a software bug affecting users, rapid and efficient incident management is key to maintaining a strong reputation and minimizing operational downtime. And while traditional manual responses have worked in the past, automated incident response is now paving the way for faster, smarter, and more efficient handling of these issues.

Demo Roundups! Automation Standardization (Workflows)

Join PagerDuty’s Solutions Consultants Bobby Zimmerman and Justyn Roberts to discover how combining technical automation with human-driven processes can reduce manual interventions, streamline repetitive tasks, and increase operational efficiency. Level up your digital operations expertise with PagerDuty Demo Roundups — a series of live, interactive webinars where you can deepen your knowledge in the Operations Cloud and see how PagerDuty can work for you. Each 1-hour session presents a hands-on demo that showcases PagerDuty’s capabilities in real-time followed by Q&A.

Site Reliability Engineer's Guide to Black Friday

It’s gotten to the point where Black Friday reliability prep has to start on…well Black Friday. This year, 32% of consumers in the US claimed that they were going to start their holiday shopping in July-October. Plus, Black Friday isn’t the only day eCommerce businesses have to worry about, now we have Cyber Monday, Travel Tuesday, and the thousands of Prime Days from Amazon.

Lessons from 4 years of weekly changelogs

Writing a meaningful update for customers every week has been held sacred at incident.io since we started the company. We've written over 200 of them in the past 4 years, and we recently celebrated going 2 years straight without missing a single a single week The numbers themselves are not the goal, but the consistency of this habit and what it represents for our customers and our team is very real, and special to me.