The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

UptimeRobot Alerts Spike 5x Due to Microsoft/CrowdStrike Global Issues

Jul 19, 2024 By Tomas Koprusak In Uptime Robot

Given recent global events, UptimeRobot is experiencing an increased number of downtime notifications. We are currently sending out five times more notifications than usual due to a widespread IT outage impacting several critical services worldwide. Here’s a brief overview of the situation and how it affects our monitoring services.

The IT Scramble is On with a Microsoft Outage: Incident MO821132 - July 18, 2024

Jul 19, 2024 By Sara Purdon In Martello Technologies

On July 18, 2024, at 6:38 pm ET, Vantage DX, Martello’s Microsoft 365 and Teams performance management solution, started to see indicators of a likely Microsoft outage impacting users’ ability to access various Microsoft 365 apps and services. Just over an hour later, at 7:41 pm ET, Microsoft issued a statement on X.

Global Microsoft Outage and Preventing Future Vulnerabilities

Jul 19, 2024 By Mishal Alam In uptime

In a recent unexpected turn of events, a faulty component in the latest CrowdStrike Falcon update led to widespread outages, crashing Windows systems globally. The repercussions were felt across various sectors, including airports, TV stations, hospitals, and even emergency services in the U.S. and Canada. The glitch, affecting both Windows workstations and servers, resulted in massive outages, bringing entire companies to a standstill and crashing fleets of hundreds of thousands of computers.

July 19th global IT outage reminds us of digital complexity

Jul 19, 2024 By Dritan Suljoti In Catchpoint

As we write, on Friday, July 19th, a massive global cyber outage is continuing to take down critical services around the world that depend on Microsoft-based computers.

Beyond the Headlines: The Unsung Art of Software Outage Management

Jul 19, 2024 By Robert Ross In FireHydrant

Today, the entire world is feeling the pain of a major software outage. While we know a lot about these occurrences—our entire business is built on helping companies manage incidents and outages effectively—we’re not here to share our opinion on it. Instead, we’d like to help those unfamiliar with the incident lifecycle understand what happens when an outage like this occurs, who is responsible for what, and what companies ultimately do to get things working again.

Learning Moment: Effective Customer Communication During Incidents - Enhance Visibility & Response with Uptime.com

Jul 19, 2024 By Jonathan Franconi In uptime

The recent global outage caused by an operating system update reminded me of how vulnerable we are today and, most importantly, how close we always are to global-scale incidents, given millions of interconnected dependencies. When the base of the house collapses, everything built on top is impacted. Those of us in IT Operations, Monitoring, Observability (insert the current acronym), etc., know this risk firsthand; we face it every day.

Integration Spotlight: PagerDuty and Robusta

Jul 19, 2024 By PagerDuty In PagerDuty

Bring powerful AI troubleshooting and cause analysis to your incident response with Robusta's integration with PagerDuty. Join us to learn more from CEO Natan Yellin on how your team can improve your k8s reliability.

Incident vs Problem: What's the Difference?

Jul 18, 2024 By Ekaterina Glozshtein In Alloy Software

For the rest of the world, these are just two synonyms. But in ITIL, the main IT service management framework, the distinction is crucial. Let’s find out.

Time, timezones, and scheduling

Jul 18, 2024 By Henry Course In Incident.io

Our On-call product has been in the wild for a few months now, and in this post I want to talk about building a time-sensitive system and what we did to handle some of the challenges. I’ll cover what our scheduler is responsible for, the basics of working with time, and talk a bit about how we tested our system.

What is ServiceOps?

Jul 17, 2024 By Sam Osborn In BigPanda

Service operations (ServiceOps) is a technology-enabled approach that unifies IT operations and IT service (ITSM) teams and facilitates frictionless collaboration for more effective incident management. ServiceOps combines people, processes, and technology to improve visibility, workflows, and collaboration between otherwise siloed departments. Organizations of all sizes and industries worldwide have adopted ServiceOps.
