The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Microsoft Entra ID Outage: How Vantage DX Detected the Issue Before Microsoft Acknowledged It

On February 25, 2025, at 11:32 AM EST, Martello’s Vantage DX monitoring began alerting on an issue affecting Microsoft Entra ID (Azure AD SSO). Microsoft had not yet acknowledged the incident, but users on Reddit had already noted the issue, and Vantage DX proactive monitoring detected disruptions impacting authentication across multiple workloads. Vantage DX raised critical warnings for Exchange, OneDrive, and SharePoint.

Operational excellence in the age of AI and Automation

The future of operations is here with PagerDuty's groundbreaking AI and automation innovations. Learn how PagerDuty AI agents, powered by PagerDuty Advance, and new use cases like security incident management and LLMOps can help your organization achieve operational excellence to reduce cost, mitigate the risk of outages, and accelerate innovation.

February 2025 Box Outage: Timeline and Post-Mortem

Box.com is a cloud-based content management and file-sharing platform designed for the enterprise and used by nearly 100,000 companies around the world. When a Box outage strikes, businesses can experience costly disruptions. On February 19, 2025, a disruption in core Box services, including uploads, downloads, and the All Files page, affected thousands of users who depend on the cloud storage and collaboration platform.

Feature Spotlight - Post-Incident Reports

The Post-Incident Report builder is available to Advanced plan customers to help document the incident post-mortem process. This allows users to share key information and understanding about why an incident occurred, how resolvers responded, and what preventive actions can be taken to ensure it doesn't happen again. After creating a Post-Incident Report, you can share it with other colleagues or stakeholders to keep them informed about the steps you’re taking to mitigate and prevent potential recurrences.

How to connect Google Calendar events and Slack

Managing Google Calendar events within Slack has never been easier! Pagerly’s Slack integration is the ultimate solution for teams looking to streamline their event management, on-call scheduling, and team communication—all without leaving Slack. Whether you need event reminders, real-time Slack status updates, or automated Slack notifications about important events, Pagerly ensures your team stays informed and organized.

New Integration: ilert + RapidSpike for Proactive Website Monitoring

We are pleased to announce a new inbound integration in the ilert catalog: RapidSpike. This integration enhances incident management by connecting ilert with RapidSpike’s website monitoring capabilities, ensuring teams receive real-time alerts on website performance, uptime, and security threats.

Runbook Automation and Rundeck v5.9 Release Notes

Product Manager Forrest Evans takes us through the new features in Runbook Automation v5.9, including a demo of incorporating Azure Key Vault in your automation jobs. For a full listing of the release notes, see the release notes page. Learn more about automation solutions, including new components to support your FinOps needs on the solutions page.

OnPage Wins Spot on G2's Best Healthcare Software 2025 (Announcement Video)

We’re excited to share that OnPage has been recognized in G2’s 2025 Best Healthcare Software list. This recognition is driven by real customer reviews from healthcare teams who rely on OnPage to streamline communication, improve response times, and enhance patient care. In this video, our Head of Marketing, Ritika Bramhe, shares the big news and reads some of our favorite customer reviews that made this achievement possible.

Streamline IT incident response with the latest BigPanda features

The volume of machine-generated data has outgrown what human teams can process, straining L1 Ops and Service Desk resources. Data fragmented across tools, teams, and silos hinders situational awareness, delaying every action from detection to remediation and making prevention increasingly unattainable. The latest BigPanda updates enhance ITOps and ITSM team efficiency throughout the incident lifecycle.