The one where we scaled

From 3 people in 2020 to 93 in 2025, incident.io has come a long way, and we’re just getting started. Whether you’ve been here since the early days or just joined, this is what it looks like to build something great *together*. If you're after great people, real impact (across the globe, not just in Greece), a place where growth is the default, and teammates who’ll always be there for you... we’re hiring! (And we're going to need a bigger couch…)

We Built an SRE Agent With Memory And It's Transforming Incident Response

If you feel like your incidents are multiplying while your stack gets more complex by the week, you’re not alone. Event volumes keep climbing, signals live in a dozen tools, and human responders are stretched thin. That’s exactly why we built the PagerDuty SRE Agent—a vendor‑agnostic AI teammate that improves with every response to make the next one faster, smarter, and more reliable.

Too Late to Learn: Why Security Post-Mortems Fail and How AI Can Help

An effective post-mortem can turn a security breach into a blueprint for lasting resilience. But too often, in the stress of an incident, documenting what happened takes a back seat to containment and recovery. The resulting analysis relies heavily on memory, scattered notes, and competing narratives. Valuable context gets lost, timelines blur, and lessons that could strengthen defenses never become institutional knowledge.

How agentic ITOps helps ensure resilient IT infrastructures

Infrastructure resilience is essential for any modern IT environment. Downtime is expensive. Beyond the stresses of day-to-day operations, you want to be confident that your IT systems will continue functioning during service disruptions, hardware failures, or natural disasters. Agentic ITOps can help ensure a reliable, resilient IT infrastructure environment. These systems use agentic AI to help IT teams minimize downtime, improve customer trust, and protect your business’s revenue and reputation.

Jira Service Management (JSM) Review for Alerting (2025)

Atlassian is shutting down OpsGenie. New sales stopped on June 4, 2025, and the platform will be completely offline by April 5, 2027. As an OpsGenie user, you now face a critical decision: Migrate to Jira Service Management (JSM), Atlassian’s recommended path, or choose a different solution. And if you’re not sure JSM is the right fit for your team’s alerting needs, this review will help you decide. I signed up for JSM and put it through real-world testing.

Product Update - Turn Off Alerts, Use Microsoft Teams, and Custom Domains

Over the last few months, IncidentHub has added several new features to make it easier to fine-tune your alerts. IncidentHub now also integrates with Microsoft Teams and supports custom domains for your public status pages. Let's take a comprehensive look at what's new.

SLA, SLO, and SLI: Understanding the Foundations of Service Reliability

Last week, I ordered a pizza on a food delivery app, and they promised delivery in 30 minutes. Similarly, all digital services (apps, websites, cloud platforms, and so on) make promises about speed, uptime, and reliability. The difference is how they track and measure those promises. That’s where SLA, SLO, and SLI come in. These three terms define what “reliable” actually means. They turn a vague claim like “99.9% uptime” into something you can measure, track, and act on.
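
To make the "99.9% uptime" idea concrete, here is a minimal Python sketch of how an availability SLI can be compared against an SLO target. The function names, the 30-day window, and the request counts are illustrative assumptions, not taken from the linked article.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_requests / total_requests


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime an SLO target tolerates over a window, in minutes."""
    return (1.0 - slo_target) * window_days * 24 * 60


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget left: 1.0 is untouched, 0.0 is spent,
    and a negative value means the SLO has been missed."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


if __name__ == "__main__":
    slo = 0.999  # hypothetical SLO target of 99.9%
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(f"SLI: {sli:.4%}")
    print(f"Allowed downtime per 30 days: {allowed_downtime_minutes(slo):.1f} min")
    print(f"Error budget remaining: {error_budget_remaining(sli, slo):.0%}")
```

In this framing the SLI is what you measure, the SLO is the internal target you compare it against, and the SLA is the external commitment (often with penalties) that is usually set looser than the SLO.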

The Silent Failure: When Monitoring Doesn't Wake the Right People

At 2:07 a.m., one of the core production nodes went down. CPU usage spiked, latency shot through the roof, and requests began timing out across the cluster. Monitoring tools lit up instantly. Datadog dashboards turned red, Prometheus fired alerts, and a webhook pushed incident payloads into Jira. Everything worked exactly as designed. Except no one responded. The alert chain fired flawlessly through machines, but the right human never saw it because it was sent via an automated phone call.

PagerDuty Incident Responder custom agent for GitHub is now Generally Available!

This custom agent in GitHub’s AI ecosystem gives users access to PagerDuty data (including change correlation, incident data, and more) directly in GitHub Copilot, saving time from context switching for faster resolution. The agent can help users analyze incident context, identify recent code changes, and suggest fixes via GitHub PRs. Learn more about PagerDuty’s MCP capabilities for GitHub and other tools here.

Bring incident response to AI stack with ilert's MCP Server

ilert’s engineering team has developed an open Model Context Protocol (MCP) server that enables AI assistants to securely interact with your alerting and incident management workflows, from determining who is on call to creating incidents. In this article, we provide a simple explanation of MCP, outline the reasons behind our investment in it, describe the high-level architecture, and explain how to connect Claude, Cursor, and other MCP clients to ilert today.

Integration & Data Ingestion: Strengthening AIOps Observability

Large enterprises face the challenge of managing high-volume, very diverse data streams that span both legacy and modern, digital systems and applications. To gain timely, accurate insight across this kind of complexity, IT teams need observability platforms that can do more than just monitor - they must also unify, contextualize and enrich data so teams can act effectively to protect the availability of the services their customers rely on.

Disaster Recovery: Everything You Need to Know

With increasing cyberattacks and cloud outages, maintaining system resilience is critical. A robust Disaster Recovery (DR) strategy enables teams to prepare for unexpected events. It makes sure they can recover critical systems and data with minimal disruption. This blog will cover what disaster recovery is, why it matters, and the key components of an effective Disaster Recovery Plan. We’ll also walk through the steps for creating your own strategy.

Top tips for smoother IT incident management

Top tips is a weekly column where we highlight what’s trending in the tech world and share ways to stay ahead. This week, we’re talking about something every IT team knows too well—incidents. Whether it’s a sudden server crash, a network outage, or a system slowdown right before an important client call, incidents always seem to strike at the worst possible time. No matter how strong your IT setup is, issues are bound to happen.

Your Next Incident Has Already Started. You Just Haven't Noticed Yet.

The best way to minimize the impact of an incident is to catch it early, before small issues snowball into major disruptions. That requires maintaining healthy systems and ensuring sufficient resources are available when problems arise. But developers and IT operations pros working in large enterprises face a challenge: Complex systems operate in an inherently degraded state. In his essay “How Complex Systems Fail,” Dr.

Your Top Engineers Should Be More than Expensive Button-Pushers

The engineer you pay $200,000 a year just spent an hour copy-pasting data between dashboards. Again. Software engineers have critical skills that are in the highest demand. And yet, many world-class engineers are currently spending too much of their time clearing tickets, routing alerts, and responding to the same types of incidents over and over again. This operational toil is costing you.

DNS Outages Expose Hidden Risks. Edwin AI Finds Them Faster.

The recent AWS outage exposed how fragile the internet remains. Amazon traced the hours-long disruption to a DNS error—a small failure with massive reach. For most organizations, DNS operates quietly in the background. When it fails, every digital service connected to it stops. One of LogicMonitor’s valued customers, IG Group, faced a similar event less than ten hours after enabling Edwin AI.

Demo Roundups! What's New in Schedules: Flexible Shifts + AI Conflict Resolution

Manual scheduling and on-call gaps cost your team sleep and sanity. Join us for a demo of PagerDuty's latest schedule experience improvements. From iCal-compatible shift management to AI-powered conflict resolution, see firsthand how to build bulletproof on-call coverage with minimal operational overhead.

What Is Business Continuity?

A single outage can stop operations, affect customers, and damage trust. In a world of pandemics, cyberattacks, weather events, and supply chain delays, your team cannot simply hope that nothing breaks. Business continuity helps your team stay ready, recover faster, and keep downtime low. In this blog, we’ll explain what business continuity means, how to create a solid business continuity plan, and which approaches help teams stay operational during a disruption.

What Is Incident Response Lifecycle?

The Incident Response Lifecycle is a step-by-step process that helps engineering teams detect, respond to, and recover from unexpected system disruptions or outages. It includes a series of six practical stages: Detection, Analysis, Impact Mitigation, Incident Resolution, Service Restoration, and Post-Incident Analysis. By following this lifecycle, teams can minimize downtime, reduce business impact, and continuously strengthen system reliability.

How to manage ilert call flows via Terraform

Call flows let you design voice workflows with nodes like “Audio message,” “Support hours,” “Voicemail,” “Route call,” and much more. The ilert Terraform provider now includes an ilert_call_flow resource so you can version and promote these flows across environments. This blog post offers an overview of managing call flows in Terraform, detailing the benefits and key scenarios.

How to create Rotation or Shift Schedule on Calendar easily

Managing shift schedules, rotation cycles, and different types of shifts has always been one of the trickiest parts of workforce management. Whether you’re coordinating day shifts, night shifts, evening shifts, or split shifts, keeping track of employee availability, selected days, and total hours across multiple teams is a challenge.

Meeting Developers Where They Work: PagerDuty + Spotify Portal for Backstage

From the beginning, PagerDuty has been built by developers, for developers. Our mission has always been to help development teams build faster and resolve incidents more efficiently by meeting them where they work. Building on PagerDuty’s existing plugin for Spotify for Backstage, we are thrilled to announce the PagerDuty plugin for Spotify Portal for Backstage to continue bringing enterprise-grade incident management into even more developer workflows.

Best MSP Tools of 2025

Managed service providers (MSPs) are strong multitaskers, handling monitoring, documentation, security, infrastructure maintenance, support, and more for each of their clients. The need for a strong set of MSP tools is clear. Clients now expect swift response and seamless service delivery no matter the time of day, so MSPs must invest in a toolkit that enables them to deliver high-quality service 24/7.

Service disruption on October 20, 2025

When the internet goes down, our primary job is to help everyone get back up, as fast as possible. Of the almost half a million incidents we've helped our customers solve, there are some which stand out for both their scale and impact. One of these happened on Monday, October 20, when AWS had a widely covered major outage in their us-east-1 region, from 07:11 to 10:53 UTC. We’re hosted in multiple regions of Google Cloud and so the majority of our product was unaffected by the outage.

How Do I Route Alerts by Location to the Right On-Call Team?

When your company has multiple offices or operational sites – whether that’s across the U.S. or around the world – getting alerts to the right team isn’t as easy as just checking who’s on duty. Events can come from a wide range of sources tied to different physical locations, time zones, or even separate departments, and not every alert is meant for every team. Let’s say your company has operations in New York, Dallas, and San Francisco.
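
As a rough illustration of the idea, location-based routing can be sketched as a lookup from an alert's location field to an on-call team, with a fallback for anything unmatched. The team names, the `location` key, and the default team below are hypothetical and do not reflect any specific product's configuration.

```python
# Hypothetical mapping from site to on-call team.
ROUTES = {
    "new york": "nyc-oncall",
    "dallas": "dallas-oncall",
    "san francisco": "sf-oncall",
}

# Assumed catch-all so alerts with an unknown or missing location are not dropped.
DEFAULT_TEAM = "global-oncall"


def route_alert(alert: dict) -> str:
    """Return the on-call team for an alert based on its location field."""
    location = str(alert.get("location", "")).strip().lower()
    return ROUTES.get(location, DEFAULT_TEAM)


if __name__ == "__main__":
    print(route_alert({"summary": "UPS on battery", "location": "Dallas"}))
    print(route_alert({"summary": "Disk full"}))
```

Normalizing the location string before the lookup keeps routing robust when different event sources format site names inconsistently.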

When IT Alerts Go Bump in the Night: A Halloween Tale of IT Alerting with SIGNL4

As the witching hour approaches, your data center hums quietly – servers glowing like jack-o’-lanterns in the dark. Everything seems calm… until suddenly, your phone lights up with a chilling alert. CPU usage is spiking. Network latency is haunting your system. The ghost of downtime lurks nearby. Welcome to the spooky world of IT alerting – where nightmares come true if your team isn’t ready.

Detect and map third-party outages with Datadog External Provider Status

Modern applications depend on dozens of external cloud platforms, APIs, and SaaS services to function. But when those providers experience issues, engineers often spend valuable time asking a basic question: Is the problem with us or with them? Provider-maintained status pages are often slow to update, leaving teams waiting for confirmation while incidents escalate. This delay wastes valuable time, prolongs investigations, and risks customer trust.

The Hidden Risk of DNS - Lessons from the AWS Outage & Why You Need DNS Spy Monitoring NOW

On October 20, 2025, much of the internet came to a halt. Apps wouldn’t load. Payments failed. Cloud dashboards went dark. From Fortnite to Alexa, Snapchat, and countless business platforms, users across the world were suddenly offline — all because DNS broke inside Amazon Web Services’ (AWS) US-East-1 region.

Amazon Isn't Eating Its Own DNS Dog Food

On October 19-20, 2025, Amazon Web Services (AWS) experienced a significant outage (AWS status) affecting its US-EAST-1 region in northern Virginia. The root cause was DNS resolution failures for DynamoDB’s API endpoints, which cascaded across AWS’s interconnected services, disrupting major platforms including Snapchat, McDonald’s, Disney+, Roblox, Coinbase, Reddit, and Amazon’s own services.

PagerDuty Joins AWS QuickSuite: Connect Your Incident Management with 1,000+ Applications

Today, we’re announcing that PagerDuty is now available in AWS QuickSuite through the Model Context Protocol (MCP). This means PagerDuty’s incident management capabilities can now connect with the 1,000+ applications and data sources that QuickSuite integrates with, from AWS services to enterprise SaaS platforms, all accessible through natural language.

AWS Outage: How do you prepare for the failure of your own safety net?

When AWS’s massive outage struck, it didn’t just take down cloud services, apps, and enterprise platforms. It also knocked out many of the monitoring systems organizations depend on for real-time answers. Observability companies, including Datadog, New Relic, Checkly, Dynatrace, SpeedCurve, and Splunk Observability, lost visibility or functionality precisely when organizations needed them most.

A Launch Day in the Life with AI Teammates

Alex, an SRE at Greenagonia, starts the day knowing there’s a big launch coming. Pre-orders suggest a 5-10x increase in normal traffic, which means coffee needs to be extra strong this morning. As Alex scans through overnight alerts, he realizes he’s completely forgotten about a dentist appointment that overlaps with his upcoming on-call shift. Six months ago, this would have meant frantic Slack messages or at least one phone call. Today? Alex’s AI teammate has it covered.

7 Ways Your Incident Management Just Got a Boost (New Feature Rundown)

All the things you may have missed that will make your incident management smarter, faster, and simply easier. We ship updates every week because we want you to get the most out of FireHydrant. But we also know it's hard to stay up to date and read every week's changelog (even though we know reading changelogs is the highlight of your week).

Experimenting With Different Scripts

It all began when I spun up an AWS t4g.small burstable instance for a side project. Nothing unusual, just another day in the cloud. But the moment I connected through SSH, something caught my eye. The system greeted me with a temperature reading of -273.5°C. Wait… what? That’s essentially 0 Kelvin, the point where atomic motion completely stops. In other words, absolute zero, a state that’s theoretically impossible for anything to operate in.

Understand the ROI of BigPanda: Top quantitative and qualitative findings

We published the first report showcasing the business value of the BigPanda platform, based on both quantitative and qualitative feedback from more than 20 enterprise customers. The Business Value of the BigPanda Platform report provides tangible insights into our platform’s impact on business outcomes.

Agentic ITOps: The evolution of AIOps

Enterprise IT departments are struggling to keep up with the dramatic increases in complexity, fragmentation, and chaos in their IT environments. Legacy tools and processes designed for monolithic systems and static infrastructures cannot meet these challenges. Enterprise ITOps requires a more agile and intelligent approach that leverages advances in AI and automation to remain scalable, effective, and sustainable.

The 2025 Guide to Open Source Status Page Software

This is an updated version of the 2024 article. Maintaining transparent communication about service availability is crucial for businesses of all sizes. Status pages are an important part of your communication strategy during times of outages and maintenance events. You can choose to go with a fully managed status page provider or host an open-source one yourself.

Introducing the ilert × Livewatch native integration

We’re excited to announce that ilert now offers a native integration with Livewatch, unlocking seamless incident escalation from monitoring to response. Starting today, all alerts generated by Livewatch can be automatically ingested, grouped, escalated, and managed from within ilert – closing the loop between detection and resolution.

Demo - WhatsApp notifications

Demo – WhatsApp notifications: When generally available, the integration with WhatsApp will allow your PagerDuty notifications to be delivered instantly and reliably via WhatsApp. Security is built in via WhatsApp’s end-to-end encryption per their terms and conditions, so your critical information stays private and protected. With this user-friendly experience, you’ll receive alerts with clear formatting, actionable buttons, and key context, right inside WhatsApp.

How to connect Microsoft Teams with OneUptime.

OneUptime is a comprehensive solution for monitoring and managing your online services. Whether you need to check the availability of your website, dashboard, API, or any other online resource, OneUptime can alert your team when downtime happens and keep your customers informed with a status page. OneUptime also helps you handle incidents, set up on-call rotations, run tests, secure your services, analyze logs, track performance, and debug errors.

The True Cost of Alert Fatigue: Why AI Incident Management Matters

In modern IT environments, monitoring tools are designed to keep businesses safe, reliable, and always on. Yet the flood of alerts generated by these systems often creates more harm than help. IT teams are inundated with constant notifications, many of which are duplicates, low-priority issues, or false positives. Over time, this leads to alert fatigue, a state where staff become desensitized and critical incidents slip through the cracks.

Stop Duplicate Alerts From Overwhelming Your On-Call Teams

Being on-call is one of the toughest responsibilities in IT. Engineers must be ready to respond at any hour, often balancing the stress of urgent incidents with everyday operations. But nothing drains energy faster than duplicate alerts. When one problem triggers dozens of notifications across different devices or monitoring tools, on-call teams spend valuable time sifting through noise instead of resolving the real issue.
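
One common way to tame this kind of noise is to fingerprint alerts and suppress repeats that arrive within a time window. The sketch below is a minimal Python illustration under assumed field names (`source`, `host`, `check`, `timestamp`) and an assumed 5-minute window. It is not the method of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring tool that emitted the alert
    host: str
    check: str
    timestamp: float  # seconds since epoch


def deduplicate(alerts: list[Alert], window_seconds: float = 300.0) -> list[Alert]:
    """Keep the first alert per (host, check) fingerprint and drop repeats
    that arrive within window_seconds of the previous matching alert."""
    last_seen: dict[tuple[str, str], float] = {}
    kept: list[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        # The source is deliberately left out of the fingerprint so that
        # the same problem reported by two tools collapses into one alert.
        key = (alert.host, alert.check)
        previous = last_seen.get(key)
        if previous is None or alert.timestamp - previous > window_seconds:
            kept.append(alert)
        # Sliding window: every repeat extends the suppression period.
        last_seen[key] = alert.timestamp
    return kept


if __name__ == "__main__":
    incoming = [
        Alert("datadog", "web-1", "cpu_high", 0.0),
        Alert("prometheus", "web-1", "cpu_high", 10.0),
        Alert("datadog", "web-2", "cpu_high", 5.0),
        Alert("datadog", "web-1", "cpu_high", 400.0),
    ]
    for alert in deduplicate(incoming):
        print(alert)
```

The choice of fingerprint fields is the main design decision: too narrow and duplicates slip through, too broad and distinct problems get merged.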

Top 9 HIPAA Compliant Answering Services (2025 Guide)

When patients call your clinic, every second matters. Whether they’re scheduling an appointment, asking about a prescription, or reaching out after hours, they expect a live, compassionate voice…not a voicemail box. To ensure that this is the case, many teams partner with HIPAA-compliant answering services. These providers offer 24/7 coverage with trained operators who safeguard protected health information (PHI), and follow strict security standards to ensure compliance with HIPAA.

Identify recurring issues and reveal their root cause with BigPanda IT Problem Management

For many enterprises, incident response feels like déjà vu. The same issues keep happening over and over, eating up time, draining resources, and wearing down your teams. In fact, 20-40% of IT incidents are typically recurring issues, created by unresolved underlying problems. Teams prioritize speed over permanence, patching symptoms instead of addressing the root cause. They often lack the right context, documentation, or shared knowledge to permanently fix issues.

SIGNL4 October 2025 Release - Dynamic Content for Signls

Take your SIGNL4 alerts to the next level with Dynamic Content for Signls! This feature lets you automatically inject real-time data into your alerts - from incident details and ticket IDs to device statuses and system metrics. Deliver richer, more actionable notifications and help your team respond faster and smarter. Learn how to configure Dynamic Content for Signls, see real-world use cases for IT, OT, and IoT operations, and make your alerts meaningful and context-aware.

Demo - Backstage Integration

Demo – Backstage Integration: PagerDuty's upcoming plans for enhanced Backstage integration will power deeper insights for incident resolution by letting users pass the most relevant information from their Backstage instance into PagerDuty, so that this context is available to incident responders as part of their existing workflows.

Demo - PagerDuty end-to-end Incident Management in Slack (includes Slack Work Objects)

End-to-end walkthrough of managing incidents entirely within Slack using PagerDuty's integration. This demo shows how PagerDuty automatically creates dedicated incident channels, brings in the right teams based on on-call schedules, and enables immediate response through suggested actions. Features demonstrated include.

H2 2025 Product Launch Demo

Discover PagerDuty’s latest product innovations, designed for end-to-end incident management, purpose-built AI-powered operations, and a fully integrated developer experience. We’re talking over 150 enhancements to our core incident management experience, new AI agents that remove toil before, during and after incidents, and new integrations so teams can work where they want.

5 Common Meraki Alert Problems and How to Fix Them

Cisco Meraki is built to simplify cloud-managed networking, but for many IT admins, its alerts can quickly become overwhelming. From false positives to duplicate notifications, these Meraki alert issues drain time and distract from real problems. The good news is that most of these challenges are preventable with the right Meraki troubleshooting and the addition of smart incident management. Let’s explore five of the most common Meraki alert problems and how to fix them.

PagerDuty H2 2025 Release: 150+ Customer-Driven Features, AI Agents, and More

My first 6 months here at PagerDuty have been a thrilling ride! PagerDuty continues to set the pace in incident management. With our 16-year track record of helping companies forge a path towards modern operations, we’ve been trusted by over 32,000 companies as the incident management platform of choice. Over these years, we’ve continuously delivered value to our customers at a rapid pace. And our customers have been vocal with us about wanting more.

BigPanda & Jira Service Management: Enterprise-wide visibility meets team-level autonomy

Business teams today move fast. Developers, site reliability engineers (SREs), and product owners expect to manage incidents, changes, and requests in a way that fits naturally into how they already work with tools like Jira and Confluence. Customers expect a seamless service experience powered by automation and AI. The result is a wave of teams adopting tools like Jira Service Management to get everything they need in one place without slowing down.

Top 10 HIPAA-Compliant Messaging Apps (2025): A Guide to Secure Healthcare Communication

Secure communication in healthcare is no longer optional. With patient data, lab results, and care coordination increasingly handled over mobile and digital channels, hospitals and clinics need tools that keep messages safe and compliant with HIPAA regulations. A HIPAA-compliant messaging app goes beyond standard texting apps, offering encryption, audit trails, and signed Business Associate Agreements (BAAs) to meet the requirements of the HIPAA Security Rule.

Signal Enrichment: Turning Noisy Alerts into Actionable Intelligence

This is the fourth post in our series on the future of incident management, which builds upon The Future of Incident Management: Your Blueprint for Operational Excellence, How Native Process Automation and Auto-Remediation Drive Operational Excellence, and Service Intelligence is the Future of Proactive Incident Management.
Sponsored Post

Top 10 Reasons Why You Need a Status Page Aggregator

Managing dependencies on multiple third-party services has become a critical challenge for modern engineering teams. A status page aggregator solves this by centralizing monitoring across all your vendors' status pages into a single dashboard, giving you real-time visibility into potential issues before they impact your users. Whether you're managing a complex microservices architecture or simply relying on various SaaS tools, understanding when and why your dependencies fail is crucial for maintaining service reliability.

Introducing the BigPanda observability tool rationalization framework

Enterprises face spiraling observability costs. Gartner reports a 20% year-over-year rise in spending, with the median spend per observability tool reaching $800,000 annually. The average organization using BigPanda coordinates data from ~20 different observability solutions, each taking up an ever-larger portion of IT budgets.