
RedIron: Unifying Alerts and Notifications in IT

RedIron Canada is a Managed Services Provider (MSP), Retail Integrator, and Solutions Provider that specializes in managing cloud-based systems across AWS, Azure, and Oracle. Their expertise in IT monitoring and managed services makes them a trusted partner for retail businesses across North America. RedIron relied on traditional alert notification methods like email and SMS for their IT monitoring operations.

PagerDuty Runbook Automation 2024 Year in Review

Special guest Jeff Hausman, PagerDuty’s Chief Product Development Officer, kicks off our 2024 recap for PagerDuty Runbook Automation and Rundeck Open Source. Then Jake and Forrest take us through all of the amazing improvements and new features added to the product, including shout-outs to the amazing folks contributing to the Open Source repos and a customer success story from Ryanair.

A Plan to Achieve IT Resilience

Ensuring your organization can continue running critical services, even during unexpected challenges, requires a solid IT resilience plan. An IT resilience plan involves more than just traditional disaster recovery. It focuses on keeping vital applications, data, and business operations intact no matter what happens. In this guide, we’ll explore key components and best practices to help you establish a comprehensive plan for ongoing business continuity.

January 2025 Product Update - Easier Onboarding, Better User Experience, and Reliability Improvements

For the last two months, we have focused on improving the onboarding experience for users so that they can get started with monitoring with minimal effort. We have also added several improvements in the backend to make the service more robust and reliable. Some of the usability improvements are driven by user feedback. Others incorporate what we would personally like to see in such a monitoring service. We have also improved the dashboard user experience.

Enhancing Your Developer Experience: New SDKs for TypeScript, Go, and Terraform and Improved API Documentation

We built FireHydrant to be the kind of platform we’d want to use as developers, giving you the same tools and flexibility we rely on every day. With over 350 publicly accessible API endpoints, we’ve always believed in giving developers the power to customize and extend our platform to meet their exact needs.

What's New: Supercharge workflows with Message Templates

We’re excited to introduce Message Templates, a powerful new feature designed to streamline communication and ensure consistency across teams. With pre-configured templates curated by Enterprise Administrators, OnPage phone app users can now send standardized messages with just a few taps—saving valuable time and reducing the risk of miscommunication in critical situations.

This Month in Datadog: Datadog On-Call is now generally available

Datadog is constantly elevating the approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. To learn more about Datadog and start a free 14-day trial, visit Cloud Monitoring as a Service | Datadog. This month, we put the Spotlight on Datadog On-Call.

The Evolution of Enterprise Incident Management

In today's fast-paced digital era, ensuring seamless operations is more critical than ever for enterprises. Systems are more complex, customer expectations are at an all-time high, and the margin for error has dramatically narrowed. The way organizations respond to and manage incidents has undergone a remarkable transformation. From the reactive approaches of the past to the AI-driven, proactive strategies of today, enterprise incident management has evolved to meet the challenges of a rapidly changing technological landscape.

Learnings from eight major outages of 2024 and best practices to stay prepared

While we cannot eliminate internet outages, lag, or security breaches, reflecting on the lessons learned from these events helps us cope, innovate, and implement measures to reduce how often they occur. In 2024, website and application outages had a significantly greater impact on the world than in previous years, leaving the IT community with valuable insights to consider.

How to streamline ITIL processes for incident management

Are you facing challenges with incident routing, lengthy resolution times, or inconsistent team communication? If so, the IT Infrastructure Library (ITIL) can help. It’s a proven framework that goes beyond fundamental incident management to improve IT reliability, speed up issue resolution, and enhance overall IT service delivery. ITIL processes can help you save time, resources, and headaches.

How to migrate to Grafana IRM: find the right path for your organization

Hundreds of organizations have migrated from legacy incident response tools to Grafana IRM in recent years as they look to improve production reliability, reduce costs, and consolidate their tooling. Grafana IRM, our incident response and management product, has helped organizations such as LATAM Airlines simplify stressful incidents with observability-native workflows, but every organization has its reservations about the actual migration process.

The AI Revolution in Incident Management: Insights from the Frontlines

Cofounder Doreen Jacobi spoke with several of our customers about the revolution AI is bringing to incident management. Artificial Intelligence has seamlessly integrated into our daily lives, often in ways we barely notice. But what does that actually mean for industries facing complex challenges, like incident management? What real benefits does AI bring today, and how might it shape the future?

Feature Spotlight - Incident Insights

To help mitigate and resolve incidents even faster, our AI-powered incident insights provide immediate and actionable suggestions during the response process and provide additional context during post-incident reviews. From the Insights panel, you can easily review suggested resolvers and information about similar incidents that may help with the resolution process. To help further speed things up, when an insight is more likely to help resolve the incident, it's displayed as a popup in the Incident Console.

The Domino Effect of Outages with Nuno Tomás, Founder of isDown.app

Humans of Reliability: Keeping systems up and the lights on isn’t just about technology; it’s about the people behind it. In this episode, we’re thrilled to chat with Nuno Tomás, founder of isDown.app, a vendor outage monitoring tool transforming how teams handle third-party incidents. Nuno shares his journey from software engineer to entrepreneur, the pivotal 4 a.m. moment that inspired isDown, and the challenges of balancing startup life with family. We dive into the complexities of incident communication, how to tackle alert fatigue, and why transparency is key to building trust in SaaS.

5 IT Myths That Are Costing You Time and Money

In the fast-paced world of IT operations, myths often masquerade as truths, leading organizations down inefficient and costly paths. Let’s look at five of the most pervasive myths and explore why modern solutions like PagerDuty Operations Cloud are essential for thriving in today’s complex IT environments. Myth 1: Kubernetes is self-healing, and no other tools are required. The Reality: While Kubernetes is often touted as a self-healing platform, this is only partially true.

Accelerate incident triage with AI-Powered Event Management

IT Operations teams must detect and address incidents quickly to ensure efficient operations and reliable IT infrastructures. As organizations grow and scale their service offerings, their IT environments inevitably become more complex. Filtering through alerts becomes increasingly challenging due to excessive noise and a lack of end-to-end visibility. As a result, IT operations teams are forced to escalate issues more frequently.

ServiceNow Integration Now Generally Available (Plus, Inbound Field Mapping)

We’re thrilled to announce that our ServiceNow integration is now generally available (GA). For enterprises that rely on ServiceNow to power their ITSM, this integration creates a seamless bridge between engineers responding to incidents in FireHydrant and the broader organization. At FireHydrant, we are committed to delivering enterprise-grade solutions that go beyond the basics.

Ask the Expert: Heath Newburn on Balancing Innovation, Compliance and Resilience in Financial Services

Financial entities face increasing complexity in digital operations, making resilience, compliance, and incident management more critical than ever. Heath Newburn, PagerDuty’s Global Field CTO, shares his expertise on tackling these challenges, balancing innovation with compliance, and building operational resilience for lasting success. What are the top challenges that you hear from IT leaders in regulated industries like financial services?

Overhauling PagerDuty's data model: a better way to route alerts

Since its launch in 2009, PagerDuty has been the go-to tool for organizations looking for a reliable paging and on-call management system. It’s been the operational backbone for anyone running an ‘always-on’ service, and it’s done the job well. Ask anyone about the product, and you’re all-but-guaranteed to hear the phrase “it’s incredibly reliable.” I agree. But reliability isn’t everything.

11 DevSecOps Benefits & Value to Your Business

Data security and DevSecOps should be top priorities for every business, but some of us may fear the complexities of implementation. Many organizations are still shelving security concerns in favor of quick IT upgrades and software development. Security is no longer optional. Changes to the laws that govern the collection and use of personal information have forced many to prioritize security sooner rather than later.

Managing IT operations during a crisis

As work environments for entire industries continue to evolve between on-site, remote, and hybrid models, the performance of IT operations (ITOps) teams is more critical than ever. If you need proof, just remember the global impact of the CrowdStrike outage. Operations teams must monitor, triage, communicate, and manage incidents 24×7 across all services. SaaS, legacy on-premises, and homegrown tools and systems are all stretching to meet business demand. Customer expectations are ever-increasing.

ITOps and ITSM are ripe for CIOs looking to adopt GenAI

In a recent webinar, BigPanda CEO Assaf Resnick noted that for the last 15 years, CIOs staked their reputations on how effectively they could move their enterprises to the cloud. Assaf predicts CIOs will focus on integrating generative AI into their enterprises over the next 10 years to deliver tangible business value. IT operations (ITOps) and IT service management (ITSM) offer significant opportunities to incorporate AI to enhance and accelerate their processes.

Building Resilience and Compliance in Finance: Insights from PagerDuty's Lee Fredricks

In an era where regulatory frameworks like DORA, FCA PS21/3, and PRA PS6/21 demand higher standards for financial resilience, European financial entities face growing pressures to ensure compliance and operational excellence. To understand these challenges, we spoke with Lee Fredricks, Director of Solutions Consulting for EMEA at PagerDuty.

Enrich your on-call experience with observability data at your fingertips by using Datadog On-Call

The stress, sudden disruptions, and high stakes of resolving issues while on call are among the most challenging aspects of an engineer’s job. Many organizations, from startups to large enterprises, still struggle with their on-call experience, which leads to longer resolution times and lower employee retention rates. Constant context switching, managing multiple tools, and racing against time to resolve issues can cause frustration, burnout, and inefficiency.

2025 Starts Here: PagerDuty Innovations to Help You Tackle What's Next

As we enter 2025, we reflect on a year committed to innovation and customer success at PagerDuty. In 2024, we introduced capabilities that empowered operations teams to mitigate risks, protect customer trust, and improve business outcomes. From managing global outages to addressing complex digital operations, the PagerDuty Operations Cloud enabled organizations to respond faster, work smarter, and build operational resilience.

The Impact of Artificial Intelligence on Modern Software Development

Artificial intelligence (AI) is reshaping industries, and software development is no exception. By integrating AI technologies like machine learning, generative AI, and natural language processing, development teams can optimize workflows, enhance code quality, and reduce time-to-market. In this article, we’ll examine AI in software development, including its benefits, challenges, and most recent developments. Let’s get started.

Notify clients about incidents using AI

During the heat of incident response, staying focused on resolving the issue quickly is essential. Crafting clear and accurate incident updates, however, can be challenging under pressure. That’s where ilert’s AI-powered incident communication feature makes all the difference. This feature is a part of the ilert AIOps add-on.

xMatters Yars' Revenge Release

If you’re not an expert in destroying energy shields, dodging enemy swirls, or using space cannons to avenge your home planet like players in Yars’ Revenge, don’t worry! Our latest release is here to help you focus on fighting incidents that are a little more down to earth! Let’s take a look at some of the new features you’ll find in your incident-fighting arsenal.

How data habits help build a data culture

It's no secret that building a data-driven culture in a company is hard, but what is it exactly that makes this such a tricky endeavor? Contrary to popular belief, technology isn't the main hurdle. A recent survey reveals that only a quarter of respondents cite technological limitations as the primary obstacle to becoming data-driven.

What is Alerting?

Alerting is a central component of modern safety and operating concepts. It is used to act quickly and effectively in hazardous situations. From operational alerting in operations management to alerting the population, there are various scenarios that cover specific requirements and areas of application. In this article, we provide an overview of the various alerting methods and their significance.

The three pillars of observability

Do you feel you’re always playing catch-up with incidents? If so, you’re not alone. As IT environments become more complex, alerts keep piling up, and finding the root cause feels like searching for a needle in a haystack. And ITOps and incident responders are left scratching their heads and wondering: what went wrong? It can be frustrating when you don’t have end-to-end visibility into your systems. This is where observability comes in.

Kickstart your investigations and reduce alert noise with Doctor Droid's offering in the Datadog Marketplace

Being an on-call engineer is often overwhelming, requiring you to pivot between tickets, dashboards, runbooks, and different data sources as you try to separate legitimate incidents from unnecessary noise. Not only does the process of investigating irrelevant alerts take time away from remediating important issues, but it also compounds alert fatigue.

Accelerate Incident Investigation with Biggy AI

Meet BigPanda Biggy AI, the interactive AI that’s purpose-built for incident responders. Built on BigPanda’s AI-powered ITOps and incident management platform, Biggy streamlines troubleshooting for incident management by aggregating data from sources such as observability tools, service history, informal and institutional knowledge, and more.

Introducing Alert Grouping: Less Noise, More Signal

Imagine this familiar scenario: it’s 2 a.m., and a critical service goes down. Your phone starts buzzing nonstop with alerts — all essentially saying the same thing. It’s overwhelming, distracting, and makes it that much harder to focus on fixing the problem. Enter Alert Grouping — it’s our smarter way to manage alerts, designed to help you cut through the clutter and focus on what matters.

Ops Centric AI: The foundation of best-in-class incident management

Your ITOps and Incident Management teams face thousands of alerts daily. How can they find the “needle in the haystack” to prevent critical alerts from escalating into incidents that impact users and customers? This challenge plagues modern IT departments as alert noise, fragmented data, and chaotic workflows extend response times and undermine service reliability.

On-Call Scheduling Software - which is the best in 2025?

Managing on-call schedules is a critical challenge for many industries, including healthcare, IT, customer support, and emergency services. As technology evolves, on-call scheduling software has become an essential tool for streamlining workflows, reducing burnout, and improving team efficiency. In 2025, the best on-call scheduling software not only simplifies schedule creation but also integrates with other tools, enhances communication, and ensures compliance with labor laws.

What is observability?

Modern IT environments are complex and interconnected, making observability essential for maintaining system and application performance. The challenge is not just about ensuring systems run smoothly; it’s about understanding the complicated web of data, services, and user interactions that drive your operations. This is where observability comes into play. Observability offers a deeper understanding of why issues arise in the first place.

The top three insights from Gartner IOCS 2024

BigPanda was honored to be a premier sponsor of Gartner’s IT Infrastructure, Operations & Cloud Strategies Conference (IOCS) in Las Vegas, Nevada. This event allowed us to showcase the latest BigPanda capabilities, connect with industry leaders, and gain valuable insights into the future of IT operations. For those who couldn’t attend, here are the three most impactful insights from my conversations with the customers, vendors, and analysts at IOCS 2024.

Top 5 outages detected by StatusGator in December 2024

As we step into the new year, we’re excited to continue providing early detection and updates for the services you rely on. But before we dive into 2025, let’s take a moment to recap some of the most notable outages from December 2024. From login issues to platform-wide disruptions, December was eventful, and StatusGator was there to keep users informed ahead of time. Here’s a look back at the top outages we detected.

7 Incident Communication Templates (+ Best Practices)

In today's tech world, clear communication during incidents is crucial. Whether it's a small issue or a major outage, how you communicate with stakeholders can build trust and speed up resolution. This post explores the essential elements of incident communication templates, providing a straightforward guide to crafting clear and concise messages. From planned maintenance to critical system failures, we'll cover a range of templates for different situations, so you're prepared for anything.

The Benefits of On-Call Management Software

In today’s fast-paced business environment, ensuring that critical issues are addressed promptly is essential for maintaining operational efficiency and customer satisfaction. On-call management software plays a pivotal role in organizing and scheduling teams to respond to emergencies or urgent situations at any time, but especially after business hours when offices and operations centers are unstaffed or sparsely staffed.

ChatGPT Outage: How StatusGator notified before OpenAI and Microsoft

On December 26, 2024, a ChatGPT outage disrupted access for countless users worldwide. This was a major outage affecting not just the ChatGPT web interface but the entire OpenAI platform, including their APIs. The incident was traced back to a power issue in Microsoft Azure’s South Central US data center, which took down many other Azure customers. StatusGator customers received Early Warning Signal notifications before either provider updated their public status pages.