Panel: Handling Incident Response - Dash 2021 (Datadog, PagerDuty)

When customer-impacting downtime happens, it’s crucial that responders are prepared and can resolve the issue as quickly as possible. Knowing the right tools to use, wherever you are working from, helps you put a well-defined strategy in place so the team can come together, work the problem, and get to a solution quickly. In this roundtable discussion, PagerDuty and Datadog engineers chat about incident response and how we use all the tools at our disposal to respond quickly and effectively.

Is it a ghost or is it Flow Designer?

Maybe it’s the time of year or the change in temperature, but sometimes using xMatters Flow Designer can seem a little… spooky? Maybe it’s the unlimited capability it offers, or maybe it’s that it can make changes for you without you being aware they’re taking place. But every once in a while, we’re not sure whether we’ve just set up workflows too effectively or whether something a touch paranormal is happening with xMatters.

Improve your on-call experience with Datadog mobile dashboard widgets

Life happens—even when you’re on-call. You can’t take your laptop everywhere, but whether you’re on the train, at dinner, or at the gym, you can count on the Datadog mobile app for access to key data about the status and performance of your applications. Now, you can use Datadog mobile widgets to build an on-call mobile dashboard directly on your phone’s home screen, so it’s even easier to track the data you care about from anywhere.

Strategies to Reduce Hospital Readmission Rates

The Centers for Medicare & Medicaid Services (CMS) scrutinizes hospital readmission rates across the U.S. each year, and it levies financial penalties on organizations that overshoot acceptable hospital readmission rates. As healthcare systems across the country embark on a journey to introduce patient-centric models to their organizations, they must align their resources with ever-changing regulations for them to thrive.

Now Available: Private Slack Channels

Ever heard the saying “Too many cooks”? If you’ve responded to incidents, you’ll likely understand the parallels. There are cases when incident command on a public channel isn’t the best option. Whatever your reason, we’ve got you covered: users can now spin up a private Slack channel for an incident. Read more about how to do this here.

Differences between Site Reliability Engineer Vs. Software Engineer Vs. Cloud Engineer Vs. DevOps Engineer

The evolution of Software Engineering over the last decade has led to the emergence of numerous job roles. So how do a Software Engineer, a DevOps Engineer, a Site Reliability Engineer, and a Cloud Engineer differ from one another? In this blog, we drill down and compare these roles and their functions.

Customer Service Ops & PagerDuty Zendesk Integration v3 Full Case Ownership Use Case

PagerDuty's Zendesk Integration enhances communication between engineering and support teams by providing visibility to high-impact incidents via the PagerDuty Status Dashboard that is integrated into the Zendesk interface. Automate workflows for a fast-paced support team and provide the right level of information so they can interact knowledgeably with their customers while also reducing time and effort.

PD, Salesforce Service Cloud, Slack: Proactive Case Escalation & Slack-First Intelligent Swarming

Learn about and see how PagerDuty, Salesforce Service Cloud, and Slack empower collaboration across your organization to accelerate time to resolution. Proactively improve customer satisfaction in real time and break down silos to connect customer service teams with engineering teams to address incidents quickly when seconds matter. Enjoy greater control when resolving issues and anticipating customers' needs through an incident command console that gives customer service agents and stakeholders instant updates on critical, customer-impacting issues.

Five steps to better customer communication

When you’re deep into an incident and there are alerts firing, decisions to be made, and people to escalate to, it’s easy for outward communication with your customers to fall off the priority list. In many regards this makes sense; it seems natural to put all of your focus and energy into minimising the impact and getting things back on track as soon as possible.

What's New: Extending our Datadog Capabilities With New PagerDuty Widgets

In the last two years, we have seen the rise of remote and hybrid work, and with that, a proliferation of tools and apps needed to support critical communication and collaboration. Finding that app-life balance has become increasingly complex, so simplifying “how” we work is key for every organization.

Why ChatOps & Incident Management are the Perfect Pair

ChatOps has become an integral part of software development and IT operations, as teams rely on automated notifications to take the place of manual alerts. In the past, if there was an alert, someone would need to manually find that notification. Then, they would have to contact team members one by one so they could start working on a resolution. In this complex network of communications, it was easy to lose information, duplicate work, and simply waste time coordinating the team.

Service Profile: Activity Tab Updates

PagerDuty's new service profile enhancements allow you to better command and control incidents directly from the Service Profile. Now you can perform bulk actions on incidents like acknowledge or resolve, search by incident ID, add and view change integrations, browse resolved incidents, view related escalation policies from the service profile header, and more.

Next Generation Slack Migration Tool and Stakeholder Updates Demo

Learn more about PagerDuty's Collaboration Applications that help you streamline incident remediation. Enjoy these demos of our latest updates to our PagerDuty Slack and Microsoft Teams Applications including the Webhook Migration Tool, Stakeholder Updates, and Resolution Notes.

Runbook Automation: Rundeck Service Ownership Demo

Learn how PagerDuty Runbook Automation enables developers and service owners to equip other engineers, such as operations engineers or other developers, with mechanisms to help support their services. Service owners can let other team members support their services via automated runbooks that apply short-term fixes, reducing escalations to service owners.

When built-in alerting is not enough

Many ITOM or ITSM tools come with built-in features for alerting and notifications and are able to send at least an email or text notification upon incidents to operations teams. But is this enough reliability to respond to and handle major and critical incidents? Recently, we have been surprised to see more and more monitoring tools listed as alerting tools on review platforms like G2.

Automated Diagnostics for Incident Response Demo

Learn how you can speed up resolution times with Automated Diagnostics. Automate away as much manual toil as possible so teams can work more productively. Learn how teams across the organization can embrace workflows that help to diagnose and remediate incidents.

OnPage Clinical Communication and Collaboration Platform

Modern healthcare teams require a modern solution to streamline clinical communications and medical workflows. In life and death situations, it’s critical that physicians receive immediate alerts and messages to provide patient care promptly. OnPage is the industry’s most trusted clinical communications platform. OnPage is more reliable and secure than traditional pagers. The system enables care teams to easily communicate and achieve maximum patient satisfaction.

Postmortem Pitfalls

Last week, we spent some time talking to Gergely Orosz about our thoughts on what happens when an incident is over, and you're looking back on how things went. If you haven't read it already, grab a coffee, get comfortable, and read Gergely's full post Postmortem Best Practices here. But before you do that, here's some bonus material on some of our points.

A developer's guide to programmatically overcome fear of failure

People are more than happy to talk about their successes, but if you ask them about their failures, they can be much more hesitant to share. Failure is a subject that, interestingly enough, is entangled with the emotion of shame. Yet it’s integral to achieving anything novel, and the learnings that come from failure are unparalleled. So, let’s find ways to get more comfortable with failing, and figure out why people fear it.

Incident Management Metrics That Matter - 2021

What are the key incident management metrics and KPIs? How important is it to track your team’s performance? If you are not doing so already, the time is right to get your finger on the pulse by better understanding and managing your organization’s key incident management metrics. How a company manages IT incidents matters and, most importantly, the process has the power to impact sales – recent studies indicate 52% of U.S.

Uptime/SLA calculator: what is an SLA and how to calculate it?

A Service Level Agreement (SLA) is a document that details the expected level of service guaranteed by a vendor or product. This document generally sets out metrics such as uptime expectations and any payouts if these levels are not met. For example, if a provider advertises a monthly uptime of 99.9% and exceeds 43 minutes and 50 seconds of service downtime in a month, technically the SLA has been breached and the customer may be entitled to some type of remuneration depending on the agreement.
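
The arithmetic behind a figure like this can be sketched in a few lines of Python. This is a minimal illustration, not a calculator from the post: the function names `allowed_downtime` and `is_breached` are hypothetical, and it assumes the measurement window is an average calendar month (365.25 / 12 days), which is the convention that gives roughly 43 minutes 50 seconds for a 99.9% target. A real agreement defines its own window and exclusions.

```python
from datetime import timedelta

# Assumption: the SLA is measured over an "average month" of 365.25 / 12 days.
# An actual agreement may use a calendar month, a quarter, or a year instead.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def allowed_downtime(uptime_percent: float,
                     period: timedelta = AVERAGE_MONTH) -> timedelta:
    """Return the maximum downtime an uptime target permits within `period`."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    downtime_fraction = (100 - uptime_percent) / 100
    return timedelta(seconds=period.total_seconds() * downtime_fraction)


def is_breached(uptime_percent: float,
                observed_downtime: timedelta,
                period: timedelta = AVERAGE_MONTH) -> bool:
    """Return True when observed downtime exceeds the permitted budget."""
    return observed_downtime > allowed_downtime(uptime_percent, period)


if __name__ == "__main__":
    # Print the monthly downtime budget for a few common uptime targets.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% uptime allows {allowed_downtime(target)} of downtime per month")
```

Passing a different `period`, for example `timedelta(days=365.25)`, gives the yearly budget for the same target.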

Intelligent Alert Grouping: What It Is and How To Use It

It’s 2 AM and you’re paged when you’re still awake – how well can you find what you need to fix the latest mistake? When the incident begins it might only be impacting a single service. But as time progresses, your brain boots, the coffee is poured, and the docs are read, and all the while the incident is escalating to other services and teams whose alerts you might not see if they’re not in your scope of ownership.

What Operational Maturity Looks Like Today With PagerDuty's Kyle Duffy

Companies that underwent accelerated digital transformations during the past 18 months are looking to understand how they can improve their operational maturity to handle the increase in complexity. This is paramount to an organization’s future success.

4 Pressures at Tech Companies xMatters Can Help Relieve

Technology companies are at the forefront of innovation, changing the way consumers and the general public interact with their everyday lives. As the late Stan Lee so wisely stated, “with great power comes great responsibility,” and this heightened pressure often leaves little room for error when an issue arises—which happens more often than you’d think.

Process binds technology and people in cloud maturity success

This is the final blog in our series focusing on CloudOps maturity, where we’ve been looking at the key findings from a recent IDC study, commissioned by PagerDuty. In our previous blogs, we discussed the people-based transformations and the technological changes that organizations must undergo to mature their CloudOps practices.

AIOps - What It Is, Why It Matters, and Advice for Adopting It

The link between DevOps and artificial intelligence for operations (AIOps) has only started to become clear within the last few years. Monitoring and alerting have evolved from a "black box approach," where you don't actually know what's happening, into observability, where you have access to data that provides everything you possibly need to know about your IT systems. How does AIOps come into play? AIOps is the practice of applying artificial intelligence, machine learning, and advanced analytics to automate and improve IT operations. Since Gartner introduced it as a formal discipline in 2016, IT teams have been trying to figure out how to employ it to make their lives easier.

Should you care about AIOps? Obviously.

There's a lot of hype in the marketplace about AIOps right now, and there are a lot of people who've got some interesting ideas about what it should be. The most common idea that I hear is that it's essentially a layer of AI magic that sits across everything that you've got in your IT tooling today, makes sense of all of that for you, and then decreases the number of incidents you have and reduces your MTTR...

Incident Management Process: 6 Tips to Better Prepare Your IM Process for the Holiday Season

Holiday retail sales are likely to increase between 7% and 9% in 2021, according to Deloitte’s annual holiday retail forecast, with holiday sales totaling $1.28 to $1.3 trillion during the November to January timeframe. Deloitte also forecasts that e-commerce sales will grow by 11-15%, year-over-year, during the 2021-2022 holiday season.

How Patient-Centered Care Improves Patient Outcomes

The patient-centered care (PCC) model enhances the way providers interact with patients during the care delivery process. Clinicians that show compassion and empathy toward patients are more likely to achieve meaningful, positive doctor-patient relationships. Indeed, care teams that prioritize PCC have a proven approach to improving patient satisfaction and increasing patient retention.

How Your ITSM Tool & PagerDuty Make a Dynamic Duo for Real-Time Work

There’s an incident. Your teams need to communicate with the development team that owns the service, but that team is too busy to stop and chat. Meanwhile, you in central IT have business leaders asking for updates, angry internal users calling the help desk, and customer service representatives asking for information. You have hundreds of tickets all pertaining to the incident in your ticketing system.

What SREs Can Learn from Facebook's Largest Outage

Facebook’s October 2021 outage was the type of event that gives SREs nightmares: A series of critical business apps crashed in minutes and remained unavailable for hours, disrupting more than 3.5 billion users around the world and costing about 60 million dollars. As incidents go, this was a pretty big one.

PagerDuty Integration Spotlight: Honeycomb

Honeycomb delivers observability for modern engineering and DevOps teams to observe, debug, and improve production systems efficiently. The PagerDuty + Honeycomb integration uses Honeycomb Triggers to notify on-call responders based on alerts sent from Honeycomb. This integration is maintained and supported by Honeycomb. Liz Fong-Jones from Honeycomb joined us live on Twitch to share more about how Honeycomb and PagerDuty can be used together to help your teams and to do some live investigation into Honeycomb’s own performance data.

4 xMatters Use Cases That May Surprise You

xMatters is part technology, part service reliability, and a little bit of magic. If you’ve spent time on the xMatters website, you’ll likely have seen a number of valuable use cases for the platform—it can alert SREs when there’s a website outage, it can accelerate product development for DevOps teams, it can manage on-call schedules and alerts for support teams.

The Cost of Increasing Incidents: How COVID-19 Affected MTTR, MTTA, and More

Digital transformation accelerated for many companies during the last 18 months. While it may have been on the agenda prior to COVID-19, teams were pushed to extreme speeds to digitize and meet the rising online demand. During this time, organizations learned important lessons that they’ll carry on with them into this new future. Leaders can take these learnings and use them to build better products, healthier and more efficient teams, and a happier customer base.

Monthly Moo Update | October 2021

There are a number of monitoring and observability solutions on the market today. It almost reminds me of the automobile market and the endless number of automobiles available. Sure, they all get you from point A to point B, in some way. But some automobiles do it faster, smoother, more efficiently, with guidance, more comfort, storage space, perhaps towing capability, and even autonomously. Moogsoft is the automobile you’ve been dreaming about in the monitoring and observability market.

FireHydrant expands Reliability Platform with Service Catalog

Today, we are happy to announce the launch of Service Catalog to help you better manage, query, and learn about the services that exist in your infrastructure. At FireHydrant, we envision a world where all software is reliable, and we’re on a mission to help every company that builds or operates software get closer to 100% reliability. Service Catalog helps you get closer to 100% reliability.

PagerDuty Integration Spotlight: InfluxData

InfluxData is an Open Source Platform built for metrics and events — a platform that is purpose-built for time series data. The essential time series toolkit — dashboards, queries, tasks and agents all in one place. InfluxDB is even more programmable and performant with a common API across OSS, cloud and enterprise editions. Send events to PagerDuty to keep your teams informed. Check out InfluxData’s integration.

Facebook, Instagram, and WhatsApp's Outage - Understanding MTTR

Yesterday the most used social media platforms in the world were inaccessible for 6 hours straight. Later, in a press release, Facebook revealed that the outage was due to configuration changes in their routers. There is no doubt that Facebook has an intense incident response plan, yet a small blind spot resulted in a significant business interruption. So how do we avoid this? The truth is, outages and performance issues are bound to happen in any network.

PagerDuty Integration Spotlight: HashiCorp Terraform

Manage your PagerDuty account objects with Terraform! Reap all the benefits of infrastructure as code and give your teams the flexibility they need to manage their services in real time. As infrastructure stacks grow increasingly complex and involve an ever-growing number of services and systems, teams have looked to abstract configuration to its own layer of code. This concept of configuring infrastructure as code is gaining traction throughout the industry for a variety of reasons.

The Aftermath of the Facebook 6-Hour Outage

Less than 24 hours ago, the world came to a “social standstill” as Facebook, and its sister companies, WhatsApp and Instagram, became unavailable, leaving its 3.5 billion users in a flap. The outage, which lasted almost 6 hours, shut off access for users and businesses all over the world and caused ripple effects that we will likely continue to see in the immediate (and perhaps not-so-immediate) future.

Evaluating Splunk On-Call Alternatives

Splunk On-Call (formerly VictorOps) is a popular incident response and on-call management platform that allows engineering and operations teams to collaborate with ease and resolve issues faster. As part of the Splunk Observability Suite, Splunk On-Call is combined with related products to bring monitoring, troubleshooting, and investigation into a single, comprehensive view, simplifying the process from incident detection to resolution.

PagerDuty Integration Spotlight: LogDNA

LogDNA’s Cloud logging platform helps your DevOps teams find and fix production issues faster so your teams can get back to doing what they do best, building amazing products. Send incident alerts from LogDNA directly to PagerDuty. Check out the LogDNA integration with PagerDuty to get started.

How Service Catalog Increases Productivity

Productivity is defined by measuring the amount of output over a given time frame. However, this discounts the quality of output, which is crucial in moving toward a more complete definition of productivity. For services, increases in productivity generally highlight the number of feature releases over time. This leaves out the critical measurement of quality compared to quantity. This is where a Service Catalog can greatly enhance true productivity within an engineering organization.

Learn where you rank and how it affects digital service resilience

We evaluated where enterprises are positioned in the Incident Management Spectrum and in their journey to digital service resilience and found that incident management needs its own transformation. In the report, you'll learn which approach to incident management is the best for meeting today's business imperatives.

Digital Transformation Secrets: Balancing Innovation and Uptime

Providing a superior digital customer experience is a critical component of business success for technology and digital service providers. But an enjoyable, effective, and reliable customer experience demands new IT architectures and places new expectations on the way SREs, development teams, ITOps, executives, and other previously siloed groups work together. And at what cost? To understand, we asked over 300 DevOps, ITOps and business leaders for perspectives.