Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Reliability Matters. Blameless is Growing with Series B $30M Funding

When Blameless started in 2018, the team set out on a mission to help all engineers achieve reliability with less toil and risk. Three years in, that mission has become more important than ever. What has changed is the rate of adoption of SRE, now the fastest-growing team and practice inside engineering. This represents a clear recognition of the many upsides that an SRE practice brings with its combination of continuous learning, velocity, and resilience.

What's New: Introducing Next-Gen ChatOps With PagerDuty and Slack

In this new world of digital everything, new application versions usually mean that you’re going to get bigger and better features, more capabilities, and an uplifted user experience, right? When I talk to customers, many can’t wait to upgrade the PagerDuty integrations that they depend on to test new features. If you’re a PagerDuty for Slack user, the next-generation version of our Slack integration will certainly be an exciting development.

Getting over on-call anxiety

You've joined a company, or worked there a little while, and you've just now realised that you'll have to do on-call. You feel like you don't know much about how everything fits together, so how are you supposed to fix it at 2am when you get paged? So you're a little nervous. Understandable. Here are a few tips to help you become less nervous.

Experiencing Turbulence? Hypercare Helps Travel and Hospitality Firms Manage Sky-High Demand

Many sectors suffered during the COVID-19 pandemic, but the travel and hospitality industry was struck particularly hard as the world went into lockdown and governments urged us to stay home. According to the International Air Transport Association, global air passenger demand in 2020 was down a record 65.9% from the previous year, and the tourism industry saw an estimated loss of 100.8 million jobs worldwide.

How to Reduce Alert Fatigue: Preventing Noisy Alerts and Error Messages

Monitoring solutions are a vital component in managing an application’s environment. From the systems layer all the way up to the end user’s connection to the app, you want to find out how the platform is performing. Indicators like CPU, memory, the number of connections, and overall health help teams make informed decisions for guaranteeing uptime. Teams monitor metrics (short-term information) and logs (long-term information) mainly from a reactive perspective.

How Grafana helps organizations manage SLOs across multiple monitoring data sources

“SLO is a favorite word of SREs,” Grafana Labs Principal Software Engineer Björn “Beorn” Rabenstein said during his talk at KubeCon + CloudNativeCon NA 2019. “Of course, it’s also great for design decisions, to set the right goals, and to set alerting in the right way. It’s everything that is good.” So what happens when things go bad?

PD Summit21: Transforming Infrastructure Teams Through Observability

What is this "observability" thing that everyone is talking about? Observability allows you to navigate the dark unknowns with echolocation while others attempt to fly blindly without it. Are your dashboards all green, but you still have an issue brewing? Do you need instant feedback based on the Core Analysis loop? Are your engineers tired of waking up at 3 AM for the expected issues? Is there a lack of time for experimentation? Generate your own answers and create a meaningful course of action with observability.

PD Summit21: The Netflix Reliability Story: A Brief History of How We Evolved Resilience to Failure

In Netflix engineering, we’re driven by ensuring Netflix is there when you need it to be. We strive to provide a service that people love and can enjoy anytime, anywhere. An important foundation for bringing our customers joy is a strong focus on reliability that ensures Netflix will be available when they need it. In this talk, I’ll tell the story of how we've grown our reliability practices over time to meet the changing demands of microservices and distributed computing.

PD Summit21: Adopting and Maturing to Service Ownership with PagerDuty and Rundeck

Among the common goals of today's engineering and operations teams is to adopt a culture of service ownership: "You build it, you own it." As with many ancillary objectives to driving DevOps across an organization, this is easier said than done. Sometimes this is in small part due to the technology stack/architecture of a given company. But more often than not, this is because teams lack the human-to-technology mechanisms that allow for a culture of service ownership.