Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Our Opsgenie integration is now available

When we detect a problem with your site we can notify you via mail, a slack message, a webhook, or any of our other notifications channels. For most of our users this is enough, but those work in larger teams often need more flexibility. Today, we are launching our Opsgenie integration, a modern incident management platform.

Squadcast's Improved Slack (V2) Integration | Better Collaboration & Incident Management | Squadcast

This video will give you an overview of the latest improvements supported by the Squadcast-Slack integration, which we hope will help in better collaboration and Incident Management.

Why Incident Management is an Essential Part of Risk Management

In any operation or activity, unforeseen happenings can derail progress. The job of a good manager is to try their best to make the hitherto unforeseen visible and planned for. It’s all too easy to find yourself reacting to occurrences that can throw you and the company into turmoil, with frantic fixing on the back foot being the result. The best managers can make it look like they don’t do much.

See Global Event Orchestration End-to-End

Global Event Orchestration’s powerful decision engine enriches events, controls their routing, and triggers self-healing actions based on event data. Teams can use this functionality across any or all services within PagerDuty. This feature is a continued investment in Event Orchestration, demonstrating PagerDuty’s commitment to providing customers with best-in-class automation capabilities. Check out this live demo from Principal Product Manager Frank Emery.

Assembly time is where you have the most control of an incident

The FDNY EMS Command responds to more than 4,000 calls per day. They range from car accidents to building fires to cats stuck in trees, and responses vary accordingly. Sometimes they might take hours, sometimes they take just a few minutes. With such unpredictable conditions, the FDNY focuses on improving what they call “response time.” That’s the amount of time between a 911 call being made and emergency responders arriving on the scene. This might sound familiar.

Trust shouldn't start at zero

How often have you heard the phrase “trust is earned” in life? While well-meaning, I think this can actually lead to some strange behaviour at work, especially when you’re on a fast growing team. Startups experience a lot of chaos and unknowns your teams need to navigate, so it’s vital to know you can trust the people around you. As you grow, how you set expectations around trust as people join your team can impact your ability to hire, onboard, ship and ultimately, survive.

Debugging Kubernetes with Automated Runbooks & Ephemeral Containers

In our previous blog, we discussed the difficulty in capturing all relevant diagnostics during an incident before a “band-aid” fix is applied. The most common, concrete example of this is an application running in a container and the container is redeployed—perhaps to a prior version or the same version—simply to solve the immediate issue.

10 Mistakes to avoid when framing your IT Incident Management Strategy

An IT incident is an unplanned disruption that negatively impacts an IT service. As the importance of IT to the business has increased, the impact of IT incidents has become greater. IT incidents can result in revenue loss, loss of employee productivity, SLA financial penalties, government fines, and more. An effective IT incident management strategy is now essential in every organization. For a business like Amazon whose entire business relies on IT, a single second of slowness can cost over $15,000.

How to get started with incident management metrics

Tracking incident metrics can help you discover patterns in the causes and costs of incidents and help you understand brittle parts of your organization. We've seen them help teams zero in on things like: But it can be intimidating to get started. Do you really need metrics if you're a small team or just beginning to formalize your incident management program? I say yes. The key is to start with something manageable and grow.

How Abbott transformed its incident management process with Workflow Automation

Eliminating errors and streamlining the incident management process are top priorities for many ITOps, NOC, SRE, and DevOps teams. With organizations using multiple tools in their IT stack, manually finding the right information at the right time becomes crucial during incident triage. By automating tasks and workflows, businesses can eliminate manual tasks that are time-consuming, repetitive, and prone to mistakes.