
The latest news and information on incident management, on-call, incident response, and related technologies.

Keep it Simple

Mar 19, 2021 By BigPanda In BigPanda

View Video


Phoenix Project: Sometimes you have to look back to look forward

Mar 19, 2021 By Paul Szymczyk In BigPanda

It has been eight years since The Phoenix Project was published, and a lot has changed since then! I started to think about what we’ve learned in that time. It starts with the theory of constraints. I still see it all the time: organizations take merely temporary actions, putting out fires without solving the underlying causes of those fires.

Read Post


Mattermost Incident Collaboration now includes improved communication, automation, and history for incident response teams

Mar 18, 2021 By Ian Tao In Mattermost

Teams are always looking for a speed advantage, and that comes from planning, crisp execution, and teamwork. To this end, we’re excited to release new enhancements to Incident Collaboration to help make life easier for DevOps teams during incident response. The Mattermost platform includes built-in Incident Playbooks with predefined response plans and task lists. Playbooks can be customized to your environment and specific use cases.

Read Post


Say goodbye to guessing: Introducing Automatic Incident Triage by BigPanda

Mar 18, 2021 By Mohan Kompella In BigPanda

Low MTTR is the much-desired nirvana-state in IT Operations. One of the most painful parts of the incident management lifecycle, which prevents the achievement of this nirvana, is triage: the time it takes first incident responders to determine the next action when facing a barrage of IT incidents. Why?

Read Post


PagerDuty for AIOps & Automation: Innovate & Automate Faster

Mar 17, 2021 By PagerDuty In PagerDuty

We continue to improve our AIOps and machine learning capabilities to help customers reduce noise, quickly identify root cause, and automate the resolution of critical, business-impacting issues. This will help organizations further increase cost savings, reduce mean time to resolution (MTTR), and preserve people hours. The following capabilities empower responders to gain control, deliver critical context for faster root cause identification, assess impact, and automate actions with minimal configuration.

View Video


PagerDuty Enterprise Collab & Communication, Cloud Migration, & Customer Service: New Integrations

Mar 17, 2021 By PagerDuty In PagerDuty

We continue expanding our ecosystem of native integrations to help teams bridge the communication gap between customer service and engineering teams, embrace full-service ownership, and better manage cloud migration initiatives.

View Video


BigPanda Automatic Incident Triage

Mar 17, 2021 By BigPanda In BigPanda

IT incidents often lack the critical business context necessary to conduct triage, resulting in long incident management lifecycles and high MTTR. Automatic Incident Triage significantly simplifies and shortens triage by automatically adding actionable business context to incidents.

View Video


IT Incident Response is Improved with a Corporate Status Page

Mar 17, 2021 By StatusCast In StatusCast

To understand the impact that stovepipes have on incident response, one need look no further than the 9/11 terrorist attacks that occurred in the United States. The CIA, DoD, and FBI all knew about the Al Qaeda terror threats before the planes hit the World Trade Center, but the 9/11 Commission found that a lack of data and intelligence sharing among the agencies limited each agency’s understanding of the looming terrorist threat, thereby limiting their incident response.

Read Post


Introduction to on-call schedules

Mar 16, 2021 By Pruthvi In Spike

An on-call schedule tells you and everyone on the team who will be the first responder when an issue happens in production. The on-call team member is responsible for investigating the issue, either fixing it herself or bringing in other people who can help fix it. Having an on-call schedule is important for building reliable systems because making someone responsible for production issues ensures that those issues are not ignored.

Read Post


How to deal with alert noise

Mar 16, 2021 By Pruthvi In Spike

Adding alerts across your monitoring tools is a proactive approach to reliability. But too many alerts can become counterproductive, because team members will start ignoring alerts or remove the alerting altogether. That is why you need a systematic approach to adding alerts and dealing with them.

Read Post