Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

On-Call Management

On-call management is a process for managing after-hours support. Cloud on-call scheduling tools allow self-service and mobile access. Multi-channel communications (email, SMS, phone, mobile push notifications, and chat) ensure that the alert gets through. AlertOps sends rich alerts, so the on-call support engineer has all the information they need.

Alert Escalation

An alert escalation can be triggered when the primary support engineer does not respond to or acknowledge an alert within the escalation policy time limit. Keeping managers and stakeholders informed during an incident can help improve confidence in the support team. Once an escalation policy has been established, alert escalations can be automated to ensure consistency.
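The escalation trigger described above can be sketched in a few lines. This is a minimal illustration, not the behavior of any particular product: the `Alert` class, the `should_escalate` function, and the 15-minute limit are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    """Hypothetical alert record used only for this sketch."""
    triggered_at: datetime
    acknowledged_at: Optional[datetime] = None


def should_escalate(alert: Alert, now: datetime, time_limit: timedelta) -> bool:
    """Return True when the alert is still unacknowledged after the time limit."""
    if alert.acknowledged_at is not None:
        # The primary engineer has responded, so nothing to escalate.
        return False
    return now - alert.triggered_at >= time_limit


# Assumed policy: escalate after 15 minutes without acknowledgement.
start = datetime(2024, 1, 1, 2, 0)
limit = timedelta(minutes=15)
alert = Alert(triggered_at=start)

print(should_escalate(alert, start + timedelta(minutes=5), limit))   # False
print(should_escalate(alert, start + timedelta(minutes=20), limit))  # True
```

Running a check like this on a schedule is one way to automate escalations so that every alert is handled against the same policy.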

Why an Incident Commander is crucial to ITOps

It may be counterintuitive to tackle a problem without knowing exactly what the problem is, but an incident commander often does just that. In fact, Rob Schnepp, founding partner at Blackrock 3, an Alameda, California-based incident management consulting group, says identifying the root cause of an incident is typically secondary to addressing the symptoms.

Take a deep dive into Incident Intelligence

ITOps professionals know that their AI and automation goals can only be achieved with high-quality data. How can you get good-quality data? Incident Intelligence. In this on-demand session from Pandapalooza, our Group Product Manager, Orr Ganani, joined our Regional VP of Professional Services Sales, Jordan Gamble, to discuss Incident Intelligence and its benefits. Read on to learn more about Incident Intelligence from this webinar.

What is SOC 2 Compliance? | A Guide to SOC 2 Certification

We’re excited to announce that Blameless is officially SOC 2 compliant! This is part of our larger efforts to assure all the users of Blameless and visitors to our site that we’re meeting and exceeding all of your privacy and security needs. Learn more by visiting our security page! When choosing a service, it’s important to have trust in the provider – especially for something as important as your incident management.

Squadcast + Auvik Integration: Routing alerts made easy

Auvik is cloud-based network management software that gives you instant insight into the networks you manage and automates complex and time-consuming network tasks. If you use Auvik for network management, you can integrate it with Squadcast, an end-to-end incident response tool, to route detailed alerts from Auvik to the right users in Squadcast. This blog is a step-by-step guide that will help you set up the Squadcast-Auvik integration.
Sponsored Post

Best practices when managing an outage

There's never a good time for a service outage. From the moment it hits, it starts affecting your stakeholders: essential daily tasks are curtailed while your team enters emergency response mode. The surest way to limit the damage and recover quickly is to follow a set of best practices, and it's far better to plan them before an outage occurs. If you wait until one happens to start developing a response, you will be far behind where you need to be for a quick resolution. This guide will help you create a set of best practices for your organization and work toward faster, more effective responses.

Implementing SLAs, SLIs, and SLOs: A guide to monitoring best practices

Implementing SLAs, SLIs, and SLOs is essential for effective monitoring and maintaining optimal system performance. As companies grow, they may add a significant number of KPIs that burden their IT assets, leading to system sluggishness and employee complaints. Developers must balance business needs with IT processes, and SLAs, SLIs, and SLOs can help them achieve this balance.

Top 6 Tips for Improving MTTx

In our research for the inaugural State of Availability Report, we asked 1,900 engineers about mean time to detect (MTTD) and mean time to recovery (MTTR), two leading incident management Key Performance Indicators (KPIs) strongly associated with availability. We learned that fewer than 15% of respondents are tracking their MTTD. It takes twice as long to discover an issue as it does to resolve it.
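For readers who do not yet track these metrics, the sketch below shows one way to compute them from incident timestamps. It rests on assumptions: the `Incident` class and function names are hypothetical, the sample data is invented for illustration and is not taken from the report, and MTTR is measured here from detection to resolution (some teams measure it from the start of the incident instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record used only for this sketch."""
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_detect(incidents: List[Incident]) -> timedelta:
    """Average time between an incident starting and being detected."""
    seconds = [(i.detected_at - i.started_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average time between detection and resolution (assumed definition)."""
    seconds = [(i.resolved_at - i.detected_at).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


# Invented sample data, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 20), datetime(2024, 1, 1, 10, 30)),
    Incident(datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 40), datetime(2024, 1, 2, 15, 0)),
]

print("MTTD:", mean_time_to_detect(incidents))    # MTTD: 0:30:00
print("MTTR:", mean_time_to_recovery(incidents))  # MTTR: 0:15:00
```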