Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Safeguarding Operations: A Comprehensive Guide to Disaster Recovery and Business Continuity for Data Center Managers

In the dynamic world of data center operations, preparedness is key. This blog serves as a comprehensive guide for data center operations managers, exploring the critical aspects of disaster recovery (DR) and business continuity (BC) planning. Learn how to fortify your data center against unforeseen events and ensure seamless operations even in the face of adversity.

The Debrief: Building AI-Related Incidents

Recently we went live with one of our biggest product launches to date: AI. This product was unique in that it was broken up into four smaller projects. So naturally, most folks might be wondering: what were the biggest differences between these projects, and what went into actually building each of these features? In this episode, you'll hear from Rob and Isaac, both Product Engineers who played a critical role in building related incidents, to get a peek behind the curtain.

APAC Retrospective: Learnings from a Year of Tech Outages, Restore: Repair vs Root Cause

As our exploration of 2023 continues from the third part of our blog series, Dismantling Knowledge Silos, one undeniable fact persists: incidents are an unavoidable reality for organisations, irrespective of their industry or size. Recent APAC trends show that regulatory bodies are cracking down harder on large corporations for poor service delivery, imposing harsh penalties for the negative consequences that follow.

Finding relationships in your data with embeddings

With the world still working out the limits of LLMs and ever more powerful models being released each month, it’s a little hard to know where to begin. Whether it’s summarising and generating text, building a useful chat assistant, or comparing the relatedness of strings with embeddings, almost all of this can now be done via a few simple API calls. It has never been easier to incorporate these new technologies into your own product.
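As a rough illustration of what "comparing the relatedness of strings with embeddings" can look like, here is a minimal Python sketch. It assumes each string has already been turned into a vector by some embedding API. The vectors below are hand-made toy values, and the helper names `cosine_similarity` and `most_related` are hypothetical, not taken from the post.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (assumption:
# a real embedding model would return far longer vectors).
embeddings = {
    "database outage": [0.9, 0.1, 0.0],
    "db connection failures": [0.8, 0.2, 0.1],
    "office party": [0.0, 0.1, 0.9],
}


def most_related(query, vectors):
    """Return the key whose vector is closest to the query's vector."""
    candidates = [key for key in vectors if key != query]
    return max(
        candidates,
        key=lambda key: cosine_similarity(vectors[query], vectors[key]),
    )


if __name__ == "__main__":
    print(most_related("database outage", embeddings))
```

Cosine similarity is a common choice here because it compares the direction of two vectors and ignores their length, so strings on the same topic tend to score close to 1.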

5 Cloud Outages Tracker Tools To Monitor Vendors in 2024

Whether you’re a business owner, a tech enthusiast, or simply a user who relies on cloud services for daily tasks, a cloud outage tracker can be a useful tool. It informs you of downtime, degraded performance, and maintenance affecting the services that modern businesses rely on. Here’s a list of cloud outage tracker tools that can help you prepare for and mitigate the effects of inevitable disruptions in the cloud.

Building a GPT-style Assistant for historical incident analysis

Like most things, our AI Assistant started out as an idea. One of our data scientists, Ed, was working with our customers to improve our existing insights. The most common theme that kept surfacing was the wide range of use cases that our customers wanted to use insights for. Using this feedback as our inspiration, we came up with the idea of a natural language assistant that you can use to explore your incident data.

The Debrief: incident.io, say hello to AI

This week was a particularly exciting one for us at incident.io. We launched not one, not two, but four AI-powered features to help folks get the most out of their incidents. In this episode of The Debrief, we sit down with Ed Dean, Product Analyst, and Charlie Revett, Product Engineer, to talk through all of these features and discuss how they're already making a measurable impact.

Terraform Time | Distribute PagerDuty config utilising Terraform Remote State

We'll explore how to distribute PagerDuty configuration between multiple repositories using the Terraform Remote State feature. The code written during this Terraform Time episode is available in the following GitHub repository.