Incident.io

Why I like discussing action items in incident reviews

Are incident reviews about learning or tracking actions? This question has sparked debate in incident management circles, including in my recent panel at SEV0 and in Lorin Hochstein’s post. Should the goal of an incident review be learning, or should it focus on tracking actionable improvements? When is the right time to discuss actions, and are they picked up just to make us feel better? In my experience, learning from incidents and identifying actions are inseparable.

incident.io is best in class for momentum, relationships and enterprise adoption

Trust doesn’t just happen overnight. For us at incident.io, it’s been a journey—one that’s focused on people just as much as the product. From the start, we knew that building great incident management software wasn’t just about creating features and functionality. It was about building relationships, understanding our users, and truly being there for them when it matters most. Our focus has always been to help teams manage incidents better.

What does SLO stand for? A complete guide to Service Level Objectives (SLOs)

The world of tech is full of acronyms. SLOs are one of those that everyone talks about, but maybe not everyone fully gets. Whether you're nodding along in meetings or just hearing “SLO” for the first time, we’ve got you covered. In this post, we’ll break down what Service Level Objectives (SLOs) actually are, why they matter, and how they can help keep your systems (and your sanity) in check.

The ultimate guide to on-call schedules

An Ultimate Guide to on-call schedules? You might think this sounds overly grandiose for what’s essentially putting people into a list and rotating through them. But you’d be flat-out wrong. Getting your on-call setup right is as real and as important as it gets, and getting it wrong can lead to prolonged incidents, burnt-out employees, and a damaged company reputation.
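For reference, the naive version of "putting people into a list and rotating through them" can be sketched in a few lines. This is only an illustration of that baseline, not how any particular product schedules on-call; the function name `on_call_for`, the roster, and the seven-day shift length are all assumptions made for the example.

```python
from datetime import date


def on_call_for(day, roster, start, shift_days=7):
    """Return who is on call on `day` for a simple round-robin rotation.

    Hypothetical helper: `roster` is an ordered list of names, `start` is
    the first day of the first shift, and every shift lasts `shift_days`.
    """
    if not roster:
        raise ValueError("roster must not be empty")
    if day < start:
        raise ValueError("day is before the rotation starts")
    # Count how many whole shifts have passed, then wrap around the roster.
    shifts_elapsed = (day - start).days // shift_days
    return roster[shifts_elapsed % len(roster)]


roster = ["alice", "bob", "carol"]
rotation_start = date(2024, 1, 1)
print(on_call_for(date(2024, 1, 10), roster, rotation_start))
```

A sketch like this ignores everything that makes real schedules hard, such as time zones, overrides, holidays, and escalation when the first responder does not acknowledge.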

Data quality testing

Data quality testing is a subset of data observability. It is the process of evaluating data to ensure it meets the necessary standards of accuracy, consistency, completeness, and reliability before it is used in business operations or analytics. This involves validating data against predefined rules and criteria, such as checking for duplicates, verifying data formats, ensuring data integrity across systems, and confirming that all required fields are populated.
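As a rough illustration of the kinds of rules listed above (duplicates, formats, required fields), here is a minimal sketch using only the Python standard library. The record shape, the `check_records` function, the `REQUIRED_FIELDS` tuple, and the deliberately loose email pattern are all assumptions made for the example, not part of any specific data quality tool.

```python
import re
from collections import Counter

# Assumed rules for this example: every record needs an id and an email,
# ids must be unique, and emails must loosely look like name@domain.tld.
REQUIRED_FIELDS = ("id", "email")
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_records(records):
    """Return a list of (row_index, problem) tuples for a list of dicts."""
    problems = []
    id_counts = Counter(rec.get("id") for rec in records)
    for i, rec in enumerate(records):
        # Completeness: required fields must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if rec.get(field) in (None, ""):
                problems.append((i, f"missing {field}"))
        # Uniqueness: flag every row that shares its id with another row.
        rec_id = rec.get("id")
        if rec_id not in (None, "") and id_counts[rec_id] > 1:
            problems.append((i, f"duplicate id {rec_id}"))
        # Format: only checked when an email value is actually present.
        email = rec.get("email")
        if email and not EMAIL_PATTERN.match(email):
            problems.append((i, "bad email format"))
    return problems


sample = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": "b@example.com"},
    {"id": 2, "email": "nope"},
    {"id": 3, "email": ""},
]
for row, problem in check_records(sample):
    print(row, problem)
```

In practice, checks like these usually run before the data reaches downstream consumers, so a failing rule can block or flag a load rather than surface later in a report.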

Building On-call: Our observability strategy

At incident.io, we run an on-call product. Our customers need to be sure that when their systems go wrong, we’ll tell them about it—high availability is a core requirement for us. To achieve the level of reliability that’s essential to our customers, excellent observability (o11y) is one of the most important tools in our belt. When done right, observability improves your product experience from two angles.

Introducing: incident.io for Microsoft Teams

There’s a major outage. Support tickets are mounting. Everybody from engineering to legal is scrambling for information. You have more Teams notifications clamouring for attention than you do minutes to address them, and it’s hard to know where to begin. What comes next is a balancing act—mitigating the impact, updating colleagues, managing action items, or updating a status page that will be seen by millions.

Building On-call: Continually testing with smoke tests

With the release of On-call, our system’s reliability had to be solid from the outset. Our customers have high expectations of a paging product—and internally, we would not be comfortable with releasing something that we weren’t sure would perform under pressure. While our earlier product, Response, was the core of a customer’s incident response process after an incident was detected, we’re now the first notification an engineer gets when something’s wrong.

Redefining incident management: the power and pitfalls of AI

Like it or not, AI is having a monumental impact on our lives. Most of the products we engage with today have AI features and functionality, aimed at assisting or completely replacing the actions normally taken by humans. When it comes to incidents, we’re firm believers in accelerating human actions, and we believe the risks of over-automation far outweigh its benefits. In this live event we’ll dig a little deeper into why, as we cover the power and pitfalls of AI.