The latest news and information on incident management, on-call, incident response, and related technologies.

IT Incident Management - What is it and how to do it?

Are you tired of dealing with IT incidents that seem to pop up at the worst possible times? Do you find yourself struggling to keep track of all the moving pieces involved in resolving incidents? If so, it’s time to revitalize your incident management strategy. In this article, we’ll explore the key pillars of incident process management, best practices, and how technology can help streamline your process.

Which Software Stack is best for IT service management?

IT incident management is a hot topic and more important than ever in the digital age. Companies increasingly rely on technology to maintain their operations, and any downtime can have catastrophic consequences: on average, one minute of downtime costs $9,000. An efficient incident management system, tailored to the organization, is therefore essential. However, incident management involves many components and options, so which software stack should you use?
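To put that average in perspective, here is a rough sketch of the arithmetic. The $9,000-per-minute rate is the figure quoted above; the function name and the 45-minute outage are hypothetical, and real costs vary widely by organization.

```python
# Average cost per minute of downtime, as quoted in the teaser above.
COST_PER_MINUTE_USD = 9_000


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Estimate the cost of an outage as duration times a flat per-minute rate."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Hypothetical 45-minute outage at the quoted average rate.
print(f"${downtime_cost(45):,}")
```

A flat per-minute rate is a simplification; it ignores effects such as peak-hour traffic or contractual penalties.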

On-call management on the go: Introducing the Grafana OnCall mobile app

We’ve all been there: sleeping peacefully in bed over the weekend, finally getting rest after a long week at your computer making AI-generated memes, er, writing code. Then at 3 a.m., your phone makes an ungodly sound, and you wake up startled, frazzled, and confused. When you finally type in your passcode to unlock your phone (because facial recognition doesn’t register your bleary-eyed, squinty face), you see an alert, and all dreams of sleep are over.

Streamline Incident Response with Komodor and Squadcast

As Kubernetes grows in popularity as the container orchestration platform powering the microservices revolution, so does the complexity of managing, monitoring, and responding to incidents at scale. Challenges in real production environments include gaining full visibility into the health of your clusters and environments, alongside real-time incident management and response.

Using DORA metrics Mean Lead Time for Changes to deliver iterations faster

Raise your hand if you like shipping changes quickly. (Yes, let's assume that everything you're shipping has value and isn't a vanity project.) Chances are, you, the person reading this now, agreed with the above. When you start on a project, big or small, you want to keep any changes moving along and avoid getting stuck. The less time between the beginning and end of a project, the faster you can shift your focus to other things.
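Lead Time for Changes is commonly measured as the time from a code commit to that change running in production. As a minimal sketch of how a mean could be computed under that assumption, the function name and the sample timestamps below are hypothetical and not taken from the article:

```python
from datetime import datetime, timedelta


def mean_lead_time(changes):
    """Average the commit-to-deploy duration over (committed_at, deployed_at) pairs."""
    if not changes:
        raise ValueError("at least one change is required")
    # Sum the per-change durations, starting from a zero timedelta.
    total = sum((deployed - committed for committed, deployed in changes), timedelta())
    return total / len(changes)


# Hypothetical sample data: three changes with lead times of 6, 24, and 12 hours.
sample_changes = [
    (datetime(2023, 1, 2, 9, 0), datetime(2023, 1, 2, 15, 0)),
    (datetime(2023, 1, 3, 10, 0), datetime(2023, 1, 4, 10, 0)),
    (datetime(2023, 1, 5, 8, 0), datetime(2023, 1, 5, 20, 0)),
]

print(mean_lead_time(sample_changes))
```

Tracking this value over time, rather than a single reading, is what shows whether iterations are actually getting faster.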

AWS CloudTrail vs CloudWatch: Features & Instructions

In today’s digital world, cloud computing is necessary for businesses of all types and sizes, and Amazon Web Services (AWS) is undoubtedly the most popular cloud computing service provider. AWS provides a vast array of services, including CloudWatch and CloudTrail, that can monitor and log events in AWS resources. This article will compare AWS CloudWatch and CloudTrail, looking at their features, use cases, and technical considerations.

AIOps and Automation: A Conversation Featuring Guest Speaker Carlos Casanova, Forrester Principal Analyst

At the beginning of 2023, I had a great conversation with Carlos Casanova, a Forrester Principal Analyst, in a webinar about how AIOps can help drive successful organizational change. In that conversation, Carlos divided the AIOps market into two camps: technology-centric (primarily APM/Observability players) and process-centric. PagerDuty is a process-centric solution leveraging multiple technologies.

Featured Post

After action reports: post-incident investigations

When something unexpected happens within the digital operations remit, software engineers put on their deerstalker hats and wax their fussy little moustaches, metaphorically speaking. It's their time to play detective as they unravel the evidence and create the reports to explain the recent IT incident. But unlike with a hat-wearing Sherlock Holmes or a hirsute Hercule Poirot, cliff-hanger endings are not encouraged in software engineering.

Understanding Kubernetes Logs and Using Them to Improve Cluster Resilience

In the complex world of Kubernetes, logs serve as the backbone of effective monitoring, debugging, and issue diagnosis. They provide indispensable insights into the behavior and performance of individual components within a Kubernetes cluster, such as containers, nodes, and services.