Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Redundancy for IT resilience: The backup guide for a disaster-proof network

Around six years ago on a Wednesday morning, software professionals worldwide were startled by a tweet from GitLab stating that they had accidentally deleted their production data, causing their site to go offline. Unfortunately, at that point in time, the open-source code repository giant had no idea that it would take them another 36 hours to restore their systems only to learn that 5,000 projects and 700 new user accounts were affected while they were fixing the outage.

The Guide to SRE Principles

Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates the functions of an operations team via software systems. The main purpose of SRE is to encourage the deployment and proper maintenance of large-scale systems.

The 7 IT Automations for Highly Effective Organization: IT incident Remediation | Web App Down

No organization is immune to outages, unplanned interruptions, or quality reduction of normal service. But having a streamlined response plan can ensure these situations are dealt with more effectively to restore normalcy. In a world where IT efficiency is being measured by mean time to resolution, triaging and remediating alarms can directly impact the business in a positive way.

Komodor + Squadcast Integration: Simplifying Kubernetes Monitoring & Incident Response

Kubernetes (K8s) is a powerful tool for container orchestration, but it presents unique challenges when it comes to monitoring and incident response. Managing K8s requires 360º visibility into your environment, proactive health monitoring, along with right incident management, and suppression capabilities. In this article, we'll explore the benefits of integrating Squadcast with Komodor, two powerful tools that can help you overcome these challenges.

How metrics can make or break your IT operations strategy

IT people know that data is king, especially in optimizing IT operations. However, figuring out which metrics to collect and how to collect them can be challenging. IT teams have to factor in what IT directors, team managers, and the people overseeing operations want, what they’re concerned about, and what they consider important.

Managing Incidents in Energy and Utility Companies

Several challenges impact customers and operations of utilities and energy companies, including aging infrastructure, cybersecurity threats, inclement weather, operational failures and transmission interruptions. These challenges can cause prolonged service disruptions, potentially leading to customer attrition and irreversible damage to businesses. Responding quickly and efficiently to incidents is critical to minimize damages or contain potentially dangerous scenarios.

What are you learning from your incidents?

Think about this—what was the last incident that challenged you? Did you learn anything from it? It will be shocking to no one to hear that we deal with our fair share of incidents. These run the gamut from tiny bugs to significant outages (thankfully, the latter happening only very rarely 😮‍💨). Either way, we always take the time to learn from them in some way. This might look like changes to our response processes or revisiting systems we’re using.

Top 5 Managed Detection and Response Services and How to Choose

Managed Detection and Response (MDR) is an approach to cybersecurity that combines advanced technologies, skilled analysts, and a proactive response process to detect, investigate, and remediate cyber threats. MDR is typically delivered as a service by a third-party provider and includes a range of security capabilities, such as threat intelligence, behavior analysis, anomaly detection, and incident response.