
True reliability takes the whole team

Reliability takes the whole team working together. Full transcript: If you really want to get good at measuring your reliability, then you have to work together as a team. Once your software engineering organization has decided, "We're gonna test these applications to make sure that they have redundancy, availability, resilience," just stick to that framework that you come up with as a team.

Fix issues faster with Recommended Remediations

You’ve successfully run a Fault Injection test and uncovered a new failure mode before it impacted customers. And the failure could have taken down your whole system if it had happened in production. Now what? Since this is a potential P1 outage, you absolutely need to address the issue, but that’s going to take some time as you dig through the service to track down the problem. Unfortunately, this is a common conflict.

High Availability by Design: WhatsUp Gold Strategic Shift from Failover

As IT environments grow more distributed and resilient, the Progress WhatsUp Gold network monitoring solution is evolving to meet the moment. Starting in early 2026, Progress will officially retire the legacy Failover Manager and usher in a new era of high availability (HA) by design. This modern, scalable approach aligns with today’s best practices in infrastructure.

Top AI Prompts for Engineering Leaders using the Cortex MCP

AI assistants have transformed how developers work. And now, coupled with the Cortex MCP that connects AI assistants directly to live service data, ownership records, and organizational standards, developers can get accurate, context-rich answers about their services and standards right in their IDE. But what about engineering leaders?! Your opportunities with AI assistants extend far beyond code generation.

How to Prove DNS Monitoring ROI to Clients (Without Getting Technical)

Most clients don’t care how DNS works—until it breaks. But as an MSP, you know the damage a single DNS misconfiguration or unnoticed change can cause. So how do you prove the ROI of DNS monitoring to clients who don't speak in TTLs or CNAMEs? Here’s how to bridge the gap between technical benefits and business value—so your clients understand exactly why they’re paying for DNS protection.

A complete security view for every Ubuntu LTS VM on Azure

Azure’s Update Manager now shows missing Ubuntu Pro updates for all Ubuntu Long-Term Support (LTS) releases: 18.04, 20.04, 22.04 and 24.04. The feature was first introduced for only 18.04 during its move to Expanded Security Maintenance. With this addition, Azure highlights where Ubuntu LTS instances would benefit from Expanded Security Maintenance updates if the administrator attaches an Ubuntu Pro license, even for instances running more recent Ubuntu releases.

IT Alerting: Everything You Need to Know

Behind every reliable service is a team of people watching for problems. But they don’t stare at screens all day. They rely on IT alerting systems. An IT alerting system tells you when something is wrong. It finds problems fast, so your team can fix them before your business or customers are affected. This article will explain everything you need to know about IT alerting. You’ll learn what it is, why you need it, how to set it up, and which tools work best.

Grafana Mimir: 3 reasons to run the TSDB for Prometheus on bare metal

Wilfried Roset is an engineering manager who leads an SRE team, and he is a Grafana Champion. Wilfried currently works at OVHcloud, where he focuses on prioritizing sustainability, resilience, and industrialization to guarantee customer satisfaction. Whether it’s for efficient resource allocation, flexibility, high availability, or scalability, it makes a lot of sense to run Grafana Mimir on Kubernetes, but it’s not the only way to deploy Mimir.

Instrument your Azure Container Apps workloads with the new Datadog Agent sidecar

Modern application development is evolving rapidly, with serverless containers and microservices becoming the standard for scalable, resilient architectures. Azure Container Apps is at the forefront of this movement, enabling developers to deploy containerized applications without having to manage infrastructure.