Operations | Monitoring | ITSM | DevOps | Cloud

The latest news and information on monitoring for websites, applications, APIs, infrastructure, and other technologies.

Confessions of a software engineer who enjoyed being paged at 5am

It’s 5:14am, and I wake up to the squawking geese sound of my PagerDuty alert (anyone else have this sound? No?). I’m four months into working for my new team as a junior software engineer, and this is my first time being paged in the middle of the night. Most software engineers probably dread this moment, but I kind of love it. Agile ceremonies and Jira tickets suddenly don’t matter, and you’re fully focussed on stopping a customer-impacting fire.

Centrally set up and scale monitoring of your infrastructure and apps with Datadog Fleet Automation

Setting up and scaling observability across large, distributed environments often requires platform and SRE teams to coordinate access to infrastructure hosts and switch between configuration management tools and product-specific documentation. These tasks increase setup time and create delays in establishing visibility of critical services in Datadog. As teams expand their infrastructure, they need to coordinate Datadog configuration changes in a consistent and auditable way.

Python memory profiling: Common pitfalls and how to avoid them

Continuous profiling has established itself as a core observability practice, so much so that we’ve referred to it as the fourth pillar of observability. But despite the capabilities and growing adoption of continuous profiling, it can still be confusing to approach profiling as a newcomer and to apply it correctly to different troubleshooting scenarios.

Day 2 with Cilium: Small configurations that keep large clusters boring

Operating Cilium at a small scale is straightforward. You install the Helm chart, choose a routing mode, and apply a few network policies. Day 1 is about getting packets to flow. Day 2 is about keeping them boring. At Datadog, we run Cilium across hundreds of Kubernetes clusters, tens of thousands of nodes, and hundreds of thousands of pods in multiple clouds. When operating at this scale, small configuration choices stop being minor details and start becoming risk multipliers.

Text-to-Alert: Generating Netdata Alerts from Natural Language

Netdata has an incredibly powerful alerting engine, but that power can be a double-edged sword: you can build highly specific, intelligent alerts, yet mastering the syntax can feel like learning a new language. We’ve heard this from so many of you. You tell us that configuring alerts is often the steepest part of the learning curve, a task that falls to the one “Netdata expert” on the team who has spent the time digging through the documentation.

A Year in Internet Analysis: 2025

This year-end wrap-up covers topics from BGP security (including ASPA and excessive AS-SETs) and geopolitics (Ukraine’s IPv4 exodus, the Iran internet shutdown, and Red Sea cable cuts) to the year’s most significant outages (TikTok, the Spain/Portugal blackout, and cloud failures at AWS, Azure, and Cloudflare). We also explore Starlink’s new Community Gateways and revisit the evolving landscape of AS ranking and OTT service tracking.

Debug Faster with Chrome + Rollbar Debugging Assistant

Context switching is one of the biggest hidden productivity killers in debugging. Jumping between multiple open browser tabs slows momentum and increases cognitive load, especially when you’re trying to diagnose an issue under pressure. Google Chrome’s new split-screen feature, paired with Rollbar Debugging Assistant, enables a faster, more focused way to troubleshoot errors without constantly losing your place.

The Observability Stack is Collapsing: Why Context-First Data is the Only Path to AI-Powered Root Cause Analysis

By Bill Balnave, VP of Customer Success at Mezmo

The core promise of modern observability is simple: cut Mean Time To Resolution (MTTR). Yet, despite a boom in tooling and investment over the last four years, the data tells a sobering story: our industry is actually getting worse at finding and resolving issues. Dashboards, once our trusted guide, have become the starting point for a chaotic "dashboard hunt" that rarely leads to the definitive root cause.

Site24x7's Kubernetes monitoring | Proactive, scalable, AI-powered

Kubernetes drives modern cloud-native applications, but its distributed nature creates visibility and performance challenges at scale. In this video, discover how Site24x7 provides real-time monitoring, AI-powered anomaly detection, and scalability for Kubernetes environments, helping you to proactively manage resources and resolve issues faster. Whether you're running a single Kubernetes cluster or managing multiple environments, Site24x7 helps you ensure peak performance and faster decision-making with minimal manual intervention.