Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

New Custom Dashboards: Metrics, Logs, Live Commands, and More in a Single View

Custom dashboards in Netdata have always let you pull charts together on the fly into a single view. That’s useful, but it’s also limited. In practice, when you’re running an incident or reviewing a service, you don’t just want charts. You want to see the output of top alongside your CPU metrics. You want slow query logs next to your database latency charts.

Claude outage April 2026: what happened and how it was detected early

On April 9, 2026, Claude experienced a widespread but inconsistent outage that left many users unable to access or interact with the service. StatusGator detected the issue early and sent an Early Warning Signal 59 minutes before the provider officially acknowledged the outage. This incident highlights how early detection can provide critical lead time when official status pages lag behind real user impact.

Sponsored Post

HIMSS 2026: The Future of Healthcare IT Operations Is Increasingly Autonomous

HIMSS 2026 made something clear: healthcare is no longer discussing digital transformation as a future-state goal. It is now dealing with the operational reality of having already become deeply digital. Conversations around HIMSS 2026 consistently pointed back to the same pressure points: AI adoption, cyber resilience, interoperability, and infrastructure modernization. Together, they reflect a healthcare environment managing more systems, more dependencies, and more risk than ever before.

How We Do Support at Scout

Today, we are taking a break from your regularly scheduled technical programming to talk about support. Here at Scout, we consider support one of our differentiators, and even as we adopt AI as a human multiplier behind the scenes, we are committed to keeping it real on the human-interaction side. It will be a long time, if ever, before you reach out to us and get a response from an AI agent. Would it be cheaper? Sure, but it isn’t up to our standards, and we won’t compromise on that.

The Best SKILL.md Is the One You Never Update - Meet Checkly's CLI

Most agent skills are static — frozen documentation snapshots that go stale the moment APIs change or flags get deprecated. Checkly does it differently. Our SKILL.md is just 100 lines of CLI pointers. No baked-in docs. Your coding agent learns what it needs, when it needs it, straight from the Checkly CLI.

Alert Acknowledgement: Mark It as Seen, Keep Working

If you’ve ever opened the alerts tab during a busy period, you know the problem. There are alerts you’ve already looked at, alerts someone on your team is handling, and alerts that fired on a known issue that’s being worked on. They all sit together in the same list alongside the new ones you haven’t seen yet.

Tech Talk | AI Agents in O11y Cloud

Transform reactive incident response with Splunk’s troubleshooting agents, designed to drastically reduce mean time to identify and resolve issues. This session demonstrates how a multi-agent approach empowers teams of all skill levels to pinpoint root causes, prioritize issues by business impact, and prevent future outages. Tech Talk sessions offer insightful and valuable deep-dives for any technical practitioner.

The Runbook Problem: How AURA Documents What Teams Don't Have Time to Write

Runbooks are rarely missing because teams don't value them. They're usually missing because incident response, follow-up, and platform work compete for the same limited time. By the time an issue is resolved, the knowledge is fresh, but the window to document it is already closing. That gap creates familiar failure modes: over-reliance on senior engineers, slower handoffs, and less confidence for whoever is on call next.