Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Observabilty for complex systems and related technologies.

Scary Things Happen in Production. Context Helps You Find Them.

Production is a rowdy place of chaos, especially at scale. When you have millions of requests per second flowing through your system, weird things are always happening. Outliers, unusual request patterns, spikes and pulses of traffic from unknown sources, port scanning…it’s all there. To the naked eye, it looks like noise. If you know what you are looking for…patterns emerge. The night sky: every dot is a request. Without intent, it's an undifferentiated field of light.

Smarter Alerts, Faster Root Cause, & Proactive IT Ops with SolarWinds AI Observability

Discover how AI is transforming IT operations with SolarWinds Observability. In this video, we showcase powerful new AI-driven features designed to help you detect issues faster, reduce alert noise, and stay ahead of performance problems across your entire stack. From applications and databases to networks, cloud infrastructure, and end-user experience SolarWinds AI delivers deep insights where it matters most.

Top Root Cause Analysis Tools Built for Runtime Context

Root cause analysis tools are designed to help engineering teams understand why failures happen in production and other remote environments. As modern systems become more distributed and input-dependent, many incidents cannot be reproduced outside live environments. The stakes are significant: high-impact IT outages cost organizations a median of $2 million per hour, with annual downtime costs reaching $76 million per organization.

Cribl Search Demo: Security Investigation

In this demo, Nate Zemanek , Staff Solutions Engineer, shows how Cribl Search runs fast investigations. As an open data platform, Cribl Search lets you pull data from multiple sources and query everything from a single pane of glass. You’ll see how to run fast queries with the new lakehouse engine, search historical data with a federated approach, and bring everything together for full context. Then, use Notebooks to collaborate and share findings across teams to understand what happened—faster.

How a Runtime Aware AI SRE Agent Transforms System Reliability

A runtime aware AI SRE extends existing AI SRE approaches by moving beyond telemetry correlation into runtime-validated reliability. While the majority of AI SRE tools accelerate incident triage using logs, metrics, and traces, they cannot confirm execution behavior if critical runtime signals were never captured. By generating on-demand evidence inside running services, AI SRES can eliminate slow redeploy cycles, ensuring your distributed systems remain resilient under real-world traffic conditions.

From Observability to Action: How Product Analytics Is Closing the Loop in Modern Operations

Over the past decade, observability has become a cornerstone of modern operations. Metrics, logs, and traces have given teams unprecedented visibility into how systems behave under real-world conditions. Infrastructure can be monitored in real time, incidents can be detected faster, and performance bottlenecks can be diagnosed with increasing precision. But for all its progress, observability still leaves an important question unanswered.

Leveraging Cognitive Diversity to Tackle System Complexity

Most engineering leaders today understand that diversity matters. They've built teams that reflect a range of backgrounds, functions, and experience levels. They run postmortems, retrospectives, and architecture reviews that bring multiple voices to the table. They believe, not unreasonably, that this variety of perspectives leads to better decisions. But there's a problem hiding inside that assumption that can undermine everything: who people are is a surprisingly poor predictor of how they think.

How OpenRouter and Grafana Cloud bring observability to LLM-powered applications

Chris Watts is Head of Enterprise Engineering at OpenRouter, building infrastructure for AI applications. Previously at Amazon and a startup founder. As large language models become core infrastructure for more and more applications, teams are discovering a familiar challenge in a new context: you can't improve what you can't see.

Making encrypted Java traffic observable with eBPF

Coroot's node agent uses eBPF to capture network traffic at the kernel level. It hooks into syscalls like read and write, reads the first bytes of each payload, and detects the protocol: HTTP, MySQL, PostgreSQL, Redis, Kafka, and others. This works for any language and any framework without touching application code. For encrypted traffic, we attach eBPF uprobes to TLS library functions like SSL_write and SSL_read in OpenSSL, crypto/tls in Go, and rustls in Rust.