Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Observabilty for complex systems and related technologies.

Innovation Week Day 2: Observability for AI, and Observability With AI

AI is reshaping the SDLC in two directions at once. AI-generated code is shipping faster and with less human supervision than ever before, while agents and LLMs are running directly in production, where they behave very differently from traditional software: non-deterministic, with a wider blast radius than any single function or component, with no stack trace to catch when something goes wrong.

Honeycomb Innovation Week: Observability With AI With Kale and Taylor

Watch this video to see the re-imagined Canvas in action, where auto-investigation has already ranked your hypotheses before you open the tab, multiplayer agents build on each other's work in real time, and a custom skill encoding your team's own runbook can reprioritize the entire incident before you've had your morning coffee.

Innovation Week Day 1: The SDLC Is Collapsing, and Observability Has Never Mattered More

The software development lifecycle is collapsing. The multi-stage pipeline that defined how software got built and shipped for decades is compressing into rapid loops of intent and validation, with agents now part of the teams building and running it. Day 1 of Innovation Week was about what that shift means for how software gets validated, where observability fits, and the problems that have always been hard but are now genuinely urgent.

Security Integrations in Observability Self-Hosted

Integrating security data with observability data provides a comprehensive view for better threat detection and response. Security observability helps connect the dots between seemingly innocent events that, when correlated, reveal complex attack patterns. SolarWinds security products integrate into observability self-hosted, including Security Event Manager for log data and event correlation, Access Rights Management for identifying potential attack vectors, configuration management for compliance monitoring, and Patch Manager for tracking critical updates.

Making Semantic Conventions Work for You With OpenTelemetry Weaver

Your dataset has hundreds of attributes. Some are self-explanatory: http.response.status_code, server.address. Others are not: meta.refinery.reason, dataset.slug, sli.latency_target_ms. If you don't know what an attribute means, you can't write a good query. And if an AI agent doesn't know what it means, it guesses.

Why Alert Fatigue Solutions Still Miss the Root Cause

Alert fatigue solutions have never been better, but on-call engineers are still burning out. Threshold tuning, AI triage, and alert correlation reduce the noise, but every alert that clears filtering lands with the same incomplete telemetry and triggers the same manual investigation cycle. This post explains why the evidence gap survives every fix, and how runtime context changes that.

Multi-tiered Observability: A Practical Way to Handle Diverse Workloads

Observability in large companies is rarely one-size-fits-all. The VictoriaMetrics topologies guide shows why different deployment patterns are needed as scale, isolation, and reliability requirements grow. Different workloads require different trade-offs: some need long retention for audits and trend analysis, while others need higher resolution for debugging. Business-critical systems also demand dependable alerting and high availability, often with several 9s of reliability.

Why Blast Radius Analysis Does Not End When Alerts Fire

Modern distributed systems fail in ways that can bypass even well-designed isolation patterns. When a failure is actively propagating across services at four in the morning, the question shifts from “how do we limit the blast radius” to “how do we confirm what it actually is.” Monitoring shows which services are in the impact zone, but it cannot show what code path caused the failure to spread, or whether it has stopped.

Span or Attribute in OpenTelemetry Custom Instrumentation

TL;DR: Attribute. More information on one event gives us more correlation power. It’s also cheaper. When you want to add some information to your tracing telemetry, you could emit a log, create a span, or add a piece of data to your current span. Adding a piece of data to your current span is the best! Usually.