Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Observability for complex systems and related technologies.

Could vs. Should: The First Year Managing an SRE Team

As of today, I’ve drafted this post upwards of 10 times – it’s old enough that the version I first started working on was called “Reflections on 1 Year of SRE Management” (I’m currently at 2.5 years). But everything I learned during that first year became critical for the next.

Why Modern IT Incident Response Needs Social Sentiment Analysis

IT operations teams face an ongoing battle against alert fatigue. Despite running sophisticated telemetry and baseline Application Performance Monitoring, engineers are often bombarded with notifications that lead nowhere. Relying purely on internal dashboards creates a massive visibility gap, and when critical incidents slip through the cracks, the financial damage is swift and severe. To close this gap, DevOps professionals are increasingly looking beyond traditional server metrics and turning to a surprising source for early warning signals: public social sentiment.

How AI Agents Are Changing Each Agile SDLC Phase

The Agile software development lifecycle was designed to surface problems early, with short sprints, iterative testing, and continuous integration built on the premise that faster feedback loops produce better software. AI coding tools have changed the velocity equation across every phase of that loop, but the phases designed to catch failures are struggling to keep up. Build speed and validation capacity have not accelerated at the same rate, and the gap between them is widening with every sprint.

Full-stack observability in Grafana Cloud: How to investigate issues across services and infrastructure

Often, the hardest part of troubleshooting isn’t fixing the actual problem. It’s figuring out where to start. As engineers, it’s easy to lose count of how many times we’ve opened logs, then 10 metrics tabs, and another 10 tabs with trace queries, only to end up back in the logs trying to find a root cause.

What Customers Are Doing With AI and Honeycomb

At O11yCon, we talked to engineering teams across the industry, and the numbers are starting to get genuinely wild: Mixpanel DevOps Engineer Eddie Bracho told us their engineering team is generating 50% more PRs than before AI came into the mix (sorry). That kind of velocity is exciting, but it's also a pressure test for every part of your stack that isn't writing code, including your observability practice. Here's what we're hearing from customers about how that's playing out.

Debug and evaluate your AI app from your coding agent with Datadog Agent Observability

Coding agents like Claude Code, Cursor, and Codex CLI handle the coding parts of building an AI application well. The harder work comes after: understanding why a response went wrong, building eval sets that reflect real production behavior, and keeping up with an application that changes faster than any one-off script can. Teams spend 60–80% of their time on evaluation and error analysis, and much of that work needs to be redone every time the stack shifts.

New Feature: Automatic Snapshots When Latency Spikes

We’ve released an exciting new Lightrun capability: set a duration threshold on your Tic & Toc or Method Duration metrics, and Lightrun will automatically capture a snapshot whenever execution exceeds it. It takes moments to configure, and gives engineers the runtime context they need to understand why unexpected slow executions are occurring.
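
To illustrate the general idea of capturing context only when execution exceeds a duration threshold, here is a minimal Python sketch. It is not Lightrun's API or implementation: the `snapshot_if_slow` decorator, the `SNAPSHOTS` list, and the `lookup` function are hypothetical names invented for this example, and a real tool would capture far richer runtime state than the call arguments.

```python
import functools
import time

# Hypothetical in-memory store; a real tool would send snapshots elsewhere.
SNAPSHOTS = []


def snapshot_if_slow(threshold_seconds):
    """Record call context whenever the wrapped function exceeds the threshold."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                # Only slow executions produce a snapshot.
                if elapsed > threshold_seconds:
                    SNAPSHOTS.append({
                        "function": func.__name__,
                        "args": args,
                        "kwargs": kwargs,
                        "elapsed_seconds": elapsed,
                    })
        return wrapper
    return decorator


@snapshot_if_slow(threshold_seconds=0.05)
def lookup(key, delay=0.0):
    time.sleep(delay)  # stands in for slow work
    return key.upper()
```

Calling `lookup("a")` leaves `SNAPSHOTS` empty, while `lookup("b", delay=0.1)` appends one entry with the arguments and measured duration.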

The hard part of AI root cause analysis is no longer the model

Every few weeks someone tells me root cause analysis is a solved problem now: pipe your telemetry into an LLM, let it tell you what broke. I wish it were that easy. After years on this, I think "can AI do RCA?" is the wrong question, because doing RCA with an LLM is really two separate jobs, and the answer is different for each. They break in completely different ways, so it's worth pulling them apart.

Instrumenting AI Agents for the Agent Timeline: A Practical OpenTelemetry Guide

AI agents are nondeterministic, multi-step, and opaque. When one fails in production, "the model said something weird" is the cheapest, most useless line in your incident postmortem. To debug agents the way they actually run, you need telemetry that captures all of it, in order, with enough context to reconstruct what happened. The OpenTelemetry GenAI Semantic Conventions give you a vendor-neutral way to do exactly that.
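
As a rough sketch of what "all of it, in order, with enough context" can look like, the following Python example records each agent step as an ordered entry with attributes. It deliberately uses only the standard library rather than the OpenTelemetry SDK, so `TIMELINE`, `record_step`, and `run_agent` are hypothetical stand-ins for real spans and exporters. The attribute keys follow the GenAI Semantic Conventions as I understand them and should be checked against the current specification; the model and tool names are placeholders.

```python
import time
from contextlib import contextmanager

# Hypothetical stand-in for a span exporter.
TIMELINE = []


@contextmanager
def record_step(name, attributes=None):
    """Record one agent step, appended at start so order matches execution."""
    entry = {
        "name": name,
        "attributes": dict(attributes or {}),
        "start": time.monotonic(),
    }
    TIMELINE.append(entry)
    try:
        yield entry
    except Exception as exc:
        # Keep the failure type next to the step that raised it.
        entry["attributes"]["error.type"] = type(exc).__name__
        raise
    finally:
        entry["end"] = time.monotonic()


def run_agent(question):
    """Simulate an agent run: one model call followed by one tool call."""
    with record_step("invoke_agent", {"gen_ai.operation.name": "invoke_agent"}):
        chat_attrs = {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": "example-model",  # placeholder
        }
        with record_step("chat", chat_attrs) as step:
            # Word count is a crude stand-in for a real token count.
            step["attributes"]["gen_ai.usage.input_tokens"] = len(question.split())
        tool_attrs = {
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": "search",  # placeholder
        }
        with record_step("execute_tool", tool_attrs):
            pass  # the tool call would happen here
    return [entry["name"] for entry in TIMELINE]
```

In a real setup each `record_step` would be a span created by a tracer, with parent and child relationships handled by the SDK instead of a flat list.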

Why Observability Isn't Enough for AI Coding Agents

Observability platforms collect pre-instrumented logs, metrics, and distributed traces to monitor production systems and surface failures to human engineers. The adoption of AI into engineering has led observability providers to offer those same signals to agents. This is often packaged as AI observability, but the signals themselves were designed around a human investigation loop. AI coding agents work faster, consume data differently, and need feedback as they work rather than after deployment.