Monthly Archive

Understanding the Three Pillars of Observability: Logs, Metrics and Traces

Apr 30, 2026 By Sandro Lima In ChaosSearch

Many people wonder what the difference is between monitoring and observability. While monitoring is simply watching a system, observability means truly understanding a system's state. DevOps teams leverage observability to debug their applications or troubleshoot the root cause of system issues. Peak visibility is achieved by analyzing the three pillars of observability: logs, metrics and traces. Some use MELT (metrics, events, logs and traces) as the four pillars of essential telemetry data, but we'll stick with the three core pillars for this piece.

New in the Honeycomb Academy: Learn to Use the Honeycomb MCP

Apr 29, 2026 By Midge Pickett In Honeycomb

Two things happen when engineers first connect the Honeycomb MCP to their AI assistant. The first is the blank page problem. The Honeycomb UI gives you something to react to: a heatmap, a query builder, a trace to click into. An AI assistant gives you a cursor and nothing else. When you don't know where to start, that's a hard place to be. The second shows up right after you get past the first one. You ask a question, you get a confident-sounding answer, and you're not sure whether to trust it.

State of Observability in Financial Services 2026: From implementation to business impact

Apr 28, 2026 By Leah McEwen In Elastic

The demands on financial services companies are intensifying rapidly. They must not only deliver seamless system performance but also control costs, secure sensitive data, and maximize the value of their observability investments. To navigate these converging pressures, leaders are evolving their approach to system monitoring and telemetry. The 2026 State of Observability in Financial Services research report reveals a fundamental shift in how organizations manage their digital infrastructure.

What "AI-Ready Data" actually means for observability teams

Apr 28, 2026 By Micha Duman In Coralogix

Many organizations deploying AI are learning similar lessons right now: the challenge isn’t this or that AI model, it’s the data. According to Gartner, 60% of AI projects will be abandoned by organizations because of failures to support these projects with AI-ready data. Also, 63% of organizations either lack or aren’t sure they have the right data management practices to get there.

Service-Centric Observability as the Control Layer

Apr 28, 2026 By ScienceLogic In ScienceLogic

If distributed architectures have altered how systems degrade, then the way organizations model operational must evolve accordingly. Threshold monitoring evaluates individual metrics. Correlation clusters related alerts. Neither, on its own, explains how instability in one component alters exposure across an interconnected service landscape. In conversations at Nexus Live 2025, ScienceLogic’s annual customer conference, leaders described this distinction with clarity.

Get observability in the terminal, for you and your agents, with the gcx CLI tool

Apr 28, 2026 By Ward Bekker In Grafana

The way you write code is changing, which means the way you observe your systems and respond to issues needs to change, too. Engineers today spend much of their day working via command line, as agentic tools like Cursor and Claude Code have become highly effective at handling many day-to-day engineering tasks. This greatly accelerates code generation, but it doesn't solve for the context switching that comes when you have to jump into another tool that's not part of this new, faster workflow.

Approaching the Parhelion

Apr 27, 2026 By Austin Parker In Honeycomb

One early spring morning in 1535, the residents of Stockholm awoke to a most curious sight. Six suns lit up the sky, connected by bright halos, as immortalized in Vädersolstavlan. Today, we recognize these atmospheric effects as a parhelion (also referred to as ‘sun dogs’)—an illusion caused by light refracting off crystalline formations in the atmosphere.

Zero-config Go heap profiling

Apr 27, 2026 By Nikolay Sivko In Coroot

Coroot's node-agent already collects CPU profiles for any process on the node using eBPF, with zero integration from the application side. For Java, we dynamically inject async-profiler into the JVM to get memory and lock profiles. But Go processes were still a blind spot for non-CPU profiling unless the app exposed a pprof endpoint and the cluster-agent scraped it. We wanted the same zero-config experience for Go heap profiles. This post is about how we got there.

Not All Telemetry Requires Premium Pricing

Apr 27, 2026 By Pablo Fernandez In VictoriaMetrics

Observability in software is often framed as a choice between self-hosted and SaaS: manage it yourself, or pay a vendor to handle your data. Both self-hosted and SaaS approaches have their merits, but assuming you must choose one exclusively over the other leads to poor trade-offs: either overcommitting to an all-in-one SaaS despite spiraling costs, or fully self-hosting when it’s unnecessary.

Code Agents Need Observability

Apr 26, 2026 By Lily Waldorf In Coralogix

For those of us using tools like Claude Code, Codex, or Gemini, we already know they’re powerful. They can write code, refactor functions, open PRs, even run commands. For a lot of developers, they’re already part of the daily workflow. But once you zoom out beyond the individual developer, the biggest problem isn’t productivity. It’s control. AI coding tools are powerful, but they introduce a new, unpredictable cost layer that most teams don’t fully understand.

Managing OpenTelemetry Semantic Convention Migrations With the Collector

Apr 23, 2026 By Mike Goldsmith In Honeycomb

Real production data tells the story better than I can. Juraci Paixão Kröhling, a friend and fellow observability practitioner at OllyGarden, recently shared an example from an anonymized production environment: 1,830 occurrences of http.url and 23,984 occurrences of url.full in the same dataset. Both attributes describe the same thing. Both are actively being written to the same backend at the same time.
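
As a rough illustration of the mismatch described above, the sketch below counts attribute keys across a handful of made-up spans and renames the legacy `http.url` key to `url.full`. It is plain Python for illustration only, not the Collector configuration the post covers; the `normalize` and `count_keys` helpers and the sample spans are hypothetical.

```python
from collections import Counter

# Legacy -> current attribute names. Only the rename mentioned in the
# teaser (http.url -> url.full) is listed here.
RENAMES = {"http.url": "url.full"}


def normalize(attributes: dict) -> dict:
    """Return a copy of the attributes with legacy keys renamed.

    If a span carries both the legacy and the current key, the value
    stored under the current key wins.
    """
    out = {}
    for key, value in attributes.items():
        new_key = RENAMES.get(key, key)
        if key != new_key and new_key in out:
            # The current key was already copied; drop the legacy duplicate.
            continue
        out[new_key] = value
    return out


def count_keys(spans: list) -> Counter:
    """Count how often each attribute key appears across spans."""
    return Counter(key for span in spans for key in span)


# Made-up sample data: two spans use the legacy key, one uses the new key.
spans = [
    {"http.url": "https://example.com/a"},
    {"http.url": "https://example.com/b"},
    {"url.full": "https://example.com/c"},
]

before = count_keys(spans)
after = count_keys([normalize(s) for s in spans])
```

Comparing `before` and `after` shows the two keys collapsing into one, which is the effect a migration is meant to have on queries against the backend.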

What Is AI Agent Observability? Why Cost Is The Signal You're Missing

Apr 23, 2026 By Keith MacKenzie In CloudZero

Your LLM observability stack probably handles individual model calls well enough. Latency, token counts, error rates, maybe even evaluation scores....

Beyond Uptime: Building a Self-Healing OpenClaw Observability Stack

Apr 23, 2026 By Daniel In StatusCake

The allure of OpenClaw is undeniable. You deploy a highly autonomous, self-hosted AI agent, give it access to your repositories and inboxes, and watch it reason through complex workflows while you sleep. It is the dream of the ultimate 10x developer tool realized. But as any veteran DevOps engineer will tell you: running an LLM-backed Node.js agent in production is vastly different from testing it on your local machine.

Observability Focus: Why It Became the Default Language of Modern IT Operations

Apr 23, 2026 By OpsMatters In OpsMatters

Digital services run on fragile highways of microservices, containers, and event streams. Outages no longer hide inside a single server rack; they ripple across regions and ruin brand trust in minutes. Because uninterrupted insight now decides whether a launch soars or stalls, engineers treat observability as the vocabulary for every architectural choice, deployment ritual, and post-incident review. Similar discipline emerges in studios that refine professional end-to-end game dev workflows, where frame drops and lag spikes receive the same diagnostic rigor expected of banking APIs.

What Is LLM Observability? For CFOs And Engineers, The Missing Layer Is Cost

Apr 22, 2026 By Keith MacKenzie In CloudZero

You probably have Datadog. Maybe New Relic, maybe Dynatrace. Your observability stack has been solid for years — and you're still flying blind on AI cost. Here's why LLM observability needs a fourth pillar most tools skip, and how to build one that actually tells you what your models are costing you per request, per feature, per customer.

From Microsoft SCOM to Dashboards

Apr 21, 2026 By NiCE IT Mgmt In NiCE IT Mgmt

System Center Operations Manager (SCOM) remains one of the most capable on-premises monitoring platforms for Microsoft environments. However, as IT operations evolve toward real-time observability and self-service insights, traditional SCOM reporting and consoles can feel restrictive. This whitepaper explores practical ways to extend and modernize your SCOM visualizations using today's leading dashboarding technologies - including SquaredUp, Grafana, Power BI, and Azure Workbooks.

Moving Beyond SolarWinds: Building a Modern Observability Strategy

Apr 21, 2026 By Andy Wojnarek In Galileo

For years, platforms like SolarWinds have been a standard in IT environments. They helped teams answer a fundamental question: are systems up or down? That approach worked well when environments were more contained and predictable. The challenge is that most environments no longer operate that way. Hybrid infrastructure, cloud services, and tightly interconnected applications have changed what “visibility” needs to mean.

AI Observability in Grafana Cloud: A complete solution for monitoring your agentic workloads

Apr 21, 2026 By Maurice Rochau In Grafana

The observability industry has developed great tools for using metrics, logs, traces, and profiles to monitor the cloud native applications that have dominated the last decade of software development. But when it comes to understanding what an AI system is actually doing, we’re often left reading raw conversations, guessing at quality, and reacting too late. And that’s a problem.

Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Apr 21, 2026 By Yasir Ekinci In Grafana

Evaluating agents is hard. Verifying observability tasks is harder. Yes, AI agents have gotten dramatically and quantifiably better at coding and tool use, but observability presents a different kind of challenge. In a real incident, the hard part is rarely just writing a query. It's deciding which signal matters, figuring out whether a spike is noise or symptom, correlating metrics with logs and traces, and sometimes making a change in Grafana without breaking the dashboard another engineer depends on.

Bringing observability data hosting to the UK on AWS

Apr 21, 2026 By Geoffrey Carlisle In Datadog

UK organizations are increasingly required to design systems that account for data residency requirements, ensuring that operational data remains within national boundaries. Many teams already run their applications on AWS infrastructure in the UK, but telemetry data can still be processed outside the region, creating gaps in visibility. Datadog’s upcoming UK availability zone solves this by keeping telemetry data in the same region as the workloads that generate it.

No more monkey-patching: Better observability with tracing channels

Apr 21, 2026 By Sigrid Huemer In Sentry

Almost every production application uses a number of different tools and libraries, whether that’s a library to communicate with a database, a cache, or frameworks like Nest.js or Nitro. To observe what’s going on in production, application developers reach for Application Performance Monitoring (APM) tools like Sentry. But there’s an inherent problem: the performance data that APM tools need most often does not come natively from the libraries themselves.

Fast AI Feedback Loops with Honeycomb and OpenTelemetry

Apr 20, 2026 By Ken Rimple In Honeycomb

Are you writing agentic applications, but aren’t sure what the agents are doing? Finding out too late that you've blown the budget with super expensive models? Not sure where the agents are failing, and feeling a loss of control? Could they do better? Observability is the visibility you need to get the job done. Sending telemetry to Honeycomb explains what your agents are actually doing.

How to solve key site reliability engineering challenges

Apr 20, 2026 By Lightrun Team In Lightrun

Modern site reliability engineering challenges stem from the difficult requirement of confirming why complex systems fail in ways staging cannot replicate. While observability tools signal failures, and AI SREs reason over data, they leave observability gaps regarding the actual state of running code. By utilizing runtime context, teams capture live execution data to accelerate production debugging, resolving incidents in minutes without requiring manual redeploy cycles.

How Observability Powers Autonomous IT in Hybrid Environments

Apr 20, 2026 By LogicMonitor In LogicMonitor

Autonomous IT only works when observability gives it the context to act with confidence. On any given day, a mid-size enterprise generates tens of thousands of alerts across on-prem infrastructure, multiple clouds, SaaS tools, Internet dependencies, and AI workloads. Most of them don’t need a human. A few of them do. Telling the difference, fast enough to matter, is exactly where IT teams are losing ground.

Uptrace MCP Server: Auto-Generate Dashboards with AI in Minutes

Apr 20, 2026 By Uptrace In Uptrace

Tired of clicking through menus to build observability dashboards? In this video I walk through how to configure the Uptrace MCP (Model Context Protocol) server and connect it to an AI assistant so your dashboards get created automatically from natural-language prompts. By the end you'll have a working setup where describing what you want to monitor is enough to get a real, shareable dashboard in Uptrace.

Observability is a design problem: Live Laugh Logs ep. 1 - KubeCon Amsterdam 2026

Apr 20, 2026 By Coralogix In Coralogix

What happens when 20,000 engineers descend on Amsterdam to talk about Kubernetes and AI? Welcome to Episode 1 of Live Laugh Logs, the podcast from Annie, Lewis and Andre from the Coralogix Developer Relations team where we will get together and recap everything going on in our worlds! We had an amazing time at KubeCon in Amsterdam and had loads of insights from the talks we went to around designing observability systems, all the AI tools being created and how to observe them, and using agent-generated code.

Building Audit-Ready Observability for Digital Banking

Apr 20, 2026 By Lily Waldorf In Coralogix

Most observability platforms are built to answer one question: what’s broken right now. Regulators are asking a different one: what happened, exactly, and can you prove it? Digital banking operates under constant regulatory scrutiny, where frameworks like DORA, PCI-DSS, and GDPR require every incident to be fully reconstructed across systems, timelines, and access. Systems can recover quickly, but the ability to explain what happened often remains fragmented across tools and teams.

Centralize observability management with Datadog Governance Console

Apr 17, 2026 By David Iparraguirre In Datadog

As organizations grow, they face increasing difficulty in managing their observability efforts. More teams mean more dashboards, monitors, API keys, pipelines, and custom configurations. Without a centralized view, administrators spend hours chasing down untagged resources, investigating surprise bills, and revoking dormant credentials. Governance becomes a reactive effort to reduce waste and address issues, falling short of its potential to proactively create standards and optimize observability.

Choosing an AI-Driven Observability Platform for Complex Enterprise IT

Apr 17, 2026 By david.arrowsmith In Interlink

Selecting the right observability platform has become a strategic priority for enterprises operating at scale.

You Don't Need Three Pillars, You Need Single Threads

Apr 16, 2026 By Erwin van der Koogh In Honeycomb

Last week was a great reminder for me about the challenges of the traditional model of observability defined by the “three pillars” of metrics, logs, and traces. One of the customers I’m currently working with is a large financial institution that has a robust three pillar implementation. Every critical application ships their telemetry to either or both their cloud-native tool and a central tool.

Building a Unified Enterprise Observability Strategy Webinar

Apr 16, 2026 By SquaredUp In Squared Up

Join Graham Davies, Technical Product Manager at SquaredUp as he provides a practical guide to breaking down data silos between IT, operations and the business. In this session, Graham digs into why dashboard and tool sprawl is making decisions harder, not easier, and shows you a practical framework for building a single source of truth your whole organisation can rely on.

The End of Manual Instrumentation: Scaling Observability with OTel OBI & Coralogix

Apr 16, 2026 By Jonny Steiner In Coralogix

Traditionally, achieving deep visibility into distributed systems required significant trade-offs in engineering time. Collecting meaningful application metrics and traces required teams to embed language-specific agents, modify source code, or manage complex library dependencies across every service.

What Is an AI SRE? And Why Do They Need Live Runtime Evidence?

Apr 15, 2026 By Lightrun Team In Lightrun

AI SREs are autonomous systems that handle incident triage, root cause analysis, and remediation by correlating logs, metrics, traces, and code signals. However, as they rely on pre-configured telemetry, the critical execution details of a specific failure, such as variable state and code paths, can often be missed. As a result, they either force users into manual redeploy loops or make inferences from partial data, diagnosing issues using probability rather than proof.

AI Observability is Coming...

Apr 15, 2026 By Grafana In Grafana

Thanks for watching!

Fewer Tools, Faster Fixes: A Practical Guide to Observability Consolidation

Apr 14, 2026 By Sentry In Sentry

Most observability stacks aren’t designed, they accumulate. A logging tool here, a tracing platform there, and before you know it you’re managing rising costs and a setup that ultimately slows down your team. And you’ve moved further away from actually solving problems for your users.

ICYMI: Is This Code Worth Running? Here's How to Know

Apr 14, 2026 By Rox Williams In Honeycomb

Over the last three months, we’ve been exploring what about software development and observability changes with AI, and what doesn’t. Our conclusion: these five principles will still remain true, even when 90% of the code is AI-driven. The agentic AI space is moving fast. Models are improving, context windows are expanding, and the ways people build and operate agents are changing so fast that any thoughts we share could feel dated by the time you read this.

Optimizing the OpenTelemetry Python SDK for LLM Workloads

Apr 13, 2026 By Alex Boten In Honeycomb

Agentic workloads thrive with precision tooling. Just like developers, they need the rich context, high cardinality, and fast feedback loops that allow them to ask exploratory open-ended questions of their code. But instrumentation is costly, and from the dawn of software, developers have tried to do the most possible with the least amount of resources.

Top 6 AI SRE Tools and Why Runtime-Grounded Reliability Is the New Standard

Apr 13, 2026 By Lightrun Team In Lightrun

AI SRE tools accelerate incident detection, root cause analysis, and remediation across distributed production systems. They ingest telemetry signals, including logs, metrics, traces, alerts, and deployment history, to correlate anomalies, narrow fault domains, and reduce manual triage. This guide breaks down the top AI SRE tools in 2026 and helps you choose the right one based on your team’s biggest bottleneck, whether that is faster triage, deeper root cause analysis, or runtime-level validation.

Beyond the Dashboard: Selector's Patented Approach to Conversational Observability

Apr 10, 2026 By Bob Slevin In Selector

For years, IT operations teams have been trapped in a frustrating paradox: the data they need to solve critical issues is right at their fingertips, yet entirely out of reach. Accessing it requires engineers to master complex, platform-specific query languages, dig through endless layers of dashboards, and hunt for the exact visualization that holds the answer. Under the intense pressures of modern speed, scale, and complexity, this rigid model is breaking down.

Your Questions About AI Agents and Production Feedback Answered

Apr 10, 2026 By Austin Parker In Honeycomb

On April 1st, I joined Akshay Utture from Augment Code for a webinar on how AI agents use production feedback to improve code.

Tech Talk | AI Agents in O11y Cloud

Apr 10, 2026 By Splunk In Splunk

Transform reactive incident response with Splunk’s troubleshooting agents, designed to drastically reduce mean time to identify and resolve issues. This session demonstrates how a multi-agent approach empowers teams of all skill levels to pinpoint root causes, prioritize issues by business impact, and prevent future outages. Tech Talk sessions offer insightful and valuable deep-dives for any technical practitioner.

Telegraf Controller and Agent Observability

Apr 10, 2026 By InfluxData In InfluxData

Telegraf Controller makes it easier to manage and monitor your Telegraf agents in one place. In this overview, Product Manager Scott Anderson explains how it works. Agents pull their configurations directly from the controller and report their status back using a heartbeat plugin. This gives you a clear, real-time view of your deployment health. You can quickly see how everything is running at a high level or drill into individual agents for more detail. It's a simple way to stay on top of large Telegraf setups.

When Your Observability Literally Stops Traffic

Apr 9, 2026 By Alan Mon In Speedscale

Last week, a fleet of autonomous robotaxis in China suddenly stopped working—at scale. Over a hundred vehicles stalled across a city, stranding passengers in traffic and raising immediate concerns about safety, reliability, and trust in autonomous systems. This wasn’t just a bad day for self-driving cars. It was a distributed systems failure, one that happened in the physical world, not just in dashboards.

Uncertainty and Change Are Everywhere in Software Development

Apr 9, 2026 By Douglas Soo In Honeycomb

If you’re like everyone else who works in software development, it’s a good bet that almost every single thing that you thought you knew about your business and engineering has changed as a result of the advent of modern LLMs. How should you respond to these changes? How should you change how you and your team develop software?

Introducing OrionIQ: The End of Manual Observability

Apr 9, 2026 By Tomer Levy In logz.io

OrionIQ is Logz.io’s new agentic observability platform designed to move teams from detecting issues to resolving them automatically. As AI accelerates software development, operations remain manual: engineers still wake up at 2 a.m. to investigate alerts and rebuild context. OrionIQ uses AI agents to analyze real-time telemetry, investigate incidents, identify root causes, and take action across systems.

Intro to Digital Experience Analytics in Splunk Observability Cloud

Apr 9, 2026 By Splunk In Splunk

See how Digital Experience Analytics in Splunk Observability Cloud helps you understand real user behavior, troubleshoot conversion drop-offs, and measure feature adoption—all from a single platform.

Modern Observability is NOT Enough

Apr 9, 2026 By Speedscale In Speedscale

Learn more: speedscale.com.

OpenTelemetry Collector + Uptrace: From Zero to Your First Traces

Apr 9, 2026 By Uptrace In Uptrace

Learn how to set up the OpenTelemetry Collector and connect it to Uptrace for distributed tracing, metrics, and logs. This step-by-step guide walks you through installation, configuration, and sending your first telemetry data — perfect for beginners and anyone looking to level up their observability stack.

AI agent observability: The developer's guide to agent monitoring

Apr 7, 2026 By Sergiy Dybskiy In Sentry

Most "agent observability best practices" content reads like a compliance checklist from 2019 with "AI" pasted over "microservices." Implement comprehensive logging. Establish evaluation metrics. Create governance frameworks. Not a single line of code. No mention of what happens when your agent silently picks the wrong tool on turn 3 and you need to figure out why.

Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA

Apr 7, 2026 By Tohn Furutani In Datadog

This guest blog post is by Tohn Furutani, SRE Engineer at NTT DATA. Over the past year, the conversation around generative AI has shifted from single-shot use cases—such as summarization, Q&A, and chat interfaces—to agentic AI systems that can make decisions based on context, plan multistep actions, invoke tools, and adapt as conditions change.

LLM Cost Monitoring with OpenTelemetry

Apr 7, 2026 By Alexandr Bandurchin In Uptrace

Teams running LLM applications in production face a cost problem that traditional APM tools were never designed to solve. CPU and memory costs are relatively predictable — a web service processing 1,000 requests per second costs roughly the same week over week. LLM API costs are not. A single user session can cost $0.01 or $5 depending on prompt length, model choice, conversation history, and how many retries happen inside your chain.
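
To make the variance concrete, here is a minimal sketch of how per-session cost might be computed from token counts, model choice, and retries. The model names and per-token prices are hypothetical placeholders, not real vendor pricing, and the sketch assumes each retry re-sends the same tokens.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000,000 tokens (not real vendor pricing).
PRICES = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 10.00, "output": 30.00},
}


@dataclass
class Call:
    model: str
    input_tokens: int   # prompt plus any conversation history sent
    output_tokens: int
    attempts: int = 1   # 1 means the call succeeded without retries


def call_cost(call: Call) -> float:
    """Cost of one logical call, assuming every attempt is billed in full."""
    price = PRICES[call.model]
    per_attempt = (
        call.input_tokens * price["input"]
        + call.output_tokens * price["output"]
    ) / 1_000_000
    return per_attempt * call.attempts


def session_cost(calls: list) -> float:
    """Total cost of all calls made during one user session."""
    return sum(call_cost(c) for c in calls)


# A short prompt on the cheaper model versus a long, retried prompt
# on the more expensive one.
cheap_session = [Call("small-model", input_tokens=1_000, output_tokens=500)]
costly_session = [
    Call("large-model", input_tokens=100_000, output_tokens=2_000, attempts=3)
]

cheap_total = session_cost(cheap_session)
costly_total = session_cost(costly_session)
```

Under these assumed prices the two sessions differ by several orders of magnitude, which is the kind of spread that per-request cost attribution is meant to surface.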

Top 5 Continuous Monitoring Tools and Why Runtime Context Is the Layer They Are Missing

Apr 7, 2026 By Lightrun Team In Lightrun

Continuous monitoring tools track system health, performance, and behavior in real time across production environments. For a deeper understanding of how this fits into modern DevOps practices, see this guide on continuous monitoring and its impact on DevOps. They collect logs, metrics, and distributed traces across the infrastructure and application layers, giving engineering teams visibility into how their systems are running, where anomalies occur, and when something needs immediate attention.

Honeycomb Is Built for the Agent Era. Here's the Proof - Part 1

Apr 6, 2026 By Ken Rimple In Honeycomb

The agent era is here. Engineering teams are shipping AI-powered products, deploying multi-agent systems, and trying to figure out what observability even means for non-deterministic systems.

AI Working for You: MCP, Canvas, and Agentic Workflows - Part 2

Apr 6, 2026 By Ken Rimple In Honeycomb

In our previous post in our series on observability for the agent era, we looked at how Honeycomb provides unique visibility into LLMs operating in your production environment. Now, let’s flip it around and explore how Honeycomb provides observability insights uniquely suited to helping your AI agents rapidly diagnose and fix production issues, and build production feedback into the next round of development.

The Fundamentals: Fast, Deep, and Ready for What Comes Next - Part 3

Apr 6, 2026 By Ken Rimple In Honeycomb

The previous two posts in this series have looked at some of the use cases Honeycomb customers are implementing to observe LLMs in production and power agentic observability workflows. In this third and final post, we’ll take it back to basics and look at how the fundamental capabilities and infrastructure of Honeycomb provide the comprehensive data and fast performance that make these use cases work at production scale. AI capabilities built on a weak observability foundation fall apart fast.

Observability in Go: Where to start and what matters most

Apr 6, 2026 By Grafana Labs Team In Grafana

Sometimes the hardest part of debugging a system isn’t fixing the problem—it’s figuring out what’s actually happening in the first place.

We Know Before it Breaks: Observability-Driven Development

Apr 6, 2026 By Datadog In Datadog

When stakeholders push for faster growth (new markets, new features, newly modernized stack) your engineering model has to change too. At FitnessPassport, the shift from offshore waterfall delivery to an in-house team meant rebuilding not just services, but confidence: legacy systems with weak logging and little visibility made it hard to know whether changes were working and impossible to spot issues before users did. In this talk, Director of Engineering Rob Mitchell will share how FitnessPassport adopted Datadog and used structured logs, metrics, and traces to tighten feedback loops.

View Video

Paris | Observability Unleashed - Boostez vos opérations IT, DevOps & SRE

Apr 3, 2026 By Splunk In Splunk

IT environments keep growing more complex. Real-time visibility is no longer optional. On April 14, 2026, Stéphane Estevez, EMEA Observability Market Advisor at Splunk, invites you to Cisco in Paris for an event dedicated to observability, with the Splunk and Cisco teams. On the agenda: AI-assisted observability; integrated data strategies; OpenTelemetry simplified; from data to action, with real-world cases and live demos; observability for AI and by AI.

View Video

KubeCon + CloudNativeCon EU 2026: What We Learned About AI, Observability, and Fast Feedback Loops

Apr 3, 2026 By Abdullah Chowdhury In Honeycomb

Honeycomb was excited to attend KubeCon + CloudNativeCon Europe, where one theme stood out across sessions: as AI reshapes how software is built and run, teams are being pushed to rethink how they understand their systems. Without strong observability and feedback loops, AI can accelerate confusion, misalignment, and operational risk.

The Business Case for AI-Driven Observability in Network Operations

Apr 3, 2026 By Dallon Robinette In Selector

Modern network operations generate an extraordinary amount of telemetry. Metrics, logs, events, topology data, cloud signals, and service context all contribute to a richer picture of system behavior. As environments expand across cloud, data center, edge, and SaaS, the opportunity for operations teams is clear: when that telemetry is unified and understood in context, it becomes a powerful source of resilience, efficiency, and business insight.

When we say "Observability AI Reckoning," what are we actually talking about?

Apr 3, 2026 By Virtana In Virtana

We’ve spent the last decade collecting more telemetry. Now AI is analyzing it. Here’s the catch: AI needs the full dependency chain to reason correctly. If it sees spans but not storage contention… Services but not Kubernetes scheduling… Frontend metrics but not downstream providers… It will confidently optimize the wrong thing. AI doesn’t lower the need for observability. It raises the standard.

View Video

Profiling Java apps: breaking things to prove it works

Apr 3, 2026 By Nikolay Sivko In Coroot

Coroot already does eBPF-based CPU profiling for Java. It catches CPU hotspots well, but that's all it can do. Every time we looked at a GC pressure issue or a latency spike caused by lock contention, we could see something was wrong but not what. We wanted memory allocation and lock contention profiling. So we decided to add async-profiler support to coroot-node-agent. The goal: memory allocation and lock contention profiles for any HotSpot JVM, with zero code changes. Here's how we got there.
