
Sponsored Post

Understanding the Three Pillars of Observability: Logs, Metrics and Traces

Many people wonder what the difference is between monitoring and observability. While monitoring is simply watching a system, observability means truly understanding a system's state. DevOps teams leverage observability to debug their applications or troubleshoot the root cause of system issues. Peak visibility is achieved by analyzing the three pillars of observability: logs, metrics and traces. Depending on who you ask, some use MELT (metrics, events, logs and traces) as the four pillars of essential telemetry data, but we'll stick with the three core pillars for this piece.

New in the Honeycomb Academy: Learn to Use the Honeycomb MCP

Two things happen when engineers first connect the Honeycomb MCP to their AI assistant. The first is the blank page problem. The Honeycomb UI gives you something to react to: a heatmap, a query builder, a trace to click into. An AI assistant gives you a cursor and nothing else. When you don't know where to start, that's a hard place to be. The second shows up right after you get past the first one. You ask a question, you get a confident-sounding answer, and you're not sure whether to trust it.

Service-Centric Observability as the Control Layer

If distributed architectures have altered how systems degrade, then the way organizations model operational risk must evolve accordingly. Threshold monitoring evaluates individual metrics. Correlation clusters related alerts. Neither, on its own, explains how instability in one component alters exposure across an interconnected service landscape. In conversations at Nexus Live 2025, ScienceLogic’s annual customer conference, leaders described this distinction with clarity.

Get observability in the terminal, for you and your agents, with the gcx CLI tool

The way you write code is changing, which means the way you observe your systems and respond to issues needs to change, too. Engineers today spend much of their day working via command line, as agentic tools like Cursor and Claude Code have become highly effective at handling many day-to-day engineering tasks. This greatly accelerates code generation, but it doesn't solve for the context switching that comes when you have to jump into another tool that's not part of this new, faster workflow.

State of Observability in Financial Services 2026: From implementation to business impact

The demands on financial services companies are intensifying rapidly. They must not only deliver seamless system performance but also control costs, secure sensitive data, and maximize the value of their observability investments. To navigate these converging pressures, leaders are evolving their approach to system monitoring and telemetry. The 2026 State of Observability in Financial Services research report reveals a fundamental shift in how organizations manage their digital infrastructure.

What "AI-Ready Data" actually means for observability teams

Many organizations deploying AI are learning similar lessons right now: the challenge isn’t this or that AI model, it’s the data. According to Gartner, 60% of AI projects will be abandoned by organizations because of failures to support these projects with AI-ready data. Also, 63% of organizations either lack or aren’t sure they have the right data management practices to get there.

Approaching the Parhelion

One early spring morning in 1535, the residents of Stockholm awoke to a most curious sight. Six suns lit up the sky, connected by bright halos, as immortalized in Vädersolstavlan, seen here. Today, we recognize these atmospheric effects as a parhelion (also referred to as ‘sun dogs’)—an illusion caused by light refracting off crystalline formations in the atmosphere.

Zero-config Go heap profiling

Coroot's node-agent already collects CPU profiles for any process on the node using eBPF, with zero integration from the application side. For Java, we dynamically inject async-profiler into the JVM to get memory and lock profiles. But Go processes were still a blind spot for non-CPU profiling unless the app exposed a pprof endpoint and the cluster-agent scraped it. We wanted the same zero-config experience for Go heap profiles. This post is about how we got there.

Not All Telemetry Requires Premium Pricing

Observability in software is often framed as a choice between self-hosted and SaaS: manage it yourself, or pay a vendor to handle your data. Both self-hosted and SaaS approaches have their merits, but assuming you must choose one exclusively over the other leads to poor trade-offs: either overcommitting to an all-in-one SaaS despite spiraling costs, or fully self-hosting when it’s unnecessary.

Code Agents Need Observability

For those of us using tools like Claude Code, Codex, or Gemini, we already know they’re powerful. They can write code, refactor functions, open PRs, even run commands. For a lot of developers, they’re already part of the daily workflow. But once you zoom out beyond the individual developer, the biggest problem isn’t productivity. It’s control. AI coding tools are powerful, but they introduce a new, unpredictable cost layer that most teams don’t fully understand.

Managing OpenTelemetry Semantic Convention Migrations With the Collector

Real production data tells the story better than I can. Juraci Paixão Kröhling, a friend and fellow observability practitioner at OllyGarden, recently shared an example from an anonymized production environment: 1,830 occurrences of http.url and 23,984 occurrences of url.full in the same dataset. Both attributes describe the same thing. Both are actively being written to the same backend at the same time.
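To make the duplication concrete, here is a minimal sketch of the renaming step in plain Python. It is not the Collector configuration the article covers. The `RENAMES` table and the `migrate_attributes` helper are hypothetical names used for illustration; only the `http.url` to `url.full` pair comes from the example above.

```python
# Hypothetical rename table: deprecated attribute name -> current name.
# Only the http.url -> url.full pair is taken from the text above.
RENAMES = {"http.url": "url.full"}


def migrate_attributes(attrs: dict) -> dict:
    """Return a copy of attrs with deprecated keys renamed.

    If a record carries both the old and the new key, the value
    already written under the new key is kept.
    """
    migrated = {}
    for key, value in attrs.items():
        new_key = RENAMES.get(key, key)
        # A deprecated key must not overwrite a value that was
        # already stored under the current name.
        if key != new_key and new_key in migrated:
            continue
        migrated[new_key] = value
    return migrated


# A span written with the old convention ends up with the new key.
old_style = {"http.url": "https://example.com/a", "http.method": "GET"}
print(migrate_attributes(old_style))
```

Applying one rename table at a single point in the pipeline is what lets old and new producers coexist while queries target only the new attribute name.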

Beyond Uptime: Building a Self-Healing OpenClaw Observability Stack

The allure of OpenClaw is undeniable. You deploy a highly autonomous, self-hosted AI agent, give it access to your repositories and inboxes, and watch it reason through complex workflows while you sleep. It is the dream of the ultimate 10x developer tool realized. But as any veteran DevOps engineer will tell you: running an LLM-backed Node.js agent in production is vastly different from testing it on your local machine.

Observability Focus: Why It Became the Default Language of Modern IT Operations

Digital services run on fragile highways of microservices, containers, and event streams. Outages no longer hide inside a single server rack; they ripple across regions and ruin brand trust in minutes. Because uninterrupted insight now decides whether a launch soars or stalls, engineers treat observability as the vocabulary for every architectural choice, deployment ritual, and post-incident review. Similar discipline emerges in studios that refine professional end-to-end game dev workflows, where frame drops and lag spikes receive the same diagnostic rigor expected of banking APIs.

What Is LLM Observability? For CFOs And Engineers, The Missing Layer Is Cost

You probably have Datadog. Maybe New Relic, maybe Dynatrace. Your observability stack has been solid for years, and you're still flying blind on AI cost. Here's why LLM observability needs a fourth pillar most tools skip, and how to build one that actually tells you what your models are costing you per request, per feature, per customer.

Sponsored Post

From Microsoft SCOM to Dashboards

System Center Operations Manager (SCOM) remains one of the most capable on-premises monitoring platforms for Microsoft environments. However, as IT operations evolve toward real-time observability and self-service insights, traditional SCOM reporting and consoles can feel restrictive. This whitepaper explores practical ways to extend and modernize your SCOM visualizations using today's leading dashboarding technologies - including SquaredUp, Grafana, Power BI, and Azure Workbooks.

Moving Beyond SolarWinds: Building a Modern Observability Strategy

For years, platforms like SolarWinds have been a standard in IT environments. They helped teams answer a fundamental question: are systems up or down? That approach worked well when environments were more contained and predictable. The challenge is that most environments no longer operate that way. Hybrid infrastructure, cloud services, and tightly interconnected applications have changed what “visibility” needs to mean.

Bringing observability data hosting to the UK on AWS

UK organizations are increasingly required to design systems that account for data residency requirements, ensuring that operational data remains within national boundaries. Many teams already run their applications on AWS infrastructure in the UK, but telemetry data can still be processed outside the region, creating gaps in visibility. Datadog’s upcoming UK availability zone solves this by keeping telemetry data in the same region as the workloads that generate it.

No more monkey-patching: Better observability with tracing channels

Almost every production application uses a number of different tools and libraries, whether that’s a library to communicate with a database, a cache, or frameworks like Nest.js or Nitro. To be able to observe what’s going on in production, application developers reach for Application Performance Monitoring (APM) tools like Sentry. But there’s an inherent problem: the performance data that APM tools need most often does not come natively from the libraries themselves.

AI Observability in Grafana Cloud: A complete solution for monitoring your agentic workloads

The observability industry has developed great tools for using metrics, logs, traces, and profiles to monitor the cloud native applications that have dominated the last decade of software development. But when it comes to understanding what an AI system is actually doing, we’re often left reading raw conversations, guessing at quality, and reacting too late. And that’s a problem.

Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Evaluating agents is hard. Verifying observability tasks is harder. Yes, AI agents have gotten dramatically and quantifiably better at coding and tool use, but observability presents a different kind of challenge. In a real incident, the hard part is rarely just writing a query. It's deciding which signal matters, figuring out whether a spike is noise or symptom, correlating metrics with logs and traces, and sometimes making a change in Grafana without breaking the dashboard another engineer depends on.

Fast AI Feedback Loops with Honeycomb and OpenTelemetry

Are you writing agentic applications, but aren’t sure what the agents are doing? Finding out too late that you've blown the budget with super expensive models? Not sure where the agents are failing, and feeling a loss of control? Could they do better? Observability is the visibility you need to get the job done. Sending telemetry to Honeycomb explains what your agents are actually doing.

How to solve key site reliability engineering challenges

Modern site reliability engineering challenges stem from the difficult requirement of confirming why complex systems fail in ways staging cannot replicate. While observability tools signal failures, and AI SREs reason over data, they leave observability gaps regarding the actual state of running code. By utilizing runtime context, teams capture live execution data to accelerate production debugging, resolving incidents in minutes without requiring manual redeploy cycles.

How Observability Powers Autonomous IT in Hybrid Environments

Autonomous IT only works when observability gives it the context to act with confidence. On any given day, a mid-size enterprise generates tens of thousands of alerts across on-prem infrastructure, multiple clouds, SaaS tools, Internet dependencies, and AI workloads. Most of them don’t need a human. A few of them do. Telling the difference, fast enough to matter, is exactly where IT teams are losing ground.

Uptrace MCP Server: Auto-Generate Dashboards with AI in Minutes

Tired of clicking through menus to build observability dashboards? In this video I walk through how to configure the Uptrace MCP (Model Context Protocol) server and connect it to an AI assistant so your dashboards get created automatically from natural-language prompts. By the end you'll have a working setup where describing what you want to monitor is enough to get a real, shareable dashboard in Uptrace.

Observability is a design problem: Live Laugh Logs ep. 1 - KubeCon Amsterdam 2026

What happens when 20,000 engineers descend on Amsterdam to talk about Kubernetes and AI? Welcome to Episode 1 of Live Laugh Logs, the podcast from Annie, Lewis and Andre from the Coralogix Developer Relations team where we will get together and recap everything going on in our worlds! We had an amazing time at KubeCon in Amsterdam and had loads of insights from the talks we went to around designing observability systems, all the AI tools being created and how to observe them, and using agent-generated code.

Building Audit-Ready Observability for Digital Banking

Most observability platforms are built to answer one question: what’s broken right now. Regulators are asking a different one: what happened, exactly, and can you prove it? Digital banking operates under constant regulatory scrutiny, where frameworks like DORA, PCI-DSS, and GDPR require every incident to be fully reconstructed across systems, timelines, and access. Systems can recover quickly, but the ability to explain what happened often remains fragmented across tools and teams.

Centralize observability management with Datadog Governance Console

As organizations grow, they face increasing difficulty in managing their observability efforts. More teams mean more dashboards, monitors, API keys, pipelines, and custom configurations. Without a centralized view, administrators spend hours chasing down untagged resources, investigating surprise bills, and revoking dormant credentials. Governance becomes a reactive effort to reduce waste and address issues, falling short of its potential to proactively create standards and optimize observability.

You Don't Need Three Pillars, You Need Single Threads

Last week was a great reminder for me about the challenges of the traditional model of observability defined by the “three pillars” of metrics, logs, and traces. One of the customers I’m currently working with is a large financial institution that has a robust three-pillar implementation. Every critical application ships its telemetry to its cloud-native tool, a central tool, or both.

Building a Unified Enterprise Observability Strategy Webinar

Join Graham Davies, Technical Product Manager at SquaredUp as he provides a practical guide to breaking down data silos between IT, operations and the business. In this session, Graham digs into why dashboard and tool sprawl is making decisions harder, not easier, and shows you a practical framework for building a single source of truth your whole organisation can rely on.

The End of Manual Instrumentation: Scaling Observability with OTel OBI & Coralogix

Traditionally, achieving deep visibility into distributed systems required significant trade-offs in engineering time. Collecting meaningful application metrics and traces required teams to embed language-specific agents, modify source code, or manage complex library dependencies across every service.

What Is an AI SRE? And Why Do They Need Live Runtime Evidence?

AI SREs are autonomous systems that handle incident triage, root cause analysis, and remediation by correlating logs, metrics, traces, and code signals. However, as they rely on pre-configured telemetry, the critical execution details of a specific failure, such as variable state and code paths, can often be missed. As a result, they either force users into manual redeploy loops or make inferences from partial data, diagnosing issues using probability rather than proof.

Fewer Tools, Faster Fixes: A Practical Guide to Observability Consolidation

Most observability stacks aren’t designed, they accumulate. A logging tool here, a tracing platform there, and before you know it you’re managing rising costs and a setup that ultimately slows down your team. And you’ve moved further away from actually solving problems for your users.

ICYMI: Is This Code Worth Running? Here's How to Know

Over the last three months, we’ve been exploring what about software development and observability changes with AI, and what doesn’t. Our conclusion: these five principles will still remain true, even when 90% of the code is AI-driven. The agentic AI space is moving fast. Models are improving, context windows are expanding, and the ways people build and operate agents are changing so fast that any thoughts we share could feel dated by the time you read this.

Optimizing the OpenTelemetry Python SDK for LLM Workloads

Agentic workloads thrive with precision tooling. Just like developers, they need the rich context, high cardinality, and fast feedback loops that allow them to ask exploratory open-ended questions of their code. But instrumentation is costly, and from the dawn of software, developers have tried to do the most possible with the least amount of resources.

Top 6 AI SRE Tools and Why Runtime-Grounded Reliability Is the New Standard

AI SRE tools accelerate incident detection, root cause analysis, and remediation across distributed production systems. They ingest telemetry signals, including logs, metrics, traces, alerts, and deployment history, to correlate anomalies, narrow fault domains, and reduce manual triage. This guide breaks down the top AI SRE tools in 2026 and helps you choose the right one based on your team’s biggest bottleneck, whether that is faster triage, deeper root cause analysis, or runtime-level validation.

Beyond the Dashboard: Selector's Patented Approach to Conversational Observability

For years, IT operations teams have been trapped in a frustrating paradox: the data they need to solve critical issues is right at their fingertips, yet entirely out of reach. Accessing it requires engineers to master complex, platform-specific query languages, dig through endless layers of dashboards, and hunt for the exact visualization that holds the answer. Under the intense pressures of modern speed, scale, and complexity, this rigid model is breaking down.

Tech Talk | AI Agents in O11y Cloud

Transform reactive incident response with Splunk’s troubleshooting agents, designed to drastically reduce mean time to identify and resolve issues. This session demonstrates how a multi-agent approach empowers teams of all skill levels to pinpoint root causes, prioritize issues by business impact, and prevent future outages. Tech Talk sessions offer insightful and valuable deep-dives for any technical practitioner.

When Your Observability Literally Stops Traffic

Last week, a fleet of autonomous robotaxis in China suddenly stopped working—at scale. Over a hundred vehicles stalled across a city, stranding passengers in traffic and raising immediate concerns about safety, reliability, and trust in autonomous systems. This wasn’t just a bad day for self-driving cars. It was a distributed systems failure, one that happened in the physical world, not just in dashboards.

Uncertainty and Change Are Everywhere in Software Development

If you’re like everyone else who works in software development, it’s a good bet that almost every single thing that you thought you knew about your business and engineering has changed as a result of the advent of modern LLMs. How should you respond to these changes? How should you change how you and your team develop software?

Introducing OrionIQ: The End of Manual Observability

OrionIQ is Logz.io’s new agentic observability platform designed to move teams from detecting issues to resolving them automatically. As AI accelerates software development, operations remain manual: engineers still wake up at 2 a.m. to investigate alerts and rebuild context. OrionIQ uses AI agents to analyze real-time telemetry, investigate incidents, identify root causes, and take action across systems.

OpenTelemetry Collector + Uptrace: From Zero to Your First Traces

Learn how to set up the OpenTelemetry Collector and connect it to Uptrace for distributed tracing, metrics, and logs. This step-by-step guide walks you through installation, configuration, and sending your first telemetry data — perfect for beginners and anyone looking to level up their observability stack.

Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA

This guest blog post is by Tohn Furutani, SRE Engineer at NTT DATA. Over the past year, the conversation around generative AI has shifted from single-shot use cases—such as summarization, Q&A, and chat interfaces—to agentic AI systems that can make decisions based on context, plan multistep actions, invoke tools, and adapt as conditions change.

LLM Cost Monitoring with OpenTelemetry

Teams running LLM applications in production face a cost problem that traditional APM tools were never designed to solve. CPU and memory costs are relatively predictable — a web service processing 1,000 requests per second costs roughly the same week over week. LLM API costs are not. A single user session can cost $0.01 or $5 depending on prompt length, model choice, conversation history, and how many retries happen inside your chain.
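The spread between a $0.01 session and a $5 session follows directly from token counts, model choice, and call volume. The sketch below shows that arithmetic. The model names, the per-token prices, and the `call_cost` and `session_cost` helpers are all hypothetical; real prices vary by provider and change over time.

```python
# Hypothetical prices in USD per 1,000 tokens. These are made-up
# numbers for illustration, not any provider's actual price list.
PRICES_PER_1K = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "large-model": {"input": 0.01, "output": 0.03},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one LLM call: tokens times the per-token price."""
    price = PRICES_PER_1K[model]
    return (input_tokens / 1000) * price["input"] + (
        output_tokens / 1000
    ) * price["output"]


def session_cost(calls: list) -> float:
    """Total cost of a session; every retry counts as another call."""
    return sum(
        call_cost(c["model"], c["input_tokens"], c["output_tokens"])
        for c in calls
    )


# One short call to the small model.
cheap = [{"model": "small-model", "input_tokens": 500, "output_tokens": 200}]

# Forty calls to the large model, each carrying a long history.
expensive = [
    {"model": "large-model", "input_tokens": 10_000, "output_tokens": 1_000}
] * 40

print(f"cheap session:     ${session_cost(cheap):.5f}")
print(f"expensive session: ${session_cost(expensive):.2f}")
```

Under these assumed prices the two sessions differ by roughly four orders of magnitude, which is why per-request token counts and model names are the attributes worth recording on each span.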

Top 5 Continuous Monitoring Tools and Why Runtime Context Is the Layer They Are Missing

Continuous monitoring tools track system health, performance, and behavior in real time across production environments. For a deeper understanding of how this fits into modern DevOps practices, see this guide on continuous monitoring and its impact on DevOps. They collect logs, metrics, and distributed traces across the infrastructure and application layers, giving engineering teams visibility into how their systems are running, where anomalies occur, and when something needs immediate attention.

AI agent observability: The developer's guide to agent monitoring

Most "agent observability best practices" content reads like a compliance checklist from 2019 with "AI" pasted over "microservices." Implement comprehensive logging. Establish evaluation metrics. Create governance frameworks. Not a single line of code. No mention of what happens when your agent silently picks the wrong tool on turn 3 and you need to figure out why.

AI Working for You: MCP, Canvas, and Agentic Workflows - Part 2

In our previous post in our series on observability for the agent era, we looked at how Honeycomb provides unique visibility into LLMs operating in your production environment. Now, let’s flip it around and explore how Honeycomb provides observability insights uniquely suited to helping your AI agents rapidly diagnose and fix production issues, and build production feedback into the next round of development.

The Fundamentals: Fast, Deep, and Ready for What Comes Next - Part 3

The previous two posts in this series have looked at some of the use cases Honeycomb customers are implementing to observe LLMs in production and power agentic observability workflows. In this third and final post, we’ll take it back to basics and look at how the fundamental capabilities and infrastructure of Honeycomb provide the comprehensive data and fast performance that makes these use cases work at production scale. AI capabilities built on a weak observability foundation fall apart fast.

We Know Before it Breaks: Observability-Driven Development

When stakeholders push for faster growth (new markets, new features, newly modernized stack) your engineering model has to change too. At FitnessPassport, the shift from offshore waterfall delivery to an in-house team meant rebuilding not just services, but confidence: legacy systems with weak logging and little visibility made it hard to know whether changes were working and impossible to spot issues before users did. In this talk, Director of Engineering Rob Mitchell will share how FitnessPassport adopted Datadog and used structured logs, metrics, and traces to tighten feedback loops.

Paris | Observability Unleashed - Boost Your IT, DevOps & SRE Operations

The complexity of IT environments keeps growing. Real-time visibility is no longer optional. On April 14, 2026, Stéphane Estevez, EMEA Observability Market Advisor at Splunk, invites you to Cisco in Paris for an event dedicated to observability, with the Splunk and Cisco teams. On the agenda: AI-assisted observability; integrated data strategies; OpenTelemetry made simple; from data to action, with concrete use cases and live demos; and observability for AI and by AI.

KubeCon + CloudNativeCon EU 2026: What We Learned About AI, Observability, and Fast Feedback Loops

Honeycomb was excited to attend KubeCon + CloudNativeCon Europe, where one theme stood out across sessions: as AI reshapes how software is built and run, teams are being pushed to rethink how they understand their systems. Without strong observability and feedback loops, AI can accelerate confusion, misalignment, and operational risk.

The Business Case for AI-Driven Observability in Network Operations

Modern network operations generate an extraordinary amount of telemetry. Metrics, logs, events, topology data, cloud signals, and service context all contribute to a richer picture of system behavior. As environments expand across cloud, data center, edge, and SaaS, the opportunity for operations teams is clear: when that telemetry is unified and understood in context, it becomes a powerful source of resilience, efficiency, and business insight.

When we say "Observability AI Reckoning," what are we actually talking about?

We’ve spent the last decade collecting more telemetry. Now AI is analyzing it. Here’s the catch: AI needs the full dependency chain to reason correctly. If it sees spans but not storage contention… Services but not Kubernetes scheduling… Frontend metrics but not downstream providers… It will confidently optimize the wrong thing. AI doesn’t lower the need for observability. It raises the standard.

Profiling Java apps: breaking things to prove it works

Coroot already does eBPF-based CPU profiling for Java. It catches CPU hotspots well, but that's all it can do. Every time we looked at a GC pressure issue or a latency spike caused by lock contention, we could see something was wrong but not what. We wanted memory allocation and lock contention profiling. So we decided to add async-profiler support to coroot-node-agent. The goal: memory allocation and lock contention profiles for any HotSpot JVM, with zero code changes. Here's how we got there.