Your telemetry, your apps. Inside apps on the Cribl Platform

You already use Cribl to tame your telemetry data. Now you can turn that data into apps your teams actually want to use. In this video, we walk through how to create apps in the Cribl Platform and show how real apps solve real problems: guided troubleshooting for noisy incidents, opinionated security views, and exec-friendly ROI dashboards. You’ll see how apps sit on top of Cribl Stream, Edge, Search, and Lake, so you reuse the data and logic you already have instead of building custom tools from scratch.

Unified observability for Alibaba Cloud with Datadog

Alibaba Cloud is a major cloud provider in APAC, offering industry-leading foundational AI models in addition to compute, managed databases, object storage, and Kubernetes through its Container Service for Kubernetes (ACK). Teams choose Alibaba Cloud for its infrastructure availability across Asia Pacific and its managed services. For SREs and platform engineers, that often means running Alibaba Cloud alongside AWS, Google Cloud, or Microsoft Azure.

The Kubeshark Workflow That Doesn't Stop at the Dashboard

The Observability Gap shows up the moment you try to reproduce a production bug locally. Your traces tell you a request was slow. Your logs tell you which line printed. Neither tells you what was actually on the wire: the headers, the JSON body, the surprise field your client started sending last Tuesday. Until now, closing that gap meant SSHing to a node, attaching a debugger, or shipping a sidecar through change review.

What is AI-Powered Observability? A Complete Guide for IT Teams in 2026

Is your monitoring stack really giving you clarity, or just more alerts? Your monitoring stack is probably working exactly as designed. That is the problem. As systems grow, most IT and platform teams start to see the same patterns, and traditional monitoring starts to feel limited. This is where teams begin exploring AI in observability. In this guide, we will explain what AI-powered observability actually means, how it works, and when it is useful.

Everything We Talked About at O11yCon 2026

We just wrapped O11yCon 2026, and this year's conversations hit differently. Agent-based software development is here, now. It's no longer optional, and everybody is struggling to understand what their agents are doing and how to make them cost less and perform better. Over the course of fifteen talks, we saw clearly that the old assumptions about how our software gets written, and who (or what) writes it, have been upended. Here are some highlights. We'll have videos available in the near future.

Search Azure Blob data in-place with BYOS for Cribl Lake

See how Bring Your Own Storage (BYOS) in Cribl Lake allows teams to connect directly to Azure Blob Storage and instantly search data in place — without moving, duplicating, or rehydrating telemetry. In this demo, Cribl Product Manager Risk Salsa walks through setup, dataset creation, and how to run fast investigations across your Azure-hosted data using Cribl Search.

Observability Expanding Beyond Infrastructure and Into AI Systems

Observability essentially revolves around understanding infrastructure health. Operations teams monitor applications, networks, databases, and cloud environments using familiar signals: logs, metrics, latency, uptime measurements, and traces. If systems remain available and performance stays within expected thresholds, teams have enough visibility to understand whether applications are functioning properly.

Inside the Grafana AI Team Weekly: Guards for AI Observability (May 5, 2026)

This is an excerpt from a real AI team weekly meeting where we talk about the stuff we build and occasionally also demo it! In this one, Principal Software Engineer Sven Großmann shows a new feature he's working on for AI Observability, called "guards". We're showing parts of our team meetings to build in public in some small way and give you a sneak preview of what's to come. But not all features we show may make it to production! You've been warned. :)

Your Microsoft Azure storage, our data lake power: The best of both worlds

The wait is over for Azure-first organizations. Cribl just launched Cribl Lake Bring Your Own Storage (BYOS) for Microsoft Azure, giving you full data lake power without moving a single byte of telemetry out of your environment. Join us to see how you can finally get the flexibility of a modern data lake while keeping your data in Azure.

Why Traditional Observability Breaks Down in Hybrid Cloud Environments

Hybrid cloud has reshaped the way enterprises build, run, and troubleshoot digital services. Applications now stretch across on-premises infrastructure, cloud platforms, regional services, interconnects, and distributed dependencies that change constantly. Operational complexity has expanded with that footprint, yet many observability practices still reflect assumptions from an earlier era of simpler architectures and clearer boundaries. That gap shows up fast during an incident.

The Complete Guide to Observability Pipelines

Modern engineering teams are drowning in telemetry data. A mid-sized Kubernetes cluster running 50 microservices can generate millions of log lines per minute. Add distributed traces, Prometheus metrics, cloud provider events, and application-level instrumentation and you're looking at terabytes of observability data every day. The problem isn't just volume. It's what you do with it.
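
As a rough check on the volume claim above, here is a back-of-the-envelope sketch. The rate of 2 million lines per minute and the 500-byte average line size are illustrative assumptions, not figures from the guide, and the estimate covers logs only (no traces or metrics).

```python
# Back-of-the-envelope estimate of daily log volume.
# Both inputs below are assumptions chosen for illustration.
LINES_PER_MINUTE = 2_000_000  # assumption: "millions of log lines per minute"
AVG_LINE_BYTES = 500          # assumption: average size of one log line
MINUTES_PER_DAY = 24 * 60


def daily_log_bytes(lines_per_minute: int, avg_line_bytes: int) -> int:
    """Return the number of log bytes produced in one day."""
    return lines_per_minute * avg_line_bytes * MINUTES_PER_DAY


daily_tb = daily_log_bytes(LINES_PER_MINUTE, AVG_LINE_BYTES) / 1e12
print(f"{daily_tb:.2f} TB/day")
```

Under these assumptions, logs alone already exceed a terabyte per day before any other signal type is added.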

How we made a SQL query optimization agent 59% more accurate using autoresearch and LLM Observability

Without experiment infrastructure to help you test your LLM applications, every research session starts with the same questions: What have we tried previously? What were the numbers? Which prompt version produced that result? Why did we discard that approach? The answers live in scattered notes, terminal history, and half-remembered conversations. Each handoff between sessions loses context. In practice, iteration can slow down as teams get bogged down in testing and analysis.

Honeycomb Canvas: The Multiplayer Workspace for the Agentic Era

Last week, we launched a major update to Canvas, our investigation workspace. The new Canvas has evolved from an AI co-pilot you chat with to a place where your whole team, human and agent, can work the same problem on the same surface. Auto-investigations begin the moment a trigger, SLO, or anomaly fires. Custom skills encode your team's runbooks so every agent investigates with your team's expertise built in.

Unlock telemetry value with a well-planned data lake

Your SIEM only holds a slice of your telemetry. Your data lake holds the rest. We'll show you how to use that to your advantage for investigations, threat hunting, and reporting. Why your data lake beats your SIEM for investigations – Your SIEM keeps a short window of expensive, filtered data. Your data lake keeps everything. When something goes wrong, that difference matters more than you think. Threat hunting without the handcuffs – Hunting across months of data in a SIEM is painful and costly. We'll show you how a well-planned lake makes broad, deep searches practical and affordable.

AI Observability In 2026: What It Is, The Five Pillars, And Why Cost Is The One Everyone Skips

AI observability covers performance, quality, reliability, safety, and cost. Most tools handle the first four. Here's what each pillar means, which tools cover which, and why cost is the dimension enterprises keep missing.

Agent Timeline: The Flight Recorder for Your AI Agents

Last week, we introduced Agent Timeline, a powerful new observability experience purpose-built for debugging AI agent workflows in production. Agent Timeline uniquely connects AI-layer visibility to full-stack observability by organizing telemetry around an agentic conversation. A conversation contains one or more agent executions, each of which may contain LLM calls, tool invocations, handoffs, retries, human escalations, and downstream system calls.

Get Lightrun AI Skills: Expert Workflows for AI Agents

Today we’re launching Lightrun AI Skills, structured, repeatable investigation workflows built for AI coding agents. With Lightrun MCP, agents like Claude Code, Codex, and Cursor can already instrument live production services and reason over live runtime evidence without a redeployment. But AI agents remain non-deterministic by design, using the same tool differently every session.

How Honeycomb Is Embracing the Challenges of End-to-End Observability with Embrace

Customers regularly come to us looking to solve their observability problem by connecting the dots from frontend to backend. It sounds straightforward in theory, but in practice it's one of the hardest problems in modern application monitoring. The frontend monitoring tools they already have in place tend to be proprietary or narrowly scoped to frontend needs, leaving them without the context-rich backend data that makes real triage possible.

Cribl Notebook templates in Cribl Search

Investigations are time-sensitive, and analysts shouldn’t waste time recreating the same workflows or rewriting familiar queries. Whether troubleshooting infrastructure, investigating suspicious IPs, or analyzing host activity, teams often rely on duplicating old processes and copying query snippets — a slow, inconsistent approach that’s hard to scale.

Honeycomb Innovation Week: Announcing Our Partnership With Embrace

Honeycomb and Embrace are extending the rigorous, data-driven practice that Honeycomb pioneered for foundational to mobile and web, giving site reliability and platform teams a complete, correlated picture of system health. The strategic partnership makes understanding performance and reliability for every user and every screen part of the observability practice, bringing new depth and standardization to how teams measure end user impact.

3 things you need to know about headless observability

If you're building agents and trying to figure out the best way to make them successful in production, you're going to want to know about headless observability. Headless observability means an agent can access information about the health of your system through a CLI instead of clicking around dashboards. It's the data layer that's going to unlock serious autonomy and allow you to scale with agentic workloads.

Honeycomb Achieves the AWS Financial Services Competency

Honeycomb is proud to share that we have achieved the Amazon Web Services (AWS) Financial Services Competency. This recognition validates our technical expertise and proven customer success in assisting financial services organizations with building, running, and understanding their production systems on AWS. Securing this competency is a direct response to our customers’ feedback in this space: observability in regulated, high-stakes environments requires more than dashboards and alerts.

Honeycomb Innovation Week: Debugging Agentic Workflows with Ken Rimple

Canvas skills are how your team's runbooks and tribal knowledge become an active part of the investigation instead of a document someone has to remember to open. Pre-built skills cover the most common investigation patterns out of the box. Custom skills let you encode the specific context, thresholds, and decision logic your team has accumulated, so every auto-investigation starts with your best thinking already applied.

From Monitoring to Observability: How DEX Integrations Strengthen IT Visibility and User Productivity

When I started working in IT in the late ’90s, IT performance was always measured by the health of infrastructure: CPU utilization, network latency, server uptime. For many organizations, little has changed in the last 30+ years. We became very good at keeping systems alive, yet users still struggled to get work done. That disconnect is exactly why Digital Employee Experience (DEX) has emerged as a critical discipline. But DEX on its own is not the end goal.

Observability for the Agent Era: Day 1 | Keynotes

Honeycomb's Innovation Week: Observability for the Agent Era (May 12-14). For Day 1 of Innovation Week, Honeycomb co-founders Christine Yen and Charity Majors will share what it actually takes to understand and debug systems in the agent era, and what the best engineering teams are doing differently. A 3-Day Virtual Event for Teams Building the Future. May 12: Get insights on how the best engineering teams are tackling the challenges of the agentic era.

Innovation Week Day 2: Observability for AI, and Observability With AI

AI is reshaping the SDLC in two directions at once. AI-generated code is shipping faster and with less human supervision than ever before, while agents and LLMs are running directly in production, where they behave very differently from traditional software: non-deterministic, with a wider blast radius than any single function or component, with no stack trace to catch when something goes wrong.

Honeycomb Innovation Week: Observability With AI With Kale and Taylor

Watch this video to see the re-imagined Canvas in action, where auto-investigation has already ranked your hypotheses before you open the tab, multiplayer agents build on each other's work in real time, and a custom skill encoding your team's own runbook can reprioritize the entire incident before you've had your morning coffee.

Observability for the Agent Era: Day 2 | Launches

Honeycomb's Innovation Week: Observability for the Agent Era (May 12-14). For Day 2 of Innovation Week, Honeycomb's product and engineering teams will take you inside the new capabilities purpose-built for the agent era. Expect live demos, real scenarios, and a hands-on look at what it means to own observability for the agentic era, with AI in Honeycomb to observe AI in production. A 3-Day Virtual Event for Teams Building the Future. May 12: Get insights on how the best engineering teams are tackling the challenges of the agentic era.

Innovation Week Day 1: The SDLC Is Collapsing, and Observability Has Never Mattered More

The software development lifecycle is collapsing. The multi-stage pipeline that defined how software got built and shipped for decades is compressing into rapid loops of intent and validation, with agents now part of the teams building and running it. Day 1 of Innovation Week was about what that shift means for how software gets validated, where observability fits, and the problems that have always been hard but are now genuinely urgent.

Security Integrations in Observability Self-Hosted

Integrating security data with observability data provides a comprehensive view for better threat detection and response. Security observability helps connect the dots between seemingly innocent events that, when correlated, reveal complex attack patterns. SolarWinds security products integrate into observability self-hosted, including Security Event Manager for log data and event correlation, Access Rights Management for identifying potential attack vectors, configuration management for compliance monitoring, and Patch Manager for tracking critical updates.

Turn Noisy Logs Into Structured Data with Uptrace Grouping Rules

Same log pattern. Hundreds of useless groups. In this video, we show how to use Uptrace Grouping Rules to automatically turn noisy logs into structured, searchable data, without changing application code. Perfect for OpenTelemetry users, backend engineers, SREs, and anyone dealing with noisy logs.

Making Semantic Conventions Work for You With OpenTelemetry Weaver

Your dataset has hundreds of attributes. Some are self-explanatory: http.response.status_code, server.address. Others are not: meta.refinery.reason, dataset.slug, sli.latency_target_ms. If you don't know what an attribute means, you can't write a good query. And if an AI agent doesn't know what it means, it guesses.

Why Alert Fatigue Solutions Still Miss the Root Cause

Alert fatigue solutions have never been better, but on-call engineers are still burning out. Threshold tuning, AI triage, and alert correlation reduce the noise, but every alert that clears filtering lands with the same incomplete telemetry and triggers the same manual investigation cycle. This post explains why the evidence gap survives every fix, and how runtime context changes that.

Multi-tiered Observability: A Practical Way to Handle Diverse Workloads

Observability in large companies is rarely one-size-fits-all. The VictoriaMetrics topologies guide shows why different deployment patterns are needed as scale, isolation, and reliability requirements grow. Different workloads require different trade-offs: some need long retention for audits and trend analysis, while others need higher resolution for debugging. Business-critical systems also demand dependable alerting and high availability, often with several 9s of reliability.

Span or Attribute in OpenTelemetry Custom Instrumentation

TL;DR: Attribute. More information on one event gives us more correlation power. It’s also cheaper. When you want to add some information to your tracing telemetry, you could emit a log, create a span, or add a piece of data to your current span. Adding a piece of data to your current span is the best! Usually.
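
To make the correlation argument concrete, here is a toy sketch. It models spans as plain Python dicts and does not use the OpenTelemetry API; the attribute name `cart.item_count` and the `checkout` span are hypothetical. It compares storing a value as an attribute on the current span against emitting it as a separate child span, then runs the same two-condition query against both.

```python
# Toy model only: spans are plain dicts, not OpenTelemetry objects.

def checkout_span() -> dict:
    """Build a hypothetical span for a failed checkout request."""
    return {"name": "checkout", "attributes": {"http.response.status_code": 500}}


def add_as_attribute(span: dict, key: str, value) -> list:
    """Attach the data to the span that is already in flight."""
    span["attributes"][key] = value
    return [span]  # still a single event to store and query


def add_as_child_span(span: dict, key: str, value) -> list:
    """Record the same data as a separate child span."""
    child = {"name": key, "parent": span["name"], "attributes": {key: value}}
    return [span, child]  # two events; correlating them needs a join


def matches(event: dict) -> bool:
    """Query: failed requests that also had 42 items in the cart."""
    attrs = event["attributes"]
    return (attrs.get("http.response.status_code") == 500
            and attrs.get("cart.item_count") == 42)


wide = add_as_attribute(checkout_span(), "cart.item_count", 42)
split = add_as_child_span(checkout_span(), "cart.item_count", 42)

wide_hits = [e for e in wide if matches(e)]    # one event carries both facts
split_hits = [e for e in split if matches(e)]  # no single event carries both
print(len(wide), len(wide_hits), len(split), len(split_hits))
```

In this model the attribute version stores one event and answers the query directly, while the child-span version stores two events and finds nothing without a join on the parent.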

Observability and Security for the AI Era

Datadog has always been driven by a broader vision of helping teams understand and operate complex systems. In this session, you’ll hear from Michael Whetten, Product SVP, and Abrar Hussain, Senior Director, Product Management, as they share the latest updates across the Datadog product suite and discuss how that vision continues to shape the platform’s evolution and support the next generation of AI-driven applications.

Why Blast Radius Analysis Does Not End When Alerts Fire

Modern distributed systems fail in ways that can bypass even well-designed isolation patterns. When a failure is actively propagating across services at four in the morning, the question shifts from “how do we limit the blast radius” to “how do we confirm what it actually is.” Monitoring shows which services are in the impact zone, but it cannot show what code path caused the failure to spread, or whether it has stopped.

How to Prevent AI Agents From Deleting Production Data

There’s a new question teams are asking: how can we prevent AI agents from deleting production? When Cursor deleted PocketOS’s entire production database in nine seconds, the agent wasn’t malfunctioning. It had full technical capability, but it was inferring operational authority from static code rather than live environment state. That gap between capability and context is the root cause. This article breaks down exactly how that happens, and what runtime visibility does to stop it.

The cost of knowledge

In the world of observability, “cardinality” has become a heavy word. It is a ghost used to justify skyrocketing bills or degraded query performance. When cardinality rises, the advice is almost always the same: reduce it. Drop your labels, or reduce the dimensions. It is usually framed as “optimization.” Every label you add to a metric is a dimension of knowledge. Each one gives you a way to slice, compare, and explain the chaos of production.
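
The trade-off can be put in numbers with a small sketch. It assumes the worst case where every combination of label values occurs, so the series count for one metric is the product of the distinct values per label. The label names and counts are hypothetical.

```python
from math import prod


def max_series(label_values: dict) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values of each label (assumes every combination occurs)."""
    return prod(label_values.values())


# Hypothetical label set: distinct values observed per label.
labels = {"service": 50, "endpoint": 20, "status_code": 5}
before = max_series(labels)

labels["region"] = 4  # one more dimension to slice and compare by
after = max_series(labels)

print(before, after)
```

Adding the fourth label multiplies the upper bound by four, which is the cost side of the extra dimension of knowledge.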

Moving Beyond SolarWinds: A Guide to Modern Observability

Industry-leading observability experts provide strategic guidance on why and how modern IT teams are successfully moving beyond SolarWinds to more resilient, cloud-native platforms. IT teams running SolarWinds often know the pain points well before they start evaluating alternatives: separate modules for different monitoring needs, a self-hosted deployment model that requires ongoing maintenance, and pricing that gets harder to predict after each acquisition.

How Scalability Works in SolarWinds Observability Self-Hosted

Cheryl Nomanson, SolarWinds staff technical trainer, provides a comprehensive overview of SolarWinds architecture and scaling options for self-hosted deployments. She explains the centralized deployment model starting with a single SolarWinds server that handles polling, web console, and database connections. The presentation covers key scaling indicators including polling thresholds that warn users at 85% capacity and alert at 100%. She demonstrates how to add up to 100 polling engines per server and additional web servers to handle more concurrent users.

Observability vs Monitoring: What's the Real Difference in 2026?

Understand the real difference between observability and monitoring, and why modern IT teams in 2026 need both. Monitoring tells you something is broken; observability explains why. See real examples, faster troubleshooting workflows, and how Motadata ObserveOps unifies both in one platform.

Introducing the Coralogix CLI: Headless Observability for Every Agent

This article is a high-level overview of the Coralogix CLI. For a deeper look at how it works in practice, read the full technical deep dive here. Agent-driven investigation sounds simple: read the alert, query the data, return the cause. In reality, most agents either overload their context window with raw logs or guess at queries and return incorrect results.

Taming Log Noise With the OpenTelemetry Collector's Drain Processor

Do you receive 50 million log lines per day and struggle to see what actually matters? Health checks, heartbeat pings, connection pool messages—they all drown out the errors and anomalies you're trying to find. Most teams deal with this by writing filter rules to drop the noisy patterns. But those rules are manual, per-pattern, and brittle. A new deployment changes a log format and the filter misses it. A new service starts logging a chatty startup sequence nobody thought to exclude.
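
The idea of grouping logs by pattern instead of writing per-pattern filters can be sketched as follows. This is not the Drain algorithm or the Collector processor itself, only a simplified stand-in that masks every whitespace-separated token containing a digit and counts the resulting templates. The sample log lines are made up.

```python
from collections import Counter


def to_template(line: str) -> str:
    """Mask tokens that contain a digit, a crude proxy for variable parts."""
    return " ".join(
        "<*>" if any(ch.isdigit() for ch in token) else token
        for token in line.split()
    )


# Made-up sample lines for illustration.
lines = [
    "GET /healthz 200 in 3ms",
    "GET /healthz 200 in 5ms",
    "heartbeat from node-17 ok",
    "heartbeat from node-4 ok",
    "payment failed: card declined",
]

counts = Counter(to_template(line) for line in lines)
rare = [template for template, n in counts.items() if n == 1]

for template, n in counts.most_common():
    print(n, template)
```

Because templates are derived from the lines themselves, a changed log format shows up as a new template rather than slipping past a hand-written filter.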

Get Observability in the Terminal, for You and Your Agents: gcx

The way you write code is changing, which means the way you observe your systems and respond to issues needs to change, too. Engineers today spend much of their day working via command line, as agentic tools like Cursor and Claude Code have become highly effective at handling many day-to-day engineering tasks. This greatly accelerates code generation, but it doesn't solve for the context switching that comes when you have to jump into another tool that's not part of this new, faster workflow.

Why Does MTTD Stay High Despite Observability Tools Running?

Monitoring coverage, anomaly detection, and SLO-based alerting have significantly narrowed detection windows for most failure types, but MTTD remains stubbornly high for a specific class of silent failure. This blog covers why type mismatches, swallowed exceptions, and values that pass validation occur without triggering errors, and what changes when your monitoring stack can generate those signals without waiting for a failure to surface them.

Add dynamically updating context to logs with Reference Tables and Observability Pipelines

Security and platform engineering teams rely on context-rich logs to investigate threats, prioritize incidents, and meet compliance requirements. Context is often stored separately from applications that generate logs, in sources like threat intelligence feeds in Snowflake, asset lists in Amazon S3, ownership data in ServiceNow CMDB, and risk scores produced in Databricks.
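
The enrichment pattern described above can be sketched as a lookup at processing time. This is a generic illustration, not the Datadog Observability Pipelines configuration; the table contents, the `src_ip` key, and the `enrich` function are hypothetical.

```python
# Hypothetical reference table keyed by source IP.
reference_table = {
    "10.0.0.5": {"owner": "payments-team", "risk_score": 87},
}


def enrich(log: dict, table: dict, key: str = "src_ip") -> dict:
    """Return a copy of the log with matching reference fields merged in.
    Logs without a match pass through unchanged."""
    extra = table.get(log.get(key), {})
    return {**log, **extra}


known = enrich({"src_ip": "10.0.0.5", "msg": "login failed"}, reference_table)
unknown = enrich({"src_ip": "10.0.0.9", "msg": "login failed"}, reference_table)

# The lookup happens per log, so a table update applies to later logs
# without touching the application that emits them.
reference_table["10.0.0.9"] = {"owner": "search-team", "risk_score": 12}
later = enrich({"src_ip": "10.0.0.9", "msg": "login failed"}, reference_table)

print(known, unknown, later, sep="\n")
```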