Automation Observability: See It, Fix It, Skip the Firefighting

IT leaders know the drill. An alert storm rolls in and the tickets pile up. Your team scrambles to piece together root causes before service degradation kicks in. But the firefighting rages on, even when you have enough dashboards, monitoring, and alerts to light up a Christmas tree. Enterprise leaders need to quit burning budget on shiny dashboards that look good in the boardroom but do nothing to stop outages in the real world.

Paving the way for a new era: Mezmo's Active Telemetry

The world of software development has fundamentally changed. We've moved from monthly releases to continuous delivery measured in minutes, and the rise of AI means velocity is no longer just a goal—it's a requirement for survival. But this relentless speed has exposed a critical flaw in how we approach observability. The industry relies on a "store first, ask questions later" model where you collect every log, metric, and trace, and then hope to find the root cause when something breaks.

OpenMetrics vs OpenTelemetry - A guide on understanding these two specifications

OpenMetrics and OpenTelemetry are popular standards for instrumenting cloud-native applications. Both projects are part of the Cloud Native Computing Foundation (CNCF) and aim to simplify how we generate, collect, and monitor telemetry from services in a modern cloud-native distributed application environment. Let's look at how both standards aim to help solve the observability conundrum.

How to Become an SRE Engineer

Site Reliability Engineering has emerged as one of the most sought-after careers in tech, combining software engineering expertise with operational excellence. SRE engineers ensure that critical systems remain reliable, scalable, and performant while enabling rapid feature development. With the global SRE job market projected to grow by over 25% in 2025, skilled professionals in this field command competitive salaries and enjoy diverse career opportunities across industries.

Grafana Labs Co-founder Woods: Market maturity, OpenTelemetry, and AI are reshaping observability

As organizations navigate increasingly complex tech environments, unified observability practices have become essential. That was one of the main takeaways from Grafana Labs Co-founder Anthony Woods’ recent appearance on “Tech Keys by Mercari India,” a podcast hosted by Vaibhav Khurana, Head of Platform Engineering at Mercari India.

CriblCon sneak peek with AlphaSOC

The countdown to CriblCon is on, and we’re giving you an exclusive first look at the expert insights, innovative solutions, and success stories you’ll see on the big stage. Join us as we chat with Chris McNab, Founder of AlphaSOC, a security startup that processes network telemetry to uncover infected hosts, emerging threats, and targeted attacks.

Monitor your data pipelines with Airflow lineage

In complex data pipelines with dozens of jobs and intermediary datasets, it can be difficult to effectively monitor how data travels and changes through various steps. When tracking issues in these pipelines, you need visibility into upstream components where the root cause may originate, as well as downstream datasets and data consumers that may be experiencing further impacts.
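
The upstream and downstream views described here amount to graph traversal over lineage data. The following is a minimal sketch in plain Python, not Airflow's or any vendor's lineage API; the `LINEAGE` graph and every node name in it are hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: each key is a job or dataset, and each value
# lists the nodes that directly consume its output.
LINEAGE = {
    "raw_orders": ["clean_orders_job"],
    "clean_orders_job": ["orders_clean"],
    "orders_clean": ["daily_revenue_job", "churn_features_job"],
    "daily_revenue_job": ["revenue_report"],
    "churn_features_job": ["churn_features"],
    "revenue_report": [],
    "churn_features": [],
}


def reachable(graph, start):
    """Return every node reachable from start, using breadth-first search."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph):
    """Flip edge direction so consumers point back at their producers."""
    inverted = {node: [] for node in graph}
    for node, consumers in graph.items():
        for consumer in consumers:
            inverted.setdefault(consumer, []).append(node)
    return inverted


# Where could a problem in orders_clean have come from, and what does it affect?
upstream = reachable(invert(LINEAGE), "orders_clean")
downstream = reachable(LINEAGE, "orders_clean")
```

Walking the inverted graph gives the candidate root causes, while walking the original graph gives the datasets and consumers that may be affected.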

Introducing the BigPanda observability and monitoring tool rationalization framework

When enterprises run dozens of monitoring and observability tools, performance gaps almost always emerge. By applying the BigPanda Observability Scorecard, our customers consistently see their tool portfolio fall into three groups. In some cases, removing bottom-tier tools can reduce portfolio complexity by double digits while cutting operational noise by as much as 35-40%. This simplification reduces costs while creating a leaner, more reliable monitoring environment that strengthens service availability and operational efficiency.

How to analyze observability and monitoring tools for actionability

Choosing the right observability tools is critical to ensure your teams get actionable insights. In this video, we explore how to evaluate observability platforms based on their ability to detect anomalies, link causes, and trigger effective responses.

Monitor and optimize your systems with Uptrace

Uptrace is your single source of truth for monitoring, understanding, and optimizing complex distributed systems. Proven in production for over five years and trusted by more than a thousand installations worldwide, it lets you see your system like never before. What makes the difference is that Uptrace is pure OpenTelemetry, built natively from day one. This isn't a translation layer—it's a direct connection that eliminates friction and ensures zero vendor lock-in. Your homepage serves as your command center, providing complete visibility across your stack at a glance.

How to Push Prometheus Metrics to Splunk Observability Cloud with the OpenTelemetry Collector

In this video, you’ll learn how to scrape Prometheus endpoints with the OpenTelemetry Collector’s Prometheus receiver and send metrics to Splunk Observability Cloud. We’ll walk through configuring three common data sources (a Python Flask app, node_exporter for host metrics, and the NGINX Prometheus exporter), show how to enrich metrics with resource attributes, and build simple charts in Splunk Observability Cloud. You’ll see how centralized scraping and consistent tagging make it easy to manage and visualize Prometheus metrics in Splunk Observability Cloud.
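
For readers unfamiliar with what a Prometheus scrape actually retrieves, here is a minimal sketch of the text exposition format that a scraped endpoint serves. It uses only the Python standard library rather than a real client library, and the function name, metric name, and labels are hypothetical. It does not escape special characters in label values, which a production exporter would need to do.

```python
def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter for a small web app.
body = render_metric(
    "http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [({"method": "GET", "code": "200"}, 1027), ({"method": "POST", "code": "500"}, 3)],
)
```

A scraper reads this text over HTTP at a fixed interval, which is why consistent metric names and labels at the source make later enrichment and charting simpler.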

Kubernetes Observability: Your Q&A Guide to Calico Whisker

Getting the most out of Whisker requires understanding its inner workings and this guide is designed to help you master this exciting tool with support from the Calico community. We’ve compiled the most frequently asked questions from our community Slack, support conversations, and CalicoCon sessions. This Q&A covers everything from initial installation tips and version requirements to advanced topics like filtering flow logs and integrating with Goldmane, the powerful API that underpins Whisker.

How to Responsibly and Effectively Contribute to Open Source Using AI

With the influx of AI tooling, it’s never been easier to contribute to open source communities. These tools are capable of gathering context quickly, “understanding” repositories faster than ever before. They provide instant summaries about repositories that, previously, would have meant reading lines and lines of code. They can fix bugs in programming languages you don’t know, and ultimately allow more contributors to get involved, which (almost) every open source project wants.

Memory stall: the agony before OOM

When we set a memory limit for a container, the expectation is simple: if the app leaks memory, the OOM killer steps in, the container dies, Kubernetes restarts it, done. But reality is messier. As a container gets close to its memory limit, allocations don’t just fail instantly. They get slower. The kernel tries to reclaim memory inside the cgroup, and that takes time. Instead of being killed right away, your app just crawls.
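
One way to observe this kind of slowdown is the kernel's pressure stall information (PSI), which cgroup v2 exposes per cgroup in a `memory.pressure` file. The excerpt above does not name PSI, so treat this as an assumption about the environment: the sketch parses the file's documented format from a hypothetical sample string, and the 10% threshold is arbitrary.

```python
# Hypothetical contents of a cgroup v2 memory.pressure file. "some" is the
# share of time at least one task was stalled on memory; "full" is the share
# of time all non-idle tasks were stalled at once.
SAMPLE = """\
some avg10=42.15 avg60=12.30 avg300=3.01 total=9876543
full avg10=30.02 avg60=8.11 avg300=2.00 total=6543210
"""


def parse_pressure(text):
    """Parse PSI text into {"some": {...}, "full": {...}} with float values."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def is_stalling(pressure, threshold=10.0):
    """Flag a cgroup whose tasks spent >= threshold % of the last 10 s stalled."""
    return pressure["some"]["avg10"] >= threshold


pressure = parse_pressure(SAMPLE)
stalling = is_stalling(pressure)
```

Alerting on a rising `avg10` can surface the crawl before the OOM killer ever fires.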

Your Next Observability RFP is All Wrong. Why AI Changes Everything

AI-first observability addresses two of the most pressing troubleshooting challenges: complex IT environments and AI-generated code. But understanding how to implement AI in a way that brings ROI requires cutting through the hype and maintaining realistic expectations while keeping a forward-thinking vision. In this blog post, we bring practical tips for including AI in your next observability RFP. The article is based on a webinar held with Logz.io founders, CEO Tomer Levy and CTO Asaf Yigal.

Integrating JMX and OpenTelemetry

The OpenTelemetry community and the contributors to the Java Special Interest Group (SIG) have spent a great deal of time integrating core Java technologies into the project. An integration that is particularly useful is Java Management Extensions (JMX). It has been around since J2SE 5, and has been mature for some time. Many of the most widely used Java applications have adopted it over time and support this extension.

The one where we talk about Cribl Guard

Manual hunts for sensitive data are slow, error-prone, and expensive. Cribl Guard combines advanced AI with a human-in-the-loop control point to spot sensitive data, such as credit card, passport, and Social Security numbers, as it flows through Cribl Stream. Whether you’re fully cloud or hybrid, Cribl Guard puts you firmly in control of every piece of sensitive information that crosses your pipes.

Instrumenting the Node.js event loop with eBPF

Recently, I was testing Coroot’s AI Root Cause Analysis on failure scenarios from the OpenTelemetry demo. One of them, loadgeneratorFloodHomepage, simulates a flood of excessive requests. As expected, it caused a latency degradation across the stack. Coroot’s RCA highlighted how the latency cascaded through all dependent services. At the same time, we noticed a moderate increase in CPU usage for the frontend service and the node itself.

LLM app Observability: OpenTelemetry as a standard

LLM observability is broken. There are too many new libraries floating around, but they don't accurately follow the OpenTelemetry conventions. OTel isn’t perfect for LLMs yet, but extending a proven standard beats inventing another one. Why not use the same standard (OTel) that works so well for the rest of the apps, and just build on top of it? This is what I was ranting about with Pranav Raj S, co-founder at Chatwoot, and we thought there must be other folks facing similar issues.

OpenTelemetry Observability: An In-Depth Look at Features and Best Practices

OpenTelemetry (OTel) is a unified framework of APIs, SDKs, and tools for collecting, processing, and exporting telemetry data (logs, metrics, and traces) across applications and infrastructure. OTel is especially valuable in today’s cloud-native world, where applications run on microservices, Kubernetes, and distributed systems.

Observability Day San Francisco: The Future of AI and Observability Is Bright

AI and observability are no longer separate conversations—they’re deeply intertwined. Across keynotes, panels, and demos, speakers at Honeycomb's Observability Day San Francisco unpacked what that means for engineering teams today: faster insights, smarter tools, and new challenges to solve.

Your Next Observability RFP Is All Wrong: Why AI Changes Everything

Watch how AI is reshaping observability for the years ahead. In this fireside chat, Logz.io founders Tomer Levy and Asaf Yigal reveal how the most innovative AI-first companies are breaking free from dashboards, avoiding common RFP mistakes, and building future-ready stacks. Watch and learn how autonomous AI eliminates noise, slashes costs, and gives engineering teams back their velocity.

What does the EU Data Act mean for Observability?

The EU Data Act entered into force on January 11th, 2024, and most of its provisions apply from September 12th, 2025. The EU Data Act is designed to give individuals and businesses more control over the data they generate, ensuring fair access, use, and sharing across sectors. For any data-generating platform that intends to operate in the European Union, this new legislation matters.

Observability and IT Monitoring Governance: Establishing Order (Part 3 of 4)

In our previous posts, we explored why robust IT monitoring governance is no longer a luxury but a strategic imperative. We highlighted how a disciplined framework prevents blind spots, reduces risk, and ensures the reliability and scalability of your critical business applications. But how do you translate these principles into practical, actionable governance within your IT environment?

Unlock Real-Time AWS Observability With Streaming Ingestion in DX Operational Observability

In fast-paced cloud environments, traditional monitoring methods often fall short. This leaves teams with latency and data gaps. It’s time to gain near real-time visibility into your AWS telemetry, enabling faster incident response and deeper insights. With its new streaming ingestion capabilities, DX Operational Observability (DX O2) is revolutionizing cloud monitoring—enabling teams to leverage AWS CloudWatch Metric Streams and Amazon Kinesis Data Firehose.

Calico Whisker vs. Traditional Observability: Why Context Matters in Kubernetes Networking

Are you tired of digging through cryptic logs to understand your Kubernetes network? In today’s fast-paced cloud environments, clear, real-time visibility isn’t a luxury, it’s a necessity. Traditional logging and metrics often fall short, leaving you without the context needed to troubleshoot effectively. That’s precisely what Calico Whisker’s recent launch (with Calico v3.30) aims to solve. This tool provides clarity where logs alone fall short.

Bridging the Gap: Integrating Logs, Metrics, and Flow for Observability

In this video, we discuss handling both old and new systems in IT environments. From legacy SNMP setups to modern telemetry, most organizations juggle multiple data sources, which can make observability feel overwhelming. We explore how to combine logs, metrics, and flow data into one system that provides actionable insights. You’ll see practical examples of simplifying scattered tools and making sense of complex, disparate information. Understanding how these different types of data work together is key to getting observability right.

Smoother, smarter observability with the updated Site24x7 iOS 26

Enjoy improved control, clarity, and communication using the Site24x7 app on iOS 26. This update blends Apple's dynamic liquid glass design language with fast, secure, on-device AI summaries that help you observe your IT stack instantly and act decisively, from anywhere.

LangChain Observability: Monitoring Guide for Production Apps

LangChain applications fail differently than traditional web apps. A single user request can trigger 15+ LLM calls, cost $5 in tokens, and fail silently without throwing errors. One team discovered a $12,000 OpenAI bill caused by a recursive chain with no monitoring. This guide shows how to implement observability for LangChain applications, giving you complete visibility into performance, costs, and errors before they impact your users or budget.
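
The failure mode described here, many LLM calls and runaway token spend inside one request, can be guarded with a per-request budget. The sketch below is plain Python and does not use LangChain's own callback API; the class name, the limits, and the prices in `PRICE_PER_1K` are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per 1,000 tokens; real prices vary by model.
PRICE_PER_1K = {"prompt": 0.01, "completion": 0.03}


@dataclass
class RequestUsage:
    """Track LLM calls and estimated cost for a single user request."""
    max_calls: int = 15
    max_cost_usd: float = 5.0
    calls: int = 0
    cost_usd: float = 0.0

    def record(self, prompt_tokens, completion_tokens):
        """Add one LLM call and raise once the request exceeds its budget."""
        self.calls += 1
        self.cost_usd += (
            prompt_tokens / 1000 * PRICE_PER_1K["prompt"]
            + completion_tokens / 1000 * PRICE_PER_1K["completion"]
        )
        if self.calls > self.max_calls or self.cost_usd > self.max_cost_usd:
            raise RuntimeError(
                f"LLM budget exceeded: {self.calls} calls, ${self.cost_usd:.2f}"
            )


# Three calls stay within a budget of three; a fourth would raise.
usage = RequestUsage(max_calls=3)
for _ in range(3):
    usage.record(prompt_tokens=1000, completion_tokens=500)
```

Raising on overrun turns a silent recursive chain into a visible error, and the same counters can be exported as metrics per request.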

Background Job Observability Beyond the Queue

Background jobs handle the critical work that happens outside the request path: processing payments, sending emails, generating reports, syncing data. They keep applications running smoothly, but the signals they produce look different from API endpoints. Most teams start with queue metrics—how many jobs are waiting and how quickly they complete. These metrics provide the foundation, but job health extends beyond throughput.
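
To illustrate signals that sit alongside raw throughput, here is a minimal sketch that derives queue wait, run time, and failure rate from per-job timestamps. The `JobRun` record and its field names are hypothetical and are not tied to any particular job framework.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One finished background job, with timestamps in seconds."""
    enqueued_at: float
    started_at: float
    finished_at: float
    succeeded: bool


def summarize(runs):
    """Summarize job health for a non-empty list of finished runs."""
    if not runs:
        raise ValueError("summarize() needs at least one run")
    n = len(runs)
    waits = [r.started_at - r.enqueued_at for r in runs]
    durations = [r.finished_at - r.started_at for r in runs]
    return {
        "count": n,
        "max_queue_wait_s": max(waits),       # how long the unluckiest job sat idle
        "mean_run_time_s": sum(durations) / n,
        "failure_rate": sum(not r.succeeded for r in runs) / n,
    }


runs = [
    JobRun(0, 2, 5, True),
    JobRun(1, 9, 10, True),   # waited 8 s in the queue but ran for only 1 s
    JobRun(2, 3, 7, False),
    JobRun(3, 4, 6, True),
]
summary = summarize(runs)
```

In this sample the queue looks healthy by completion count, yet one job waited eight seconds and a quarter of the jobs failed, which throughput alone would not show.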

What is Service Catalog Observability and How Does It Work?

A service catalog gives teams a shared view of their systems—what services exist, who owns them, how dependencies are structured, and the SLAs that guide expectations. It’s an important part of development infrastructure because it helps everyone speak the same language about services. Service catalog observability builds on that foundation.

Introducing Cost Meter - Proactive Observability Cost Control with Per-Hour Granularity

The irony isn't lost on us - observability platforms are built to be proactive about system health, yet when it comes to managing observability costs themselves, teams are forced to be reactive. Today, that changes with Cost Meter, now live in our platform. Cost Meter transforms observability spend management from a monthly billing surprise into a proactive, data-driven process with hourly aggregated metrics that give you complete visibility into your telemetry ingestion patterns.

APM vs Observability: Observing beyond APM

In my previous post I made a bold, sweeping statement that APM is not - in the most specific sense - a subset of observability. I stand by that because words matter and - like many "monitoring engineers" (IT folks who make monitoring and observability their specialty) - I, too, bear scars from the flame-wars on Twitter back in the 2020s where we fought internecine battles over the proper definition of (and number of pillars in) “observability”.

Introducing Honeycomb Intelligence Canvas

Canvas is an AI-guided workspace inside Honeycomb that combines an AI assistant with an interactive notebook for visualizing query results and traces. You can ask a natural language question about your data and Canvas will immediately start exploring your traces, through multiple queries and other tools, to find the right next steps. Instead of having to write each query yourself, Canvas automatically proposes relational queries, comparisons, and visualizations that explain why an SLO fired or what changed after a deploy.

Pastries with SREs: Limitless observability and uncompromised donuts

In this episode of Pastries with SREs, we dig into Limitless Observability with a sweet side of unified observability strategy. If you're tired of siloed tools, fractured data, and swivel-chair investigations, this one’s for you. We explore: Why are silos still the norm in modern observability? What’s the true cost of inefficiencies across logs, metrics, and traces? How can SREs, IT operations, and dev teams shift to a no-compromise, unified observability model?

Meet Canvas: Your AI-guided Workspace Within Honeycomb

Modern systems are wonderfully capable, but relentlessly complex. Debugging across microservices, frontends, and cloud edges often means switching between five or more tools, trying to stitch together “what changed” and “why it broke.” Honeycomb’s wide events model has proven to be a superpower for taming that complexity, by allowing you to easily observe and query end-to-end traces without worrying about how much granular data you attach to your events.

Full-Stack Observability with VictoriaMetrics in the OTel Demo

The OpenTelemetry Astronomy Shop is a widely used demonstration environment designed to illustrate the concepts and practical implementation of observability in distributed systems. Built as a microservice-based e-commerce application, the demo provides developers with a near real-world environment where they can explore how telemetry data—metrics, logs, and traces—can be collected, processed, and visualized.

Introducing Anomaly Detection: Your Early Warning System for Service Health

Modern engineering teams face a persistent challenge: knowing when something goes wrong before their customers do. With microservices architectures sprawling across dozens or hundreds of services, creating comprehensive alerting becomes an overwhelming task. You're left playing whack-a-mole with manual alert configurations, often missing critical issues or drowning in false positives.
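
As a point of comparison with hand-written alert thresholds, the sketch below flags points that deviate sharply from a rolling baseline. It is a generic rolling z-score, named plainly as a swapped-in technique; the announcement does not describe the product's actual detection method, and the window size, threshold, and sample latencies are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value is > threshold standard deviations away
    from the mean of the preceding `window` values."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # A perfectly flat baseline has sigma == 0; skip it to avoid dividing by zero.
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101, 100]
spikes = find_anomalies(latencies_ms)
```

Because the baseline moves with the data, the same function works for services with very different normal latencies, which is the part that fixed per-service thresholds make tedious.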

Visually identify observability gaps with Cloudcraft in Datadog

Modern cloud environments are highly complex and dynamic, with critical services relying on large numbers of ephemeral resources. Ensuring observability coverage across this landscape is essential for troubleshooting, maintaining reliability, optimizing performance, and enforcing security standards. But as environments grow more elaborate and their ownership more dispersed, tracking observability coverage becomes increasingly challenging.

Visualize Logs Alongside Metrics: Complete Observability Elasticsearch Performance

Elasticsearch is a distributed search and analytics engine that powers everything from log management platforms to e-commerce search bars. It excels at indexing and retrieving large volumes of data quickly, but like any complex system it can slow down under heavy load or inefficient queries.

Introducing Honeycomb Intelligence Anomaly Detection

Modern teams face a persistent challenge: knowing when something goes wrong before their customers do. With architectures sprawling across dozens or hundreds of services, creating comprehensive alerting becomes an overwhelming task. You're left playing whack-a-mole with manual alert configurations, often missing critical issues or drowning in false positives. Today, we're excited to announce our solution to this challenge: Anomaly Detection (currently in alpha), Honeycomb's proactive approach to understanding and acting on service health.

Introducing Honeycomb Intelligence MCP Server - Now GA!

In the months since we launched our public beta, we’ve been hard at work making Honeycomb MCP more useful and capable for agents and human operators alike. Our goal with this project has been, from the start, to allow AI to engage in the same kind of investigatory loops that we guide users towards. Many of the new features are designed expressly with this in mind, the most exciting of which is BubbleUp, now available in.

Honeycomb MCP Is Now In GA With Support for BubbleUp, Heatmaps, and Histograms

If you’ve been following my public journey with LLMs this year, it probably won’t surprise you to learn that this blog post is an announcement about the general availability of Honeycomb’s hosted MCP server. I want to share a few updates about what’s new in the GA release, discuss some interesting learnings from building it, and share examples of how we’re using MCP internally. First: if you're still in the dark about MCP and AI agents, go read the earlier blogs I linked.

Observability and Monitoring Governance (Part 1 of 4)

In contrast to the many flavors of governance used for IT, such as data governance, audit and compliance, and governance and security, IT monitoring governance lacks a definition in many organizations. This is true even as teams have decades of experience monitoring the health, performance, and availability of applications, infrastructures, networks, and user experience. Good monitoring governance “just sort of happens—naturally, organically.” Not exactly!

Observability Journey Panel - Dell x TekStream

Join Dell Technologies, TekStream Solutions, and Grafana Labs for a candid panel on scaling observability. Learn how enterprise teams scale observability, balance centralized vs. decentralized models, and accelerate adoption. The panel explores challenges with culture, governance, tool sprawl, and how AI is reshaping monitoring and incident response.

Software-Defined Healthcare: Modernizing Through DevOps, Observability & AIOps

Healthcare delivery is undergoing a transformation unlike any other. Digital systems now shape how physicians deliver care, how practices are managed, and how patients experience the health system. From cloud-native platforms to intelligent automation, the shift toward software-defined healthcare is revolutionizing clinical operations. At the heart of this change are three critical enablers: DevOps, Observability, and AIOps. Together, they form the backbone of a modern healthcare IT environment, driving resilience, agility, and patient-centered outcomes.

How Teams Are Using AI to Tackle Observability Challenges (2025 Survey Insights) | Grafana Labs

In Grafana’s 3rd annual Observability Survey, over 1,000 engineers and leaders shared their challenges — tool sprawl, complexity, rising costs, and nonstop alerts — and their hopes for how AI can help.

SvelteKit observability just got 10x better, and we're here for it

The Svelte Team recently announced full observability and tracing support for SvelteKit! This is great news for SvelteKit and Sentry users, since Sentry is already compatible with the new feature! In addition, this is even greater news for the JavaScript ecosystem as a whole because SvelteKit just became the first ESM-based meta-framework to support instrumentation and tracing out of the box.

Sharpening My React Hooks Knowledge With ChatGPT

I’m a product engineer at Honeycomb. While my work spans the stack, I’m currently focused on deepening my frontend expertise. To support this, I’ve been using ChatGPT as a study assistant. It’s helped me break down complex topics with clear explanations, real-world examples, and—critically—interactive practice. The most effective formats I’ve found.

The Fourth Pillar of Observability

Your application is only as reliable as the infrastructure it runs on. Most commonly, that means Kubernetes is doing the job by managing fleets of containers, scaling services on demand, and keeping workloads distributed across nodes. Traditional dashboards weren’t built to scale with this reality. They give you snapshots of raw metrics. They don’t scale to multi-cluster environments. They don’t map relationships between resources.

Bridging the Gap: Legacy Systems and Modern Observability

Technology moves quickly and while the spotlight has shifted to dynamic, cloud-based systems, many organizations have legacy applications and infrastructure that they must maintain. In this fireside chat, Datadog’s Matt Moore (Principal Observability Strategist) will host James Flores (Enterprise Systems Engineer) at Australian Community Media to discuss their journey of modernization and bridging legacy systems with the cloud using a bit of ingenuity and observability.

Bringing Observability to Claude Code: OpenTelemetry in Action

AI coding assistants like Claude Code are becoming core parts of modern development workflows. But as with any powerful tool, the question quickly arises: how do we measure and monitor its usage? Without proper visibility, it’s hard to understand adoption, performance, and the real value Claude brings to engineering teams. For leaders and platform engineers, that lack of observability can mean flying blind when it comes to understanding ROI, productivity gains, or system reliability.

Actionable insights into the end-user experience: an overview of Grafana Cloud Frontend Observability dashboards

One of the biggest challenges in frontend development is identifying when and why users encounter performance issues, whether it’s slow page loads, JavaScript errors, or failed HTTP requests. With Grafana Cloud Frontend Observability — a hosted service for real user monitoring (RUM) — you get immediate, clear, and actionable insights into the end-user experience of your web applications.