This Month in Datadog - July 2025

In July’s episode of This Month in Datadog, we’re doing things differently by spotlighting the people behind the products you rely on. Jeremy is joined by Tristan Ratchford to discuss saving time and effort when you’re on call with Bits AI SRE, and by Kevin Hu to explore gaining visibility into datasets across the entire data lifecycle with Data Observability.

Out-of-the-box Alerting for Frontend Observability in Grafana Cloud

Get alerted on frontend issues the moment they happen, with no setup headaches required. In this short demo, Elliot Kirk from Grafana Labs introduces out-of-the-box alerting for frontend observability. Whether you're tracking error counts or web vitals, this new feature makes it easy to stay ahead of performance issues. With just a few clicks, you can enable prebuilt alerts for your apps; visualize and edit alerts directly in the UI; customize thresholds and durations; set up notifications and stay in the loop; and launch alerting with every new app setup.

Bring high-performance observability to secure Kubernetes environments with Datadog's new CSI driver

In Kubernetes environments, applications often communicate with the Datadog Agent to send telemetry data such as custom metrics via DogStatsD or traces through Datadog APM. How this communication takes place depends on the communication mode set on the Datadog Cluster Agent's Admission Controller. With the sockets option, communication takes place through local inter-process communication via Unix domain sockets (UDS), whereas the service and default hostip options rely on network communication.
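As a rough illustration of the two transport styles described above, the sketch below builds a DogStatsD datagram and opens either a Unix domain socket or a UDP socket. The function names and the mode strings are hypothetical, and the socket path and host are assumptions that depend on how the Agent is deployed; port 8125 is the usual DogStatsD UDP port.

```python
import socket


def format_dogstatsd(name, value, metric_type="c", tags=None):
    """Build a DogStatsD datagram of the form name:value|type|#tag1,tag2."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload.encode("utf-8")


def open_transport(mode, socket_path="/var/run/datadog/dsd.socket",
                   host="127.0.0.1", port=8125):
    """Return a connected datagram socket for the chosen mode.

    mode == "socket" uses local IPC over a Unix domain socket (UDS);
    any other value falls back to UDP over the network. The mode names,
    path, and host here are illustrative assumptions, not Agent settings.
    """
    if mode == "socket":
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        sock.connect(socket_path)
    else:
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.connect((host, port))
    return sock
```

An application would call open_transport once at startup and then send(format_dogstatsd("page.views", 1, "c", ["env:dev"])) for each metric; the datagram itself is the same for both transports.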

Integrating CI/CD Pipelines with Observability Tools

CI/CD pipelines are automated workflows that take code from development to production. The term encompasses two key practices: continuous integration and continuous delivery (or deployment). A typical CI/CD pipeline includes stages like code compilation, testing, security scanning, artifact creation, and deployment across multiple environments.
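To make the idea of staged pipelines concrete, here is a minimal sketch (not any particular CI system's API) that runs stages in order, stops at the first failure, and records a duration per stage, which is the kind of data an observability tool could ingest. The names run_pipeline and the stage labels are hypothetical.

```python
import time
from typing import Callable, Dict, List, Tuple

# A stage is a label plus a callable that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[Dict[str, object]]:
    """Run stages sequentially and stop after the first failing stage."""
    results = []
    for name, step in stages:
        started = time.perf_counter()
        ok = step()
        elapsed = time.perf_counter() - started
        # One record per executed stage; later stages are skipped on failure.
        results.append({"stage": name, "ok": ok, "seconds": elapsed})
        if not ok:
            break
    return results
```

For example, passing stages named "compile", "test", and "deploy" where "test" returns False yields records for "compile" and "test" only, so the failed stage and its timing are easy to spot.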

Why Observability Isn't Just for SREs (and How Devs Can Get Started)

Almost every other day, when I scroll past r/devops or r/sre, I see a post asking how a dev can get started with DevOps, observability, etc. This blog is an attempt to help anyone who is lost find their way into observability, and a wake-up call for devs: they should think about observability more actively today than ever before.

This Month in Datadog: Bits AI SRE, Datadog Data Observability, and more

Datadog is constantly elevating the approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. To learn more about Datadog and start a free 14-day trial, visit Cloud Monitoring as a Service | Datadog. This month, we chat with two guests about Bits AI SRE and Datadog Data Observability.

What is Grafana Cloud? Fully Managed Observability Built on Open Standards | Grafana Labs

Grafana Cloud helps teams detect, investigate, and resolve incidents faster—thanks to AI, open standards, and seamless integrations with OpenTelemetry, Prometheus, Salesforce, and more. See how it all works in this live demo of a simulated e-commerce outage.

Disposable Code Is Here to Stay, but Durable Code Is What Runs the World

Every day I seem to run into yet another post with someone solemnly opining that “writing code has never been the hardest part of software engineering.” And hey, that’s smashing. As an engineer from the ops/infra/SRE side of the house, I feel like I’ve been saying this my whole career. (Is there anything more satisfying than being proven right in public? Not in my book.) So, which is it?

Data Observability: Build confidence in the data life cycle

Datadog Data Observability provides a complete solution with quality checks (e.g., volume, row changes, freshness), custom SQL-based monitors, anomaly detection, column-level lineage across systems like Snowflake and Tableau, full pipeline visibility, and targeted alerts when data issues arise.

How to monitor and manage front-end observability in Blackfire

In this video, we'll guide you through the process of monitoring and managing your usage of front-end observability features in Blackfire. Learn how to access your Browser usage dashboard to view browser traces collected per environment, track your quota consumption, and understand the concept of spike protection. You'll discover how Blackfire's automatic detection of abnormal traffic spikes protects your monthly quota and ensures continuous data collection.

How to Enable and Configure Front-end Observability in Blackfire

In this video, learn how to enable and configure Front-end Observability in Blackfire. The tutorial covers steps to enable features across multiple environments via the Organization settings / Front-end usage in the Blackfire dashboard. Control front-end observability by enabling or disabling Browser Monitoring and Analytics per environment, using a JavaScript probe and a unique browser key. The video emphasizes the importance of naming transactions and explains how to manually add tracking snippets to HTML for better control.

Unifying Observability: Intelligence, Automation, and Insights in Action

As enterprise IT environments evolve into ever-greater complexity and scale, demands on operations teams are accelerating. In the traditional model, observability tools collect data, engineers manually correlate events, and remediation follows a ticketing trail. However, that approach no longer matches the speed and scale of today’s digital businesses. Even the most storied dashboards can’t address today’s operational needs.

How I Use GenAI as a Thought Partner, Not a Shortcut

You don’t need to be a power user to get powerful results. I’m not training models or prompting GPTs into poetry—I’m just using them to do what great managers already try to do: communicate clearly, prioritize outcomes, and lead with intention. Over the last few quarters, I’ve built a handful of custom GPTs to support my weekly, monthly, and quarterly workflows.

Why continuous profiling is the fourth pillar of observability

Developers have long used profilers to diagnose performance bottlenecks and improve the efficiency of their code. But a modern version of profiling, continuous profiling, is quietly redefining what profiling is and what it can do. By running nonstop in production with very low overhead, continuous profilers give teams always-on visibility into how their code behaves in the real world.

Observability Data: Ingestion Pipeline Best Practices

Great data is a prerequisite to all things AIOps and observability. Great observability data results in fewer observability gaps, better analysis and insights, and more confidence within teams that rely on the power of modern AIOps and observability technologies. Goals for improved automation, IT efficiencies, intelligent triage and remediation all become more achievable with better data.

Tutorial: Visualize Your Puppet Data in Grafana with the Observability Data Connector

When you manage complex IT infrastructure, it becomes critical to use tooling to understand what’s happening across all of your systems in terms of performance, reliability, and compliance. Monitoring key indicators manually is simply no longer possible at that scale. Puppet has long been known as a solution for managing large environments and collecting a vast amount of data about your infrastructure, but accessing and visualizing that data in a meaningful way can be a challenge.

AWS Summit NYC 2025: Laser-Focused on AI

If you’re unfamiliar with AWS Summits, these are conferences that occur on a yearly basis in different cities. The events are mostly used to announce new products and technologies. This year, the theme was AI, as evidenced by the keynote, a large majority of the talks, and a walk around the vendor floor. The keynote talk was hosted by Swami Sivasubramanian, VP of Agentic AI at AWS.

How SAP achieved world-class uptime through modern observability

SAP Customer Experience (CX) has undergone a remarkable transformation over recent years, evolving from fragmented monitoring to a scalable, automated observability powerhouse. In a recent fireside chat, Martin Norato Auer, SAP CX’s VP of Observability, shed light on the strategies, practices, and measurable impacts behind SAP’s SLA, uptime, and responsiveness achievements.

Architecting for Value: A Playbook for Sustainable Observability

You’ve built something amazing. Your services are scaling, your users are happy, and your team is shipping code like never before. Then the cloud bill arrives, and one line item makes your eyes water: observability. That Datadog invoice feels less like a utility bill and more like a ransom note. It’s a modern engineering paradox. The tools that give you sight into your complex systems are the same ones that can blind you with runaway costs.

Ship Confluent Cloud Observability in Minutes

You're running Kafka on Confluent Cloud. You care about lag, throughput, retries, and replication. But where do you see those metrics? Confluent gives you metrics, sure, but not all in one place. Some live behind a metrics API, others behind Connect clusters or Schema Registries. You either wire them manually or give up. What if you could stream those metrics to a platform built for high-frequency, high-cardinality time series, and do it in minutes?

How to Cut Observability Costs with Synthetic Monitoring and Responsive Pipelines

Platform teams are struggling with observability noise, bloated storage costs, and lack of clarity during incidents. Most teams capture everything all the time, leading to expensive, overwhelming, and often unnecessary data volumes. In Telemetry for Modern Apps, Mezmo teamed up with Checkly to demonstrate how synthetic monitoring triggers and responsive telemetry pipelines can help reduce costs while maintaining the context needed during incidents.

Streamlining multi-cloud complexity with unified observability

A wave of businesses are embracing multi-cloud strategies to gain flexibility and scalability. By combining on-premises infrastructure, private clouds, and public platforms like AWS, Azure, and Google Cloud Platform (GCP), IT teams can experiment, deploy, transform, and improve their IT applications significantly. On the downside, this modern IT approach of employing multiple clouds (in both public and private forms) also brings significant complexity, making it challenging to monitor systems, control costs, and secure environments. There are just too many threads to track and tie together to ensure a taut IT fabric.

Will AI Speed Development in Your Legacy App?

Some people can get an AI assistant to write a day’s worth of useful code in ten minutes. Others among us can only watch it crank out hundreds of lines of crap that never works. What’s the difference? There are some skills specific to AI development. There are also properties of the codebase we’re working in that make it amenable to AI assistance. Most AI demos use projects created from scratch with AI in mind—cute.

I built an MCP Server for Observability. This is my Unhyped Take

Recently, I read a blog titled “It’s The End Of Observability As We Know It (And I Feel Fine)”, which discussed MCP servers in observability and how these systems would potentially be the “end of observability”. As someone who has spun up an MCP server for an observability backend and as someone who has been in the space for a while, I certainly do not think so.

Cloud or Self-Hosted - Which Deployment Model is Right For You?

Choosing the right observability platform is a critical decision. But how you deploy it is just as important. The right deployment strategy can accelerate your team, simplify operations, and ensure you meet compliance and security requirements. The wrong one can lead to operational headaches and slow you down. At SigNoz, we believe in flexibility. There is no single "best" way to deploy an observability platform; there's only the way that's best for you.

Honeycomb Named a Visionary in the 2025 Gartner Magic Quadrant for Observability Platforms

In the era of AI, software development is at an inflection point, and observability has never been more critical. Teams are dealing with more code, more data, and more pressure than ever before. To navigate these new challenges, you need a partner with a strong vision for the future and a knack for looking around corners. Honeycomb is proud to be named a Visionary in the 2025 Gartner Magic Quadrant for Observability Platforms.

Honeycomb In Your IDE? Yes, With Hosted MCP Now Available in AWS Marketplace AI Agents and Tools Category

I’m pleased to announce the public beta of Honeycomb Hosted MCP, along with our first wave of one-click integrations for Cursor, Visual Studio Code, and Claude Desktop. We’re also very excited to announce that Hosted MCP is available on AWS AI Agents marketplace and for all Honeycomb plans (including our free plan!) at no charge. Honeycomb was built with a singular focus: how do we help teams become better at the art and craft of software development, delivery, and operations?

ITRS named in Gartner Magic Quadrant for Observability Platforms

When Uptrends became part of ITRS, we knew we were joining a team deeply committed to innovation, precision, and people — whether those people were troubleshooting transaction journeys from their laptops at 8am or keeping enterprise-scale operations online 24x7. We’ve come far since then.

ScienceLogic Named a Visionary in the 2025 Gartner Magic Quadrant for Observability Platforms

It’s official: ScienceLogic has entered the observability arena. Named a Visionary in the 2025 Gartner Magic Quadrant for Observability Platforms, we believe we’re helping define where observability is heading, not just where it’s been. This marks our first inclusion in this Magic Quadrant and, in our opinion, validates our mission to redefine intelligent, actionable observability in the era of AI and automation.

Kubernetes Monitoring backend 2.2: better cluster observability through new alert and recording rules

We’re excited to announce that version 2.2.0 of the backend for our Kubernetes Monitoring solution in Grafana Cloud is now available. The app’s backend is supported by kubernetes-mixin, an open source Prometheus Monitoring Mixin, and this latest version features significant improvements to alert rules and recording rules that will enhance your cluster observability and monitoring experience. There’s a lot to tell you about, so let’s dive in.

Monitor agents built on Amazon Bedrock with Datadog LLM Observability

As large language models (LLMs) grow more powerful, organizations are deploying agentic AI applications to tackle complex, multi-step tasks. With Amazon Bedrock Agents, developers can orchestrate these agents to manage tasks such as triggering serverless functions, calling APIs, accessing knowledge bases, and maintaining contextual conversations—all while breaking down complex user requests or tasks into manageable steps.

How to Troubleshoot Outages Faster Using Elastic Observability [2 Min Live Demo]

In this video, I’ll show you how Elastic Observability helps you reduce downtime, accelerate root cause analysis, and unify logs, metrics, and traces in one powerful dashboard. With native OpenTelemetry support, AI-powered troubleshooting, and built-in anomaly detection, you can streamline your workflows and boost service reliability.

Arie's Adventures with Coroot

Arie van den Heuvel is an engineer, a System and Application Management Specialist, and a valued member of our community. Below he has shared his journey using Coroot, and how it has helped improve observability for his team. You can read more of Arie’s writing and support the resource articles he has created for open source on his blog.

Splunk Named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms

We are proud to announce that Splunk has been named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms for the third year in a row. In our opinion, our recognition in the Observability category comes on the heels of Splunk being recognized for a tenth consecutive time as a Leader in the 2024 Gartner Magic Quadrant for Security Information and Event Management (SIEM). Splunk was the only vendor named a Leader in both SIEM and Observability for the Gartner Magic Quadrant three times.

Climbing the Security Pyramid: From Awareness to Automation with AI and Observability

Modern threats don't wait. They move fast, hide deep, and often strike without warning. That's why old-school security isn't enough anymore. You need more than firewalls and login rules. You need layers. You need clarity. And most of all, you need speed. This is where the security pyramid comes in. It shows how smart security stacks from the ground up. It starts with awareness and ends with advanced tools like automation and AI. In this article, we'll break it down step by step and show how observability and automation help you climb it.

The Fast Path to More Useful Telemetry

Over and over, we’ve seen that teams who invest in adding rich, relevant context to their telemetry end up debugging faster and collaborating more effectively during incidents. Getting meaningful context added can feel like a big cross-team project, but some of the highest-leverage improvements don’t require app code changes or coordination across services.

Observability as Code: Why You Should Use OaC

In the fast-moving world of CI/CD pipelines, microservice architectures, and container orchestration, software changes rapidly. What exists in a codebase today might be gone next week. At this scale and speed, it’s impossible for development teams to manually track every line of code and every new piece of functionality.

Uptrace v2.0: The Future of Observability is Here

The Uptrace team is thrilled to announce the release of v2.0—our biggest update yet! This release represents a complete reimagining of how observability data should be stored, queried, and managed. With multi-project support, revolutionary JSON-based storage, powerful data transformations, and a host of developer-friendly features, Uptrace v2.0 is designed to scale with your growing infrastructure needs.

What Is Hybrid Observability? A Healthcare IT Explainer

Healthcare IT environments have become incredibly complex. Think about everything running simultaneously in your organization: physical medical devices, cloud platforms, clinical applications like Epic, and patient-facing applications. Each component needs to work together seamlessly, much like how ICU monitors track multiple vital signs at once. Many healthcare organizations still use monitoring solutions designed for simpler times, when systems were more isolated.

Grafana Labs named a Leader again in the 2025 Gartner Magic Quadrant for Observability Platforms

We’re thrilled to share that Grafana Labs has been recognized as a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms—for the second year in a row. This year’s report placed Grafana Labs furthest in “Completeness of Vision,” which we believe reflects our deep commitment to building a truly open, composable observability stack that gives users flexibility, control, and the tools to own their observability strategy.

Elastic named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms

Observability has an investigation problem, and dashboards and alerts aren’t enough for solving problems in today’s complex systems. AI-driven capabilities, powerful analytics, and the ability to scale are essential to drive real-time investigations while keeping costs low. We think this is why Elastic has been named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms for the second time.

How to improve your observability

Coroot was designed to solve the problem of time-consuming root cause analysis. It handles the full observability journey, from collecting telemetry automatically with zero code setup (thanks, eBPF!) to simplifying the role of SREs and DevOps everywhere with instant root cause analysis powered by AI. We also strongly believe that simple observability should be an innovation everyone can afford to benefit from, which is why our software is open source!

Datadog named Leader in 2025 Gartner Magic Quadrant for Observability Platforms

We are thrilled to announce that, for the fifth consecutive year, Datadog has been named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms. We believe that this recognition reflects our continued focus on helping customers observe, secure, and act on everything that matters across their technology stack.

What Are Traces? A Developer's Guide to Distributed Tracing

One of the most common challenges in modern software engineering today is understanding how requests flow through applications. As system architectures shift to favor widely distributed, cloud-native designs, keeping track of how an application processes user actions is more difficult than ever. A single user action may trigger events processed in dozens of backend services. Traces are helping software developers today with this challenge.
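A toy model can show how a trace ties together work done in several services. In the sketch below, every span shares the trace ID of the request that started it and points to its parent span. The Span class and helper names are hypothetical and far simpler than a real tracing SDK.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One unit of work in one service, linked to its trace and parent."""
    trace_id: str
    name: str
    service: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])


def child_of(parent: Span, name: str, service: str) -> Span:
    """Create a span that inherits the parent's trace ID."""
    return Span(trace_id=parent.trace_id, name=name, service=service,
                parent_id=parent.span_id)


def services_in_trace(spans: List[Span], trace_id: str) -> List[str]:
    """List the distinct services that handled a given trace, sorted by name."""
    return sorted({s.service for s in spans if s.trace_id == trace_id})
```

Starting from a root span in a "frontend" service and creating children in "payments" and "inventory", services_in_trace returns all three service names for that trace ID, which is the basic question a trace view answers.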

The Inconvenient Truth About AI Ethics in Observability

Let's be honest: most conversations about AI ethics sound like they're happening in a boardroom, not an ops room. But here's the thing: when you're using AI to make sense of your telemetry data, ethics isn't some abstract concept. It's the difference between insights you can trust and algorithmic noise that leads you down the wrong path. The uncomfortable reality? Your AI is only as ethical as the messiest, most biased piece of telemetry data you feed it. And if you think your data is clean, well...

Grafana Labs is a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms

For the second year in a row, Grafana Labs has been named a Leader in the Gartner Magic Quadrant for Observability Platforms — and this year, we’re proud to be recognized as the furthest in Completeness of Vision. In this video, Grafana Labs CTO Tom Wilkie shares what this recognition means, why our scores for execution and vision both improved, and how it reflects years of building a truly open, composable observability stack.

Coralogix | Magic Quadrant 2025

Today marks an exciting moment for all of us at Coralogix. We’re proud to share that Gartner has named us a Visionary in the 2025 Magic Quadrant for Observability Platforms. This recognition, we believe, reflects what we’ve been building toward for years: an observability platform that delivers scale, cost-efficiency, AI-powered insights, and tangible customer success.

Honeycomb Users Are Living in the Future, Part 1: Sampling

When we talk to new Honeycomb users, a few things stand out as sounding downright magical. Sometimes we’ll hear, “Wow, is that a new feature?” and we’ll say that no, it’s been like that for years. Clearly we need to get the word out! This is the first installment of a blog series I’ll be writing, covering areas of Honeycomb that elicit reactions of awe and disbelief from new users.

Lumigo Launches AI Agent Observability

LLM-powered agents are reshaping software, but when they fail, troubleshooting is guesswork. Lumigo’s new AI Agent Observability, now in beta, gives you visibility into the entire lifecycle of your agents, from prompt to response to internal decision logic. Built for modern AI workloads, this feature is designed to help engineers monitor, debug, and optimize agents running on platforms like OpenAI, Anthropic, and open-source models.

Observability for containerized workloads: How to run Grafana Beyla as a sidecar in Amazon ECS

Note: Grafana Beyla has been donated to OpenTelemetry under the new project name OpenTelemetry eBPF Instrumentation. Beyla will continue to exist as Grafana Labs’ distribution of the upstream project. Grafana Beyla is an open source eBPF-based auto-instrumentation tool that helps you easily get started with application observability, allowing you to monitor and visualize traces without modifying the application code.

Observability in under 5 seconds: Reflecting on a year of grafana/otel-lgtm

With grafana/otel-lgtm, observability is just one Docker command away. Over the past year, grafana/otel-lgtm has simplified observability setups, helping developers get a complete OpenTelemetry stack running in under five seconds. With integrations for metrics, logs, traces, and now profiles via Grafana Pyroscope, it has become a go-to solution for demos, development, and testing, as evidenced by its growing community (1k stars on GitHub and growing!) and notable adopters.

Monitoring & Observability Report Top Findings

Today, BigPanda released our first-ever research report based on data gathered from our agentic IT operations platform. Our Monitoring and Observability Tool Effectiveness for IT Event Management report provides insights and benchmarks on incident detection and noise reduction for 130 enterprise organizations, including the monitoring and observability data sources integrated with BigPanda.

How to Simplify AI Observability Across Hybrid and Cloud Environments

As companies adopt more artificial intelligence (AI) to stay competitive and simplify operations, they’re hitting a snag they’ve seen plenty of times before: complexity. Those user-friendly chatbots and impressive predictive models aren’t magic—they run on powerful GPUs like NVIDIA’s and rely on cloud services such as Azure OpenAI or Amazon SageMaker.

Observability isn't about the tool. It's about the truth

An enterprise client reports latency. Your dashboards say everything is fine. They blame you. You blame them. Nobody can prove it either way. This is where most monitoring efforts hit a wall. Too often, the conversation gets stuck on dashboards and tools instead of the one thing that really matters: truth. Observability isn’t about collecting metrics or building pretty dashboards.

LangChain Observability: From Zero to Production in 10 Minutes

LangChain apps are powerful, but they’re not easy to monitor. A single request might pass through an LLM, a vector store, external APIs, and a custom chain of tools. And when something slows down or silently fails, debugging is often guesswork. In one instance, a developer ended up with an unexpected $30,000 OpenAI bill, with no visibility into what triggered it. This blog shows how to avoid that using OpenTelemetry and LangSmith. With this setup, you’ll be able to.

Netdata: The Fastest Path to Full Stack Observability. AI Powered.

Netdata is a real-time, high-performance, on-premises observability platform designed to monitor metrics and logs with unparalleled efficiency. Netdata requires zero configuration to get started, and provides alerts, anomaly detection, and AI-assisted troubleshooting out of the box, providing a powerful and comprehensive infrastructure monitoring experience. Netdata is known for its distributed design. Instead of funneling all data into a few central databases like most traditional monitoring solutions, Netdata processes data at the edge, keeping it close to the source.

MCP Observability with OpenTelemetry

2025 has truly been the year of Agentic AI, with MCP (Model Context Protocol) emerging as one of its flashiest and most talked-about innovations. While many products have seamlessly integrated MCP servers into their systems, these servers are increasingly being labelled as black boxes, opaque components that handle critical tasks but offer little visibility into what's happening under the hood. We prompt an agent, a tool gets invoked, and a response is generated. But what really happens in between?

Can Claude Code Observe Its Own Code?

One of the great things about OpenTelemetry is that it’s a standard, and standards tend to proliferate. I was excited to see Claude Code add OpenTelemetry metric and log support in a recent release. What was really interesting—beyond the ability to capture usage data from Claude Code—is that you can also get pretty detailed logs about what you’re doing with Claude Code.

Why GovRAMP-authorized observability matters for state, local, and education IT teams

Building on our FedRAMP Moderate authorization and our “In Process” status for FedRAMP High, Datadog for Government is now "In Process" for GovRAMP High Authorization, giving agencies a unified observability platform that meets the toughest public-sector security bars.