
Built for Engineers: Datadog's Vision for the Future

Datadog was built by engineers, for engineers. Datadog Co-founder & CEO Olivier Pomel opened the keynote with a clear message: observability, security, and AI are converging. From infrastructure to AI agents, the future of engineering requires one unified platform. Catch all the product announcements on our YouTube channel to see what's next in observability and security!

How we've created a successful FinOps practice at Datadog

When you adopt FinOps to maximize the value of your cloud spending, you may have some simple first steps you can take to gain cost efficiency. For example, you can find and delete any unused resources to quickly realize a one-time optimization. But the ongoing work to manage cloud costs becomes complex as your organization grows, your infrastructure spans multiple clouds, and you can't easily see the full value of your cloud spending by tracking only the bottom line.
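One of those simple first steps can be sketched in a few lines. The inventory format and field names below are hypothetical, purely for illustration; in practice this data would come from your cloud provider's APIs or a cost-management tool.

```python
# Sketch: flag unused resources (here, unattached volumes) for a
# one-time cost optimization. Inventory schema is hypothetical.

def find_unused_volumes(inventory):
    """Return (id, monthly_cost) for volumes with no attachments."""
    return [
        (v["id"], v["monthly_cost"])
        for v in inventory
        if v["type"] == "volume" and not v.get("attachments")
    ]

inventory = [
    {"id": "vol-1", "type": "volume", "attachments": ["i-123"], "monthly_cost": 8.0},
    {"id": "vol-2", "type": "volume", "attachments": [], "monthly_cost": 12.5},
    {"id": "vol-3", "type": "volume", "monthly_cost": 4.0},
]

unused = find_unused_volumes(inventory)
savings = sum(cost for _, cost in unused)
print(unused)   # [('vol-2', 12.5), ('vol-3', 4.0)]
print(savings)  # 16.5
```

A one-off sweep like this captures the quick win; the ongoing, multi-cloud work the post describes needs continuous attribution rather than ad hoc scripts.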

Route your monitor alerts with Datadog monitor notification rules

As organizations scale their infrastructure, monitoring systems can become a source of noise rather than insight. A clean, straightforward set of alerts for a handful of services can quickly spiral into a mess of overlapping thresholds, redundant triggers, and inconsequential notifications across hundreds (or thousands) of components. This flood of notifications can slow response times, overwhelm engineers, and increase the chance of overlooking critical problems.
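The core idea behind tag-based routing can be sketched as a first-match rule table. The rule shape and channel names below are hypothetical, not Datadog's actual configuration format.

```python
# Sketch: first-match routing of monitor alerts to channels by tag.
# Rule format and channel names are illustrative assumptions.

def route_alert(alert_tags, rules, fallback="#alerts-triage"):
    """Return the channel of the first rule whose tags all match the alert."""
    for rule in rules:
        if all(alert_tags.get(k) == v for k, v in rule["match"].items()):
            return rule["channel"]
    return fallback

rules = [
    {"match": {"team": "payments", "severity": "critical"}, "channel": "#payments-oncall"},
    {"match": {"team": "payments"}, "channel": "#payments-alerts"},
]

print(route_alert({"team": "payments", "severity": "critical"}, rules))  # #payments-oncall
print(route_alert({"team": "payments", "severity": "warn"}, rules))      # #payments-alerts
print(route_alert({"team": "search"}, rules))                            # #alerts-triage
```

Ordering rules from most to least specific, with a catch-all fallback, is what keeps noisy defaults from drowning out the notifications that matter.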

Improve SLO accuracy and performance with Datadog Synthetic Monitoring

SLOs are key for improving user satisfaction, prioritizing engineering projects, and measuring overall performance. Given the important role that SLOs play in determining organizational benchmarks, teams need to ensure that SLO metrics—also called service level indicators (SLIs)—are reported accurately and maintained consistently within an acceptable range.
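The arithmetic behind an SLI and its error budget is simple enough to sketch. The numbers below are made up for illustration; a real SLI would be fed by monitoring data such as synthetic test results.

```python
# Sketch: compute an availability SLI and the remaining error budget
# for an SLO target. Inputs are illustrative.

def sli(good_events, total_events):
    """Fraction of good events (e.g., successful synthetic test runs)."""
    return good_events / total_events

def error_budget_remaining(good_events, total_events, slo_target):
    """Unspent share of the error budget: 1.0 = untouched, negative = blown."""
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return (allowed_bad - actual_bad) / allowed_bad

# A 99.9% SLO over 1,000,000 requests allows 1,000 failures.
print(sli(999_600, 1_000_000))                            # 0.9996
print(error_budget_remaining(999_600, 1_000_000, 0.999))  # ~0.6
```

With 400 of 1,000 allowed failures spent, roughly 60% of the budget remains, which is exactly the kind of signal that drives the prioritization decisions the post mentions.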

Trace Distributed Map states for AWS Step Functions with Datadog

AWS Step Functions offers the Distributed Map state, enabling you to coordinate massively parallel workloads within your serverless applications. With this feature, a single Step Functions execution can fan out into up to 10,000 parallel workflows simultaneously, making it possible to efficiently process millions of items in parallel. This capability unlocks new possibilities for large-scale data processing, such as image transformation, log ingestion, or batch analytics.
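A Distributed Map state in Amazon States Language looks roughly like the sketch below, built here as a Python dict. The bucket name, prefix, Lambda ARN, and state names are placeholders, not values from the post.

```python
import json

# Sketch of a Step Functions Distributed Map state: read items from S3,
# fan each out to a child workflow. All resource names are placeholders.
distributed_map_state = {
    "Type": "Map",
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {"Bucket": "example-bucket", "Prefix": "images/"},
    },
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ProcessItem",
        "States": {
            "ProcessItem": {
                "Type": "Task",
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-item",
                "End": True,
            }
        },
    },
    "MaxConcurrency": 10000,  # up to 10,000 parallel child workflows
    "End": True,
}

print(json.dumps(distributed_map_state, indent=2))
```

The `"Mode": "DISTRIBUTED"` setting is what distinguishes this from an inline Map state, and `MaxConcurrency` caps the fan-out described above.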

How Cursor scaled infrastructure rapidly and reliably using Datadog

At Datadog, we use Cursor to empower our teams to build more quickly. And we know that building and troubleshooting with AI tools like Cursor is done best with the right observability data and context. Discover how Cursor was able to rapidly and reliably scale their infrastructure 100x using Datadog to meet the needs of a fast-growing user base. And learn more about how we're bringing Datadog tools and context to your favorite AI IDEs and agents with our MCP Server and extensions.

Stay Compliant: Meet Your Audit Needs with Datadog!

Datadog's internal compliance team has built audit workflows and control monitoring capabilities using the Datadog platform. We actively use these capabilities to scale our audit programs and meet the requirements of multiple compliance frameworks. This session will go into the details of how we addressed our compliance use cases using the Datadog platform and how our customers can get started.

AI-Augmented Control Plane: Scaling IT Operations with Intelligent Automation

How do you enable a team of 100 engineers to effectively support 300+ critical applications across five hosting platforms? At Thomson Reuters, we turned to AI: not as a buzzword, but as a genuine force multiplier. Experience our journey of transforming traditional IT operations into an AI-augmented powerhouse, where Datadog, ServiceNow, and custom AI solutions work in harmony to create a next-generation control plane. We'll share real victories, honest challenges, and practical insights from our mission to build a more intelligent operational framework.

LLM Observability for Reliability and Stability: A Monitoring Strategy for Phone Communication

LLM APIs offer groundbreaking potential, but also present challenges such as response latency, hallucinations, and service instability. In Japan, where telephone communication remains crucial for business, these issues present significant barriers to the introduction of LLM-based applications. Despite being a relatively young startup, we have developed and deployed an LLM-based telephone service that has handled over 40 million calls.

Datadog + OpenAI: Codex CLI integration for AI-assisted DevOps

We are exploring how we can help on-call engineers troubleshoot incidents more effectively by providing the OpenAI Codex agent with access to real-time observability data in terminals. We've developed an integration and new tool visualizations that connect OpenAI's Codex CLI to the new Datadog MCP server. In this post, we'll share what we've been experimenting with: enabling an AI agent to retrieve production metrics, logs, and incidents from Datadog in real time and act on that context.

Optimize and troubleshoot AI infrastructure with Datadog GPU Monitoring

As organizations bring more AI and LLM workloads into production, the underlying GPU infrastructure becomes even more critical to keeping those workloads fast, reliable, and scalable. Inefficient GPU resource usage, for instance, can lead to longer runtimes and reduced throughput, negatively impacting overall model performance. Additionally, idle and underutilized GPUs can quickly drive up costs and lead to needless spending.

Datadog MCP Server: Connect your AI agents to Datadog tools and context

As development teams adopt AI-powered tools and build services that make use of AI agents, they want to extend their AI capabilities to incorporate familiar tools and observability data. However, AI agents struggle with regular API endpoints: they frequently fail to parse complex nested JSON hierarchies or to handle errors correctly. As a result, these agents often fail to retrieve relevant results.
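To see why nested JSON is hard on agents, consider the common mitigation of flattening a response into dotted key/value pairs. This sketch is purely illustrative of the problem space; it is not how the Datadog MCP server is implemented, and the sample response shape is invented.

```python
# Sketch: flatten a nested API response into dotted key/value pairs,
# a common way to make structured data easier for an agent to consume.

def flatten(obj, prefix=""):
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

response = {"series": [{"metric": "system.cpu.user", "points": [[1700000000, 42.0]]}]}
print(flatten(response))
# {'series.0.metric': 'system.cpu.user',
#  'series.0.points.0.0': 1700000000, 'series.0.points.0.1': 42.0}
```

An MCP server goes further than flattening: it exposes purpose-built tools with typed inputs and outputs, so the agent never has to reverse-engineer a raw endpoint at all.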

DASH by Datadog 2025 Keynote

Attend the 2025 DASH Keynote and be the first to experience Datadog's latest product innovations. This year, we're unveiling next-generation observability features, innovative ways to secure your AI workloads, and powerful agentic AI capabilities throughout the Datadog platform. Discover the new ways your teams can observe, secure, and act in the age of AI.

Automatically identify issues and generate fixes with the Bits AI Dev Agent

Developers lose hours each week to a familiar troubleshooting loop: chase down telemetry across dashboards, decipher vague errors, and juggle alerts to find the signal worth fixing. Production issues, performance regressions, and security vulnerabilities all demand attention, but they often come with little context for taking action.

Improve performance and reliability with Proactive App Recommendations

As your organization grows, you may operate in increasingly complex environments and manage more services and larger teams to maintain them. Evolution like this can lead to an explosion of telemetry data from across your stack, including metrics, traces, logs, and frontend interactions. The benefit of greater visibility is often outweighed by the challenge of acting on the data you collect, and you can easily fall behind on implementing the fixes your services require to operate reliably and efficiently.

Ensure trust across the entire data life cycle with Datadog Data Observability

As data systems grow more complex and data becomes even more business-critical, teams struggle to detect and resolve issues that impact data quality, reliability, and, ultimately, trust. Engineers have to rely on manual checks and ad hoc SQL queries to catch data quality issues—often after teams relying on the data have noticed something has gone wrong.

Accelerate Oracle Cloud Infrastructure monitoring with Datadog OCI QuickStart

Datadog’s Oracle Cloud Infrastructure integration enables you to collect metrics and logs from your entire OCI stack and monitor them within a single platform alongside other third-party technologies. Datadog’s new OCI QuickStart is a fully managed, single-flow setup experience that helps you monitor your OCI infrastructure and applications in just a few clicks.

Create and monitor LLM experiments with Datadog

To efficiently optimize your LLM application before pushing to production, you need a comprehensive testing and evaluation framework. By running experiments, you can optimize prompts, fine-tune temperature and other key parameters, test complex agent architectures, and understand how your application may respond to atypical, complex, or adversarial inputs. However, it can be difficult to manage your experiment runs and aggregate the results for meaningful analysis.
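The shape of such an experiment loop can be sketched in a few lines. Here `call_model` and `score` are stand-ins invented for illustration; a real setup would call an LLM API and use a proper evaluator (exact-match checks, an LLM judge, etc.).

```python
import statistics

# Sketch: run the same prompt set across parameter variants and
# aggregate evaluation scores per variant. All functions are stand-ins.

def call_model(prompt, temperature):
    # Placeholder: a real implementation would call an LLM API here.
    return f"answer to {prompt!r} at t={temperature}"

def score(output):
    # Placeholder evaluator returning a value in [0, 1).
    return (len(output) % 10) / 10

def run_experiment(prompts, temperatures):
    results = {}
    for t in temperatures:
        scores = [score(call_model(p, t)) for p in prompts]
        results[t] = statistics.mean(scores)
    return results

print(run_experiment(["What is 2+2?", "Summarize FinOps."], [0.0, 0.7]))
```

Even a loop this small generates many runs to track, which is the management and aggregation problem the post is pointing at.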

Introducing Bits AI SRE, your AI on-call teammate

Getting paged pulls engineers away from meaningful work, yet incident response in many organizations remains manual, reactive, and draining. An alert fires and teams scramble to find the root cause, relying on siloed knowledge, incomplete context, and a few on-call experts who are already stretched thin. The rise of AI coding agents has only intensified this challenge: As teams ship code faster with less human oversight, production systems grow increasingly complex and harder to understand.

Migrate historical logs from Splunk and Elasticsearch using Observability Pipelines

Migrating to a new logging platform can be a complex operation, especially when it involves both active and historical logs. Observability Pipelines offers dual-shipping capability, making it easy to route active logs to your new platform without disrupting your log management workflows. But migrating years' worth of historical logs—which are critical for investigating security incidents and demonstrating compliance with applicable laws—requires a different approach.
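The dual-shipping idea itself is a simple fan-out: every active log event goes to both the legacy and the new destination. The sink classes below are hypothetical in-memory stand-ins for pipeline outputs, purely to illustrate the pattern.

```python
# Sketch: dual-shipping active logs to two destinations during a migration.
# MemorySink is a hypothetical stand-in for a real pipeline output.

class MemorySink:
    def __init__(self, name):
        self.name = name
        self.received = []

    def write(self, event):
        self.received.append(event)

def dual_ship(events, sinks):
    """Deliver every event to all configured sinks, so neither side misses data."""
    for event in events:
        for sink in sinks:
            sink.write(event)

legacy, new = MemorySink("splunk"), MemorySink("datadog")
dual_ship([{"msg": "login ok"}, {"msg": "disk full"}], [legacy, new])
print(len(legacy.received), len(new.received))  # 2 2
```

Fan-out covers the active stream; historical logs, as the post notes, need a separate backfill path rather than live routing.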

Create rich, up-to-date visualizations of your AWS infrastructure with Cloudcraft in Datadog

As your cloud environment grows more complex and dynamic, it becomes more difficult to maintain up-to-date reference diagrams of its components that are available to all teams. As a result, teams often end up lacking the visibility they need to understand, manage, and troubleshoot their cloud infrastructure and applications.

Announcing Go tracer v2.0.0

Datadog has long supported the monitoring of instrumented Go applications through our Go tracer v1. As the Go ecosystem has continued to mature, we've been hard at work collecting feedback and improving the tracer's capabilities and usability. We are now thrilled to announce the release of our Go tracer v2.0.0. This major update brings improved security and stability, along with a new, simplified API.

Monitor OpenTelemetry-native metrics with Datadog

OpenTelemetry (OTel) is emerging as the industry standard for collecting and transmitting observability data. Datadog supports several ways to send and accept OTel-native data, while also continuing to support its own native telemetry format. To provide a consistent monitoring experience, Datadog now supports using OTel-native metrics alongside Datadog-native metrics across dashboards, queries, and core visualizations in the Datadog platform.

Best practices for end-to-end custom metrics governance

Custom metrics enable you to track what matters to your distinct business and services and correlate it with the rest of your telemetry data. As your organization grows by adding more teams, services, and environments, your volume of custom metrics can grow with it. To ensure critical visibility while maintaining cost efficiency, organizations need an end-to-end approach to custom metrics governance.
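One governance fundamental worth making concrete: each unique combination of tag values on a metric emits a distinct timeseries, so cardinality grows multiplicatively with tags. The tag names and values below are invented for illustration.

```python
# Sketch: estimate worst-case custom metric cardinality from the number
# of possible values per tag. Tags and values are illustrative.

def max_timeseries(tag_values):
    """Upper bound on timeseries for one metric: the product of tag value counts."""
    count = 1
    for values in tag_values.values():
        count *= len(values)
    return count

tags = {
    "env": ["prod", "staging"],
    "service": ["checkout", "search", "auth"],
    "region": ["us-east-1", "eu-west-1"],
}
print(max_timeseries(tags))  # 2 * 3 * 2 = 12
```

Adding a single high-cardinality tag (a user ID, say) multiplies this bound by thousands, which is why governance reviews tend to focus on which tags a metric actually needs.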

Introducing RUM without Limits: Capture everything, keep what matters

Real User Monitoring (RUM) helps teams understand exactly how their users experience their web and mobile applications—from load times to crashes and frustration signals. But traditional RUM models come with tough trade-offs: capture all sessions and overspend, or sample data and miss what matters. Fixed sampling rates may help manage volume, but they leave dangerous blind spots.
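The blind-spot problem with fixed sampling is easy to demonstrate with a small simulation. The session counts and 10% rate below are invented for illustration; this is a sketch of the trade-off, not of how RUM without Limits works internally.

```python
import random

# Sketch: fixed-rate sampling vs. retaining every session that matters.
# With a 10% sample, a rare error seen in 5 of 10,000 sessions can be
# missed entirely; keeping all error sessions preserves every one.

random.seed(0)  # fixed seed so the simulation is reproducible
sessions = [{"id": i, "error": i < 5} for i in range(10_000)]  # 5 rare error sessions

sampled = [s for s in sessions if random.random() < 0.10]  # fixed 10% sampling
errors_in_sample = sum(s["error"] for s in sampled)

kept = [s for s in sessions if s["error"]]  # always retain error sessions
print(errors_in_sample, "of 5 error sessions survive a 10% sample")
print(len(kept), "of 5 survive when error sessions are always retained")
```

Capturing everything and deciding retention afterward means the filter can run on what actually happened in a session, rather than on a coin flip made before it.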