How Datadog uses AI to build internal software delivery tools and improve system performance

At Datadog, we want our developers to get better at using AI tools, with the end goal of building quality software that generates real value, and building it faster. This includes not only the products and features that our customers use, but also the internal tools that keep our workflows running smoothly behind the scenes.

Accelerate investigations with AI in Datadog Incident Response

Engineering teams spend much of their incident response time investigating the problem and coordinating the response. Both tasks become harder when telemetry data lives in one place, deployment history is stored in another, and conversations unfold across chat channels and incident bridges. Responders often spend the first part of an incident rebuilding context before they can begin testing hypotheses and working toward resolution.

Datadog acquires Adaptive ML

Off-the-shelf models are easy to deploy, but they are rarely enough to solve complex, domain-specific challenges in production. The key to sustained AI value is not in the models themselves but in the ability to tune, evaluate, and refine those models against your organization’s real-time signals. We are excited to announce that Adaptive ML is joining Datadog to accelerate this vision by combining our deep observability data with their expertise in building specialized, high-performance AI agents.

5 pitfalls to avoid when measuring DevEx in the AI era

Developer experience, commonly known as DevEx, describes how an organization’s systems, workflows, tools, and culture affect developer productivity. A positive DevEx leads to tangible organizational benefits, including faster releases, increased innovation, and reduced technical debt. Measuring DevEx enables engineering management to quantify their team’s impact and understand where to direct improvement efforts.

Debug and evaluate your AI app from your coding agent with Datadog Agent Observability

Coding agents like Claude Code, Cursor, and Codex CLI handle the coding parts of building an AI application well. The harder work comes after: understanding why a response went wrong, building eval sets that reflect real production behavior, and keeping up with an application that changes faster than any one-off script can keep pace with. Teams spend 60–80% of their time on evaluation and error analysis, and much of that work needs to be redone every time the stack shifts.

The Journey to Achieving Hyperscale Availability with AI-Driven Prediction

At hyperscale, a regional cloud outage is not merely a technical disruption. For Samsung Account, which serves 2.1 billion users across three global regions, it is an immediate global service crisis. Fragmented, region-siloed monitoring creates blind spots that make early detection nearly impossible, leaving SRE teams perpetually reactive rather than predictive. The path to proactive reliability requires both a philosophical shift and a foundational change in how observability data is collected, unified, and reasoned over.

From Legacy to AI-Ops: Securing and Scaling Systems for 20M Device Requests with Datadog

Modernizing a legacy system serving 20 million devices without users noticing is like replacing a jet engine mid-flight. In this session, YoungJin Jung and Donggen Hong from LG U+ share their 18-month journey transforming a Telco-scale API Gateway from a rigid, proprietary solution into a high-performance, open-source architecture on AWS, and the operational challenges they solved along the way.

Ship Reliable AI Faster: How to Operate AI Agents with Control and Confidence

Replace "AI shipped on hope" with an operating model that holds up once real users depend on it. AI quality is multi-dimensional, covering accuracy, tone, safety, and faithfulness to user data, and can't be debugged from outputs alone. Without visibility into what their AI actually did in production, teams miss regressions, reverse-engineer chains by hand, and watch a single bad answer erode trust built over hundreds of right ones.

Reduce CDN log costs with searchable archives

Engineering teams that manage high-volume log sources, such as content delivery network (CDN) edges, streaming platforms, and authentication systems, often have to make a difficult retention tradeoff. Indexing every event keeps logs searchable during investigations, audits, and postmortems, but it can make long-term retention expensive.