Route your monitor alerts with Datadog monitor notification rules

As organizations scale their infrastructure, monitoring systems can become a source of noise rather than insight. A clean, straightforward set of alerts for a handful of services can quickly spiral into a mess of overlapping thresholds, redundant triggers, and inconsequential notifications across hundreds (or thousands) of components. This flood of notifications can slow response times, overwhelm engineers, and increase the chance of overlooking critical problems.
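Notification rules address this by routing alerts centrally instead of hard-coding recipients into every monitor. As a toy sketch of that idea (not Datadog syntax — the rule shapes, tags, and handles below are hypothetical):

```python
# Toy sketch of tag-based alert routing: one routing table replaces
# per-monitor recipient lists. Rules, tags, and handles are made up
# for illustration and are not Datadog configuration syntax.

def route_alert(tags, rules):
    """Return the notify targets of the first rule whose
    tag filter is a subset of the alert's tags."""
    tag_set = set(tags)
    for rule in rules:
        if set(rule["match_tags"]) <= tag_set:
            return rule["notify"]
    return ["@fallback-oncall"]  # default route for unmatched alerts

rules = [
    {"match_tags": ["env:prod", "team:payments"], "notify": ["@pagerduty-payments"]},
    {"match_tags": ["env:prod"], "notify": ["@slack-prod-alerts"]},
]

print(route_alert(["env:prod", "team:payments", "service:checkout"], rules))
# -> ['@pagerduty-payments']
```

Because rules are ordered most-specific first, a prod payments alert pages the payments rotation while any other prod alert falls through to the shared Slack channel.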

Improve SLO accuracy and performance with Datadog Synthetic Monitoring

SLOs are key for improving user satisfaction, prioritizing engineering projects, and measuring overall performance. Given the important role that SLOs play in determining organizational benchmarks, teams need to ensure that SLO metrics—also called service level indicators (SLIs)—are reported accurately and maintained consistently within an acceptable range.
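The arithmetic behind an availability SLO is simple: the SLI is the fraction of good requests, and the error budget is the failure the target still permits. A minimal sketch:

```python
# Minimal sketch of availability-SLO arithmetic. The request counts
# and the 99.9% target are example values, not measured data.

def sli(good, total):
    """Service level indicator: fraction of good requests."""
    return good / total

def error_budget_remaining(good, total, target):
    """Fraction of the error budget still unspent (negative once blown)."""
    allowed_failures = (1 - target) * total
    actual_failures = total - good
    return 1 - actual_failures / allowed_failures

# A 99.9% target over 1,000,000 requests allows 1,000 failures;
# 600 actual failures leaves roughly 40% of the budget unspent.
print(sli(999_400, 1_000_000))                            # 0.9994
print(error_budget_remaining(999_400, 1_000_000, 0.999))  # ~0.4
```

Tracking the remaining budget, rather than the raw SLI, makes it obvious when a team can still afford risky deploys and when it cannot.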

Trace Distributed Map states for AWS Step Functions with Datadog

AWS Step Functions offers the Distributed Map state, enabling you to coordinate massively parallel workloads within your serverless applications. With this feature, a single Step Functions execution can fan out into as many as 10,000 parallel child workflows, making it possible to process millions of items efficiently. This capability unlocks new possibilities for large-scale data processing, such as image transformation, log ingestion, or batch analytics.
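For orientation, a Distributed Map state looks roughly like the following Amazon States Language fragment, expressed here as a Python dict for illustration. The bucket, state names, and ARNs are hypothetical; consult the Step Functions documentation for the full schema.

```python
import json

# Hedged sketch of a Distributed Map state (Amazon States Language),
# built as a Python dict. Names, ARNs, and the bucket are placeholders.
distributed_map_state = {
    "ProcessImages": {
        "Type": "Map",
        "ItemReader": {  # stream items straight from S3 instead of the input payload
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": "example-images-bucket"},
        },
        "ItemProcessor": {
            "ProcessorConfig": {
                "Mode": "DISTRIBUTED",  # run each item as a separate child execution
                "ExecutionType": "EXPRESS",
            },
            "StartAt": "TransformOne",
            "States": {
                "TransformOne": {
                    "Type": "Task",
                    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:transform",
                    "End": True,
                },
            },
        },
        "MaxConcurrency": 10000,  # up to 10,000 parallel child workflows
        "End": True,
    }
}

print(json.dumps(distributed_map_state, indent=2))
```

The `Mode: DISTRIBUTED` setting is what distinguishes this from an inline Map state: each item becomes its own child workflow execution rather than an iteration inside the parent.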

Stay Compliant: Meet Your Audit Needs with Datadog!

Datadog's internal compliance team has built audit workflows and control monitoring capabilities using the Datadog platform. We actively use these capabilities to scale our audit programs and meet the requirements of multiple compliance frameworks. This session will detail how we addressed our compliance use cases with the Datadog platform and how our customers can get started.

How Cursor scaled infrastructure rapidly and reliably using Datadog

At Datadog, we use Cursor to empower our teams to build more quickly. And we know that building and troubleshooting with AI tools like Cursor works best with the right observability data and context. Discover how Cursor rapidly and reliably scaled their infrastructure 100x using Datadog to meet the needs of a fast-growing user base. And learn how we're bringing Datadog tools and context to your favorite AI IDEs and agents with our MCP Server and extensions.

AI-Augmented Control Plane: Scaling IT Operations with Intelligent Automation

How do you enable a team of 100 engineers to effectively support 300+ critical applications across five hosting platforms? At Thomson Reuters, we turned to AI, not as a buzzword but as a genuine force multiplier. Experience our journey of transforming traditional IT operations into an AI-augmented powerhouse, where Datadog, ServiceNow, and custom AI solutions work in harmony to create a next-generation control plane. We'll share real victories, honest challenges, and practical insights from our mission to build a more intelligent operational framework.

LLM Observability for Reliability and Stability: A Monitoring Strategy for Phone Communication

LLM APIs offer groundbreaking potential, but they also present challenges such as response latency, hallucinations, and service instability. In Japan, where telephone communication remains crucial for business, these issues pose significant barriers to the adoption of LLM-based applications. Despite being a relatively young startup, we have developed and deployed an LLM-based telephone service that has handled over 40 million calls.

Datadog + OpenAI: Codex CLI integration for AI-assisted DevOps

We are exploring how we can help on-call engineers troubleshoot incidents more effectively by providing the OpenAI Codex agent with access to real-time observability data in terminals. We've developed an integration and new tool visualizations that connect OpenAI's Codex CLI to the new Datadog MCP server. In this post, we'll share what we've been experimenting with: enabling an AI agent to retrieve production metrics, logs, and incidents from Datadog in real time and act on that context.

Optimize and troubleshoot AI infrastructure with Datadog GPU Monitoring

As organizations bring more AI and LLM workloads into production, the underlying GPU infrastructure becomes even more critical to keeping those workloads fast, reliable, and scalable. Inefficient GPU resource usage, for instance, can lead to longer runtimes and reduced throughput, negatively impacting overall model performance. Additionally, idle and underutilized GPUs can quickly drive up costs and lead to needless spending.
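To make the cost of idle GPUs concrete, here is a back-of-the-envelope sketch: utilization samples below a threshold count as idle time billed at the hourly rate. The rates, threshold, and samples are invented for illustration.

```python
# Back-of-the-envelope estimate of spend on near-idle GPU time.
# All rates, thresholds, and samples below are illustrative.

def idle_cost(samples, hourly_rate, sample_interval_s=60, idle_threshold=0.05):
    """Estimate spend on near-idle GPU time from utilization samples (0.0-1.0)."""
    idle_seconds = sum(sample_interval_s for u in samples if u < idle_threshold)
    return idle_seconds / 3600 * hourly_rate

# One hour of per-minute samples: 45 min busy, 15 min idle, at $2.50/hour.
samples = [0.85] * 45 + [0.0] * 15
print(round(idle_cost(samples, hourly_rate=2.50), 4))  # 0.625
```

Even 25 percent idle time on a single $2.50/hour GPU adds up to roughly $0.63 per hour of pure waste, which compounds quickly across a fleet.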

Datadog MCP Server: Connect your AI agents to Datadog tools and context

As development teams adopt AI-powered tools and build services that make use of AI agents, they want to extend their AI capabilities to incorporate familiar tools and observability data. However, AI agents struggle with regular API endpoints: they often misparse complex nested JSON hierarchies or mishandle errors. As a result, these agents frequently fail to retrieve relevant results.
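One common remedy, shown here as a toy sketch, is to flatten deeply nested API responses into simple dotted key/value pairs before handing them to a model; this is illustrative preprocessing, not the MCP server's actual implementation.

```python
# Toy sketch of flattening a nested API response into dotted
# key/value pairs that are easier for an agent to read than
# a raw JSON hierarchy. Not the Datadog MCP server's code.

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {'a.b.0.c': value} pairs."""
    flat = {}
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(i), v) for i, v in enumerate(obj))
    else:
        return {prefix: obj}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat

response = {"monitor": {"id": 42, "tags": ["env:prod"], "state": {"status": "Alert"}}}
print(flatten(response))
# {'monitor.id': 42, 'monitor.tags.0': 'env:prod', 'monitor.state.status': 'Alert'}
```

Flat paths like `monitor.state.status` are far harder for an agent to misparse than a three-level-deep object, which is the general idea behind agent-friendly tool responses.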