Latest Posts

How we've created a successful FinOps practice at Datadog

Jun 30, 2025 By David M. Lentz In Datadog

When you adopt FinOps to maximize the value of your cloud spending, there may be some simple first steps you can take to gain cost efficiency. For example, you can find and delete unused resources to quickly realize a one-time optimization. But the ongoing work of managing cloud costs becomes complex as your organization grows and your infrastructure spans multiple clouds, and tracking only the bottom line makes it hard to see the full value of your cloud spending.

Route your monitor alerts with Datadog monitor notification rules

Jun 27, 2025 By Khang Truong In Datadog

As organizations scale their infrastructure, monitoring systems can become a source of noise rather than insight. A clean, straightforward set of alerts for a handful of services can quickly spiral into a mess of overlapping thresholds, redundant triggers, and inconsequential notifications across hundreds (or thousands) of components. This flood of notifications can slow response times, overwhelm engineers, and increase the chance of overlooking critical problems.

Improve SLO accuracy and performance with Datadog Synthetic Monitoring

Jun 26, 2025 By Addie Beach In Datadog

Service level objectives (SLOs) are key for improving user satisfaction, prioritizing engineering projects, and measuring overall performance. Because SLOs play such an important role in setting organizational benchmarks, teams need to ensure that the metrics behind them, called service level indicators (SLIs), are reported accurately and kept consistently within an acceptable range.

Trace Distributed Map states for AWS Step Functions with Datadog

Jun 25, 2025 By Abhinav Vedmala In Datadog

AWS Step Functions offers the Distributed Map state, which enables you to coordinate massively parallel workloads within your serverless applications. With this feature, a single Step Functions execution can fan out into as many as 10,000 concurrent child workflows, making it possible to efficiently process millions of items. This capability unlocks new possibilities for large-scale data processing, such as image transformation, log ingestion, and batch analytics.

Datadog + OpenAI: Codex CLI integration for AI-assisted DevOps

Jun 12, 2025 By Reilly Wood In Datadog

We are exploring how we can help on-call engineers troubleshoot incidents more effectively by giving the OpenAI Codex agent access to real-time observability data directly in their terminals. We've developed an integration and new tool visualizations that connect OpenAI's Codex CLI to the new Datadog MCP server. In this post, we'll share what we've been experimenting with: enabling an AI agent to retrieve production metrics, logs, and incidents from Datadog in real time and act on that context.

Optimize and troubleshoot AI infrastructure with Datadog GPU Monitoring

Jun 10, 2025 By Anjali Thatte In Datadog

As organizations bring more AI and LLM workloads into production, the GPU infrastructure that supports them becomes even more critical to keeping those workloads fast, reliable, and scalable. Inefficient GPU resource usage, for instance, can lead to longer runtimes and reduced throughput, degrading overall model performance. Additionally, idle and underutilized GPUs can quickly drive up costs through needless spending.

Datadog MCP Server: Connect your AI agents to Datadog tools and context

Jun 10, 2025 By Bowen Chen In Datadog

As development teams adopt AI-powered tools and build services that use AI agents, they want to extend those AI capabilities to incorporate familiar tools and observability data. However, AI agents struggle with regular API endpoints: they frequently fail to parse complex nested JSON hierarchies or mishandle errors. As a result, these agents often fail to retrieve relevant results.

Automatically identify issues and generate fixes with Bits AI Dev

Jun 10, 2025 By Mike Leach In Datadog

Developers lose hours each week to a familiar troubleshooting loop: chase down telemetry across dashboards, decipher vague errors, and juggle alerts to find the signal worth fixing. Production issues, performance regressions, and security vulnerabilities all demand attention, but they often come with little context for taking action.

Improve performance and reliability with Proactive App Recommendations

Jun 10, 2025 By Yoann Robin In Datadog

As your organization grows, you may operate in increasingly complex environments, managing more services and larger teams to maintain them. This kind of growth can lead to an explosion of telemetry data from across your stack, including metrics, traces, logs, and frontend interactions. The benefit of greater visibility is often outweighed by the challenge of acting on the data you collect, and you can easily fall behind on implementing the fixes your services need to run reliably and efficiently.

Ensure trust across the entire data life cycle with Datadog Data Observability

Jun 10, 2025 By Nicholas Thomson In Datadog

As data systems grow more complex and data becomes even more business-critical, teams struggle to detect and resolve issues that affect data quality, reliability, and, ultimately, trust. Engineers have to rely on manual checks and ad hoc SQL queries to catch data quality issues, often only after the teams that depend on that data have noticed something has gone wrong.
