Operations | Monitoring | ITSM | DevOps | Cloud

Driving AI ROI: How Datadog connects cost, performance, and infrastructure so you can scale responsibly

AI innovation has accelerated faster than most organizations’ ability to monitor and manage it. The shift from experimentation to production-scale workloads has driven a new class of operational challenges: rising GPU costs, opaque model performance, and the difficulty of linking spend to business value. As AI investments grow, executives need a unified way to measure efficiency and return without slowing down innovation.

Detect, diagnose, and resolve network issues easily with CNM Network Health

In many organizations, developers, SREs, network engineers, and security teams work in specialized domains, which can make it hard to establish a shared view of network health. As a result, engineers often struggle to determine when a network problem that originates outside of their domain of expertise is the root cause of an incident. This lack of visibility slows investigations and delays remediation.

Drive business outcomes with Unit Economics in Datadog Cloud Cost Management

See how Datadog turns cloud usage and performance data into actionable business insights by helping teams calculate unit economics to measure and optimize the efficiency of every service. You’ll discover how to: Datadog bridges the gap between cloud costs and business value—helping organizations get the most value out of their cloud investment.

How microservice architectures have shaped the usage of database technologies

In the late 2000s, the big question in database design was SQL or NoSQL. While relational databases had long held their ground, document and key-value stores were emerging as serious alternatives. Many predicted a zero-sum, winner-take-all outcome. But when we look at how organizations are using database technologies today, no single tool or category has dominated the landscape.

Securing customer logins with breach intelligence

Account takeovers (ATOs) are one of the most common threats facing online platforms. Attackers buy leaked usernames and passwords on underground markets then test them at scale across websites, hoping that password reuse will give them easy access. Today, ATOs have grown so sophisticated and fast-moving that manual incident response often can’t keep pace, requiring intelligent defense systems for detecting compromised credentials and preventing misuse at scale.

Turning errors into product insight: How early-stage teams can connect engineering data to user impact

Early-stage engineering teams ship fast and learn in production. While speed is a competitive advantage, it can also lead to a high volume of noisy signals, like stack traces, timeouts, and dashboards full of red. Some of those problems can affect your users and revenue, but many don’t.

A FinOps engineer's guide to governing custom metrics

This guest blog post is authored by Dieter Matzion, a seasoned cloud practitioner who has operated exclusively in public cloud environments since 2013, with experience at leading technology companies including Google, Netflix, Intuit, and Roku. Custom metrics play a crucial role in enabling teams to monitor their applications and businesses. The flexibility of these metrics allows engineers to measure what matters most to their domain.

Day 2 with Cilium: Small configurations that keep large clusters boring

Operating Cilium at a small scale is straightforward. You install the Helm chart, choose a routing mode, and apply a few network policies. Day 1 is about getting packets to flow. Day 2 is about keeping them boring. At Datadog, we run Cilium across hundreds of Kubernetes clusters, tens of thousands of nodes, and hundreds of thousands of pods in multiple clouds. When operating at this scale, small configuration choices stop being minor details and start becoming risk multipliers.