A guide to cloud unit economics

As you analyze your organization's cloud spending, you'll often find that stakeholders disagree about what that spending delivers. This is especially true when overall costs are rising and it's hard to distinguish waste from valuable investments in growth. But when finance, engineering, and product teams can all connect cloud spending to specific business outcomes, you can make data-driven decisions about how to maximize the value of that spending.
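At its simplest, a unit cost divides cloud spend by a business metric that the spend supports, such as active customers. The following Python sketch uses a hypothetical `PeriodUsage` record and made-up figures. It shows how total spend can rise while the unit cost falls, which is the kind of signal that helps separate investment in growth from waste.

```python
from dataclasses import dataclass


@dataclass
class PeriodUsage:
    """Hypothetical record pairing one period's cloud bill with a business metric."""
    period: str
    cloud_cost_usd: float
    active_customers: int


def cost_per_customer(usage: PeriodUsage) -> float:
    """Unit cost = total cloud cost for the period / units of business output."""
    if usage.active_customers <= 0:
        raise ValueError("active_customers must be positive")
    return usage.cloud_cost_usd / usage.active_customers


# Hypothetical figures: spend grows 25%, but customers grow 50%.
history = [
    PeriodUsage("2025-01", 120_000.0, 4_000),
    PeriodUsage("2025-02", 150_000.0, 6_000),
]

for usage in history:
    # Prints $30.00 for January and $25.00 for February.
    print(usage.period, f"${cost_per_customer(usage):.2f} per customer")
```

The choice of denominator (customers, requests, transactions) is a business decision. Pick one that finance, engineering, and product all agree reflects the value the spending produces.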

Patterns for safe and efficient cache purging in CI/CD pipelines

"There are only two hard things in Computer Science: cache invalidation and naming things." (Phil Karlton)

In the age of increasingly frequent deploys, edge caching, and Jamstack adoption, caching plays a key role across the software delivery life cycle. In build and CI pipelines, caching compiled assets or dependencies helps reduce compute costs, speed up job runtimes, and cut the energy consumed by repeated builds.
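One common way to make CI cache invalidation largely automatic is to derive the cache key from a hash of the inputs that determine the cache's contents, such as a dependency lockfile. Any change to those inputs then produces a new key instead of requiring a manual purge. The `cache_key` helper and its `prefix-os-digest` layout below are hypothetical. The sketch illustrates the general technique rather than any particular CI provider's API.

```python
import hashlib


def cache_key(prefix: str, runner_os: str, *inputs: bytes) -> str:
    """Build a content-addressed cache key (hypothetical layout: prefix-os-digest).

    Any byte-level change to an input (e.g. a lockfile) yields a different
    digest, so a stale cache is simply never matched rather than purged.
    """
    hasher = hashlib.sha256()
    for data in inputs:
        # Length-prefix each input so (b"ab", b"c") and (b"a", b"bc") hash differently.
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return f"{prefix}-{runner_os}-{hasher.hexdigest()[:16]}"


lockfile_v1 = b'{"dependencies": {"left-pad": "1.3.0"}}'
lockfile_v2 = b'{"dependencies": {"left-pad": "1.3.1"}}'

print(cache_key("deps", "linux", lockfile_v1))
print(cache_key("deps", "linux", lockfile_v2))  # differs: a dependency was bumped
```

Key-based invalidation handles correctness, not storage. The previous entry remains in the cache until an eviction or expiry policy removes it.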

This Month in Datadog - July 2025

In July’s episode of This Month in Datadog, we’re doing things differently by spotlighting the people behind the products you rely on. Jeremy is joined by Tristan Ratchford to discuss saving time and effort when you’re on call with Bits AI SRE, and by Kevin Hu to explore gaining visibility into datasets across the entire data lifecycle with Data Observability.

Datadog Disaster Recovery mitigates cloud provider outages

Losing infrastructure and application observability can leave SRE and DevOps teams without insight into the real-time state of their production systems, forcing them to pause code deployments and limiting their ability to troubleshoot issues or respond to critical alerts. In modern cloud environments, where services are distributed and deeply interconnected, this lack of visibility can escalate quickly.

Bring high-performance observability to secure Kubernetes environments with Datadog's new CSI driver

In Kubernetes environments, applications often communicate with the Datadog Agent to send telemetry data such as custom metrics via DogStatsD or traces through Datadog APM. How this communication takes place depends on the communication mode set on the Datadog Cluster Agent's Admission Controller. With the socket option, communication takes place locally through inter-process communication over Unix domain sockets (UDS), whereas the service option and the default hostip option rely on network communication.
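The following Python sketch makes the difference between the modes concrete. It sends the same DogStatsD datagram either over a Unix domain socket or over UDP to a network address. The socket path `/var/run/datadog/dsd.socket` and UDP port 8125 are common defaults, but treat them as assumptions and check your Agent configuration. The helper names are hypothetical, and in practice you would normally use a DogStatsD client library rather than raw sockets.

```python
import socket
from typing import Optional, Sequence

# Assumed defaults; verify against your Agent and Admission Controller config.
DEFAULT_SOCKET_PATH = "/var/run/datadog/dsd.socket"
DEFAULT_UDP_PORT = 8125


def format_metric(name: str, value: float, metric_type: str = "c",
                  tags: Optional[Sequence[str]] = None) -> str:
    """Build a DogStatsD datagram: <name>:<value>|<type>[|#tag1,tag2]."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_over_uds(datagram: str, path: str = DEFAULT_SOCKET_PATH) -> None:
    """Socket mode: local IPC over a Unix domain datagram socket (no network stack)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), path)


def send_over_udp(datagram: str, host: str, port: int = DEFAULT_UDP_PORT) -> None:
    """Network modes (e.g. a host IP or service address): plain UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))
```

For example, `send_over_uds(format_metric("page.views", 1, tags=["env:prod"]))` writes `page.views:1|c|#env:prod` to the socket. This requires the socket to be mounted into the application's pod, which is what the socket mode arranges.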

Why continuous profiling is the fourth pillar of observability

Developers have long used profilers to diagnose performance bottlenecks and improve the efficiency of their code. But a modern version of profiling, continuous profiling, is quietly redefining what profiling is and what it can do. By running nonstop in production with very low overhead, continuous profilers give teams always-on visibility into how their code behaves in the real world.

How Datadog Cloud Network Monitoring helps you move to a deny-by-default network egress policy at scale

When organizations first begin deploying workloads on Kubernetes, it's common for them to start with a permissive egress traffic policy that allows any workload to reach the internet. This approach can make it easier for teams to stay agile and to get services up and running in fast-moving environments. But as your Kubernetes footprint grows, it's important to minimize public internet access on a per-workload basis to improve your organization's security posture.

Monitor Lambda-hosted web apps with the Lambda Web Adapter integration

As organizations migrate their legacy web applications from containerized or server-based deployments to serverless environments, they often run into a critical compatibility challenge. Traditional web frameworks like Flask, Express, or Spring Boot are designed to run on persistent HTTP servers, not in event-driven, stateless environments like AWS Lambda. The AWS Lambda Web Adapter bridges this gap by allowing teams to run web server-based applications inside Lambda with minimal changes.

Choosing the right OpenTelemetry Collector distribution

The OpenTelemetry (OTel) Collector plays a central role in collecting, processing, and exporting telemetry data. If you’re deploying the Collector in production, chances are you’ve reached for the otelcol-contrib distribution. It’s the easiest, most flexible, and most documented distribution, used in nearly every demo and getting-started guide. But here’s the catch: It’s not actually recommended for production use.

Missing container-layer metadata: Why it happens and what you can do

Container image layers provide valuable insight into what goes into a container, including which packages were installed, what commands were run, and where vulnerabilities might live. The metadata associated with these image layers is essential for debugging, optimizing image size, and managing security risks. However, key container-layer metadata fields such as digest, size, and created_by are sometimes missing, which can disrupt important tasks.