Eliminate cloud waste across AWS, Azure, and Google Cloud with Cloud Cost Recommendations

As organizations increasingly adopt multi-cloud strategies, identifying areas to reduce cloud spend has become highly complex and time-consuming. While there are many reasons that organizations choose to run their infrastructure in a multi-cloud environment, many do so to comply with regional data requirements, take advantage of best-of-breed offerings, or avoid vendor lock-in.

Reduce cloud waste with Datadog Cost Recommendations

Struggling to optimize your cloud spend across AWS, Azure, and Google Cloud? Datadog Cloud Cost Management highlights underutilized or legacy resources and lets engineers take immediate action using Datadog Workflows. Eliminate waste and drive savings with recommendations that your teams can trust.

Optimize Kubernetes and Container Costs with Datadog Cloud Cost Management

Struggling to understand the true cost of your Kubernetes workloads? With Datadog Cloud Cost Management, you can automatically allocate container costs by team, product, and service down to the pod. Instantly identify idle resources, surface optimization opportunities, and act with confidence, all in one unified platform.

How to surface misconfigured resources by defining policies | Datadog Tips & Tricks

Misconfigured infrastructure resources can be easy to miss, especially in multi-account or multi-cloud environments. From EKS clusters running on deprecated versions to RDS engines on extended support, these issues can disrupt services or drive up costs if left unchecked. In this video, we show you how to surface misconfigured resources by defining policies. By centralizing policies, you’ll gain a clear view of where to focus your remediation efforts.

We vibe coded a path tracer: Here's how we used static and dynamic analysis to fix it

When developing software, the longer you intend to keep a system around, the more important it becomes to prioritize its code quality. But as more organizations move toward microservice architectures and adopt agentic AI and LLMs into their development workflows, many engineering teams have increased their emphasis on accelerating developer velocity, often at the expense of code quality. The result can be code that fails to meet standards for performance, reliability, and security.

Put Cloud Costs in Front of Engineers with Datadog Cloud Cost Management

Tired of surprises on your cloud bills? With Datadog Cloud Cost Management integrated into the Software Catalog, engineers see cost, performance, and reliability side by side—no context switching required. Give every service owner the visibility they need to make cost-aware decisions.

Track Cloud Unit Economics with Datadog Cloud Cost Management

Do you know the true cost per user, API call, or checkout? Datadog Cloud Cost Management lets you break down spend by combining cost, observability, and custom business metrics—all in one place. Track cost per transaction, alert on changes, and align engineering and finance with real-time unit economics.

What's new for scheduling and resource management in Kubernetes v1.34?

Kubernetes v1.34, which is scheduled for release on August 27, 2025, focuses on improved scheduler visibility, deeper life cycle observability, and enhanced resource management. As always, the list of changes and improvements in the official changelog is extensive, and cluster operators may be wondering which changes are most important. If you're operating a monitoring platform or depend on deep Kubernetes observability, here's how a number of new features will affect your workflows.

Manage your dashboards and monitors at scale

In the early stages of building a system, a few well-placed dashboards and monitors can provide sufficient visibility into service health and performance. However, as infrastructure scales and teams grow, so does the complexity of the monitoring landscape. In organizations where individual teams manage their own services but rely on a central platform or observability team for tooling and guidance, this complexity can quickly multiply.

Instrument your Azure Container Apps workloads with the new Datadog Agent sidecar

Modern application development is evolving rapidly, with serverless containers and microservices becoming the standard for scalable, resilient architectures. Azure Container Apps is at the forefront of this movement, enabling developers to deploy containerized applications without having to manage infrastructure.

Identify slowdowns across your entire network with Datadog Network Path

As modern infrastructure becomes increasingly distributed across on-premises data centers, multi-cloud environments, ISPs, and remote offices, understanding how traffic flows across your network is critical to delivering reliable performance and great user experiences. But pinpointing the source of network slowdowns remains one of the most persistent challenges for operations, network, and IT teams.

Datadog governance 101: From chaos to consistency

As your organization scales, managing observability resources and usage becomes increasingly important. More users and teams mean more dashboards, tags, API keys, and costs to manage. The job of keeping track of these resources and ensuring that they’re compliant can quickly grow in complexity.

How we saved $1.5 million per year with Cloud Cost Management

In collecting and analyzing trillions of events each day, Datadog ingests a massive amount of data. We spend substantially to process and store this data in the cloud, and teams across the organization are committed to optimizing the return on this investment. To this end, our FinOps analysts have always tracked the costs of delivering our services and identified opportunities for savings.

How to use AI tools more effectively: Tips from Datadog Engineers

A growing number of engineering organizations have adopted or are trialing agentic AI-based coding tools and LLMs in an effort to increase their teams’ development velocity. If you’re a developer, this means you’ve likely had to try out different agentic tools and models and determine how to best incorporate them into your existing workflows.

Monitor Claude usage and cost data with Datadog Cloud Cost Management

Managing the cost of foundation models is a critical challenge as AI adoption surges, particularly for teams using powerful models like Anthropic's Claude Opus and Claude Sonnet. As teams grow, prompt volumes increase and model complexity escalates, making it difficult to maintain clear visibility, accountability, and control over cloud AI spending.

Simplify XML log collection and processing with Observability Pipelines

In Microsoft-based environments, Windows event logs capture critical security events like user logins, privilege escalations, and system changes. These logs are vital for compliance and investigations. However, they’re natively formatted in XML, a verbose and deeply nested structure that is hard to search without preprocessing and inefficient to store.

Build secure and scalable Azure serverless applications with the Well-Architected Framework

Serverless platforms like Azure Functions and Azure Container Apps make it easier to scale your applications without managing infrastructure. But successful serverless apps require thoughtful planning. They must be designed to account for cold starts, unpredictable scaling behavior, and ephemeral compute lifecycles, all while ensuring secure data handling and end-to-end observability across highly distributed components.

Keep an eye on remote access to your Kubernetes infrastructure with Datadog Workload Protection

To improve efficiency and reduce cloud spending, teams frequently schedule pods on Kubernetes nodes dynamically, based on available resources. However, this practice has also introduced a new security challenge: The workloads maintained by a development team are now spread between Kubernetes nodes, exposing more hosts and increasing the blast radius when user credentials are compromised.

Tracing asynchronous systems in your event-driven architecture: When to use parent-child vs. span links

Asynchronous communication patterns are commonly used in distributed systems, especially in those that rely on events or messages to coordinate activity. Rather than responding to direct API calls like in a traditional request-response architecture, services in an asynchronous system produce, route, or consume events and messages independently.

How to build reliable and accurate synthetic tests for your mobile apps

Mobile applications offer increased flexibility to both users and developers. Users can access content on a wide range of devices, operating systems, and network types, while developers can leverage touch screens and orientation-based layouts to create more responsive features. However, all of these factors create new testing challenges. To ensure a good user experience (UX), developers have to test their apps across many device models and platforms, which can become costly and time-consuming.

Prevent cloud misconfigurations from reaching production with Datadog IaC Security

Modern infrastructure is built and deployed faster than ever, but increased speed can elevate risk. Developers who work on cloud-native applications often use infrastructure as code (IaC) to define cloud resources in configuration files, which are then shared across teams and deployed automatically. Although this approach is efficient, undetected misconfigurations in IaC can quickly introduce security risks into production environments.

A guide to cloud unit economics

As you analyze your organization's cloud spending, you'll often find that stakeholders have different perceptions of what that spending brings you. This is especially true when overall costs are rising and it's hard to distinguish waste from valuable investments in growth. But when finance, engineering, and product teams can all connect cloud spending to specific business outcomes, you gain the ability to make data-driven decisions about how to maximize the value of that spending.

Patterns for safe and efficient cache purging in CI/CD pipelines

"There are only two hard things in Computer Science: cache invalidation and naming things." (Phil Karlton)

In the age of increasingly frequent deploys, edge caching, and Jamstack adoption, caching plays a key role across the software delivery life cycle. In build and CI pipelines, caching compiled assets or dependencies helps reduce compute costs, speed up job runtimes, and lower the energy usage, and with it the environmental impact, of repeated builds.