Operations | Monitoring | ITSM | DevOps | Cloud

How Datadog Feature Flags is resilient to cloud provider failures

As major incidents like AWS’s October 2025 outage illustrate, modern systems are immensely interconnected. A failure in one can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions with other AWS services and, subsequently, thousands of applications and services that rely on that infrastructure.

Optimizing Ruby performance: Observations from thousands of real-world services

Over the past three decades, Ruby has assumed a pivotal role in the modern web stack and become a fixture in the tool kits of countless DevOps and platform teams. Today, it is a driving force in contemporary application development, testing, automation, and CI/CD. For this blog post, we used data from our always-on continuous profiling of more than 3,000 real-world services from hundreds of organizations to track trends in Ruby usage and performance.

Introducing Datadog Agent Builder: Build agentic workflows for alert response and remediation

Building automated workflows that adapt to real-world complexity can be a challenge. As systems scale and scenarios multiply, teams often end up hardcoding endless logic branches just to handle every potential outcome. That’s why we’re introducing Datadog Agent Builder, a powerful new tool that lets you create custom AI agents that are fully hosted by Datadog.

Sync your Backstage catalog with Datadog IDP

Backstage is a popular open source framework for building internal developer portals (IDPs) used by organizations to aggregate service metadata and create a single source of truth for their software developers. However, data stored in the Backstage Software Catalog can quickly become siloed and inaccessible from monitoring tools such as Datadog.

Eliminate unnecessary costs in your Amazon S3 buckets with Datadog Storage Management

Cloud object storage powers a wide range of workloads, from AI training datasets to customer-facing media libraries. As your data grows into the petabyte scale, managing storage costs and ensuring reliability requires fine-grained visibility. You need answers to questions like: Which specific teams, services, workloads, or datasets are driving spend? Which data is cold and should be archived? What fixes will have the biggest impact on cost and performance?

Observability and FedRAMP in Action: The VA's Mission to Deliver Reliable Digital Service

Ensuring digital services remain accessible, reliable, and secure is a high priority for any organization operating at scale. For the Department of Veterans Affairs (VA), this focus is central to its mission of providing quality care to veterans, their families, and caregivers. Often described as “the largest IT shop in the United States,” the VA manages 2.7 million pieces of equipment across a vast network of interconnected systems.

How feedback loops power progressive software delivery

Modern engineering teams face competing priorities. Developers are expected to deliver new features faster than ever, but users expect rock-solid reliability with every release. Shipping quickly can feel like you’re gambling with user trust. If you move too fast, you risk outages, but if you move too slowly, innovation stalls.

Import Snowflake, Salesforce, ServiceNow, and Databricks metadata into Datadog with Reference Tables

Engineering, operations, and security teams can struggle to make sense of their telemetry data in isolation. Logs, metrics, and events tell what is happening but are often missing critical metadata like who owns what, where it's coming from, or indicators of attack. These gaps in visibility slow down incident response, complicate cost control, and make business or security analytics much harder.

Catch and remediate ECS issues faster with default monitors and the ECS Explorer

Organizations that run applications on Amazon Elastic Container Service (Amazon ECS) often juggle signals across container and task metrics, logs, and events while they hunt for the change or condition that broke a deployment. This work adds operational overhead and extends incident timelines as teams switch between tools and manually correlate symptoms.

Key learnings from the State of Containers and Serverless report

We recently released the 2025 State of Containers and Serverless report, which examines cloud usage data from tens of thousands of Datadog customers. The study shows adoption trends across container orchestration platforms and serverless offerings, and it explores how organizations use those resources to optimize workloads for efficiency, cost, and simplicity.