Coordinate large-scale engineering initiatives with IDP Campaigns

As organizations grow, engineering leaders often need to drive cross-team initiatives such as reducing cloud spend, upgrading runtimes, or strengthening security controls. Tracking this work can quickly become fragmented across spreadsheets, dashboards, and status meetings. Progress is hard to measure, accountability is unclear, and the impact of each effort can be difficult to demonstrate.

Use OpenTelemetry with Observability Pipelines for vendor-neutral log collection and cost control

Many DevOps and security teams now operate in complex, hybrid, or multi-vendor environments. As more teams look to avoid lock-in by adopting open standards, OpenTelemetry (OTel) is quickly becoming the primary open source method for instrumenting and aggregating telemetry data. However, OTel alone may lack the advanced processing functions, native volume control rules, and hybrid environment support that large organizations need.

How Datadog Feature Flags is resilient to cloud provider failures

As major incidents like AWS’s October 2025 outage illustrate, modern systems are deeply interconnected. A failure in one system can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions in other AWS services and, subsequently, in thousands of applications and services that rely on that infrastructure.

Explore Cloud Instance Pricing and Performance with Datadog Instance Explorer

Meet Datadog Instance Explorer, a way to explore, compare, and monitor cloud instance pricing and performance across AWS, Azure, and Google Cloud in one place. Start exploring your instance options today and make smarter, data-driven infrastructure decisions.

Optimizing Ruby performance: Observations from thousands of real-world services

Over the past three decades, Ruby has assumed a pivotal role in the modern web stack and become a fixture in the tool kits of countless DevOps and platform teams. Today, it is a driving force in contemporary application development, testing, automation, and CI/CD. For this blog post, we used data from our always-on continuous profiling of more than 3,000 real-world services from hundreds of organizations to track trends in Ruby usage and performance.

Introducing Datadog Agent Builder: Build agentic workflows for alert response and remediation

Building automated workflows that adapt to real-world complexity can be a challenge. As systems scale and scenarios multiply, teams often end up hardcoding endless logic branches just to handle every potential outcome. That’s why we’re introducing Datadog Agent Builder, a powerful new tool that lets you create custom AI agents that are fully hosted by Datadog.

Datadog GPU Monitoring: Optimize and troubleshoot AI infrastructure

With Datadog GPU Monitoring, engineering and ML teams can monitor GPU fleet health across cloud, on-prem, and GPU-as-a-Service platforms like CoreWeave and Lambda Labs. Real-time insights into allocation, utilization, and failure patterns make it easy to spot bottlenecks, eliminate idle GPU spend, and resolve provisioning gaps. By tying usage metrics directly to cost and surfacing hardware and networking issues that affect performance, Datadog helps teams make fast, cost-efficient decisions to keep AI workloads running reliably at scale.

Bringing Observability to Data

While observability practices have evolved in recent years, they have largely focused on application services and infrastructure. Yet it is data that powers our applications, businesses, and AI models. When data issues occur, the consequences can be far-reaching, from poor product experiences to billing errors to misinformed AI outcomes. In this session, Jonathan Morin, Group Product Manager at Datadog, shares real-world examples of incidents and explains how data observability can address them, helping teams detect issues earlier, reduce costly downtime, and restore trust in their data.

The Hidden Bottleneck in Latency: GetYourGuide's Database Performance Journey

Fast front-end and back-end code alone won’t guarantee low end-to-end latency, because hidden bottlenecks in the database can undermine even the best engineering efforts. In this session, Oleksii Serhiienko, Senior Site Reliability Engineer at GetYourGuide, will share how his team put database performance at the center of their monitoring strategy. He will highlight how they identified and fixed slow queries, uncovered load balancing issues that drove significant cost savings, and built monitoring practices that improved both reliability and investigation workflows.

Use Grok parsing to extract fields from logs | Datadog Tips & Tricks

When your logs don’t follow a standard format, it can be difficult to extract valuable information, like key-value pairs and nested JSON objects. Grok parsing lets you define flexible patterns that match unstructured log data so you can extract specific fields to query, filter, and visualize. By refining your Grok parsers, you can make your logs more useful for analytics, dashboards, and alerts.
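
To make the idea of pattern-based field extraction concrete, here is a minimal Python sketch that uses regular expressions with named groups as a rough analogue of a Grok rule. It is not Datadog's Grok syntax or API. The log format, the sample line, and the names `LINE_PATTERN`, `KV_PATTERN`, and `parse_line` are all assumptions made up for illustration.

```python
import re

# Hypothetical log layout, invented for this sketch:
#   <timestamp> <LEVEL> [<service>] <free-text message with key=value pairs>
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<service>[^\]]+)\]\s+"
    r"(?P<message>.*)"
)

# Matches key=value pairs embedded in the message text.
KV_PATTERN = re.compile(r"(?P<key>\w+)=(?P<value>\S+)")


def parse_line(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Pull key=value pairs out of the message into their own mapping.
    fields["attributes"] = {
        m.group("key"): m.group("value")
        for m in KV_PATTERN.finditer(fields["message"])
    }
    return fields


sample = (
    "2025-01-15T10:32:00Z ERROR [checkout] "
    "payment failed user_id=42 duration_ms=318"
)
parsed = parse_line(sample)
print(parsed["level"], parsed["service"], parsed["attributes"])
```

Each named group plays the role of an extracted attribute: once a line is split into `level`, `service`, and key-value attributes, those fields can be queried and filtered instead of searching raw text. Lines that do not fit the pattern return `None`, which makes it easy to spot formats that need their own rule.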