
How we built an AI SRE agent that investigates like a team of engineers

We built Bits AI SRE to help engineers investigate and solve production incidents, one of the most difficult aspects of operating distributed systems today. As environments grow more dynamic and complex, resolving issues becomes more challenging. Failures now span more services, involve noisier signals, and encompass larger volumes of telemetry data, making it hard for on-call engineers to find root causes quickly. Today, Bits AI SRE is already helping teams decrease time to resolution by up to 95%.

Automate flaky test fixes with the Bits AI Dev Agent and Test Optimization

Flaky tests are a significant source of inefficiency that impacts many engineering teams. Along with failing your build, they interrupt your entire development flow, generate excessive CI/CD noise, and, critically, compromise developer trust in the test suite itself. Datadog Test Optimization enables you to manage test suites at scale by pinpointing the flakiest tests, analyzing their history across hundreds of runs, and automatically surfacing the root cause.
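
To make "pinpointing the flakiest tests" concrete, here is a minimal sketch of one common working definition of flakiness: a test is flaky if it both passed and failed on the same commit. This is an illustration only, not how Test Optimization is implemented; the `find_flaky_tests` function and the `(test_name, commit_sha, passed)` record shape are assumptions made for the example.

```python
from collections import defaultdict
from typing import Iterable


def find_flaky_tests(runs: Iterable[tuple[str, str, bool]]) -> list[str]:
    """Return names of tests that both passed and failed on the same commit.

    Each run is a hypothetical (test_name, commit_sha, passed) record.
    """
    # Collect the set of outcomes seen for each (test, commit) pair.
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test_name, commit_sha, passed in runs:
        outcomes[(test_name, commit_sha)].add(passed)

    # Two distinct outcomes on identical code means the result is not deterministic.
    flaky = {name for (name, _sha), seen in outcomes.items() if len(seen) == 2}
    return sorted(flaky)


runs = [
    ("test_checkout", "abc123", True),
    ("test_checkout", "abc123", False),  # same commit, different outcome
    ("test_login", "abc123", True),
    ("test_login", "def456", False),     # different commits: may be a real regression
]
print(find_flaky_tests(runs))
```

Grouping by commit is what separates flakiness from regressions: `test_login` changes outcome only when the code changes, so it is not reported.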

Datadog integrations 2025 recap: Observability for AI, security, and hybrid cloud

The year 2025 marked a major milestone in the Datadog integrations ecosystem as we surpassed 1,000 integrations. Along the way, we also added over 110 new technology partners and expanded coverage across the fastest-growing software categories, including AI, distributed security, hybrid infrastructure, and data intelligence. This recap highlights the most impactful integrations we released this year and how they connect to these broader technology trends.

Bring faster visibility into AWS Lambda functions with remote instrumentation

Comprehensive observability is critical for running performant, reliable, and secure serverless workloads. However, configuring and maintaining that visibility across hundreds or thousands of serverless functions can be difficult to scale and sustain. Developers across teams often manage serverless functions using different infrastructure as code (IaC) frameworks, as well as different review, deployment, and update processes.

Build custom apps in seconds with conversational AI in App Builder

Using a drag-and-drop interface, engineering teams can create apps that support troubleshooting, improve day-to-day operations, and offer self-service access without leaving Datadog. With the new conversational AI feature, teams can turn an idea into a working app in seconds. Watch the video to see how it works.

Implement dbt data quality checks with dbt-expectations

dbt is one of the most popular solutions for data transformations and modeling. Many commercial data pipelines rely on dozens, or even hundreds, of individual dbt jobs. Data engineers, data platform engineers, and analytics engineers who own these pipelines need to maintain a testing framework to prevent mistakes in data processing that can compromise analysis.

Troubleshoot faster with the GitLab Source Code integration in Datadog

Developers and SREs who rely on GitLab to develop their services often face significant friction when troubleshooting errors or fixing issues that degrade code quality. To understand the context of a problem, they resort to tab-hopping between observability tools and GitLab, connecting stack traces, spans, and profiles back to the right files and commits.

Check out features we announced at AWS re:Invent in the latest episode of This Month in Datadog

Tune in for spotlights of Bits AI SRE, now generally available, and Datadog’s MCP Server, which connects AI agents to our platform by ingesting prompts and mapping them to Datadog resources and data. Plus, we cover how to search logs at petabyte scale in your own infrastructure with CloudPrem, break down cost drivers at the prefix level with Storage Management, create workflows that adapt to real-world complexity with Agent Builder, and detect and block credential leaks with Secret Scanning.

Normalize any logs for Cloud SIEM with Datadog's OCSF processor

Security teams need visibility across every system they defend, including cloud platforms, SaaS applications, security controls, identity providers, and custom services. But those systems all produce logs in different formats, with inconsistent field names and structures. That lack of standardization makes it harder to correlate events, write reusable detections, and investigate incidents quickly.
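
As a rough illustration of what normalization buys you, the sketch below maps two differently shaped log records onto one set of shared field names, so a single detection can read both. It does not use the OCSF processor or the actual OCSF schema: the source names, the raw field names, and the simplified target fields (`time`, `user`, `src_ip`) are all assumptions made for the example.

```python
from typing import Any

# Hypothetical per-source mappings: raw field name -> shared field name.
# The shared names are a simplified stand-in, not real OCSF attributes.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "idp": {"timestamp": "time", "userName": "user", "clientIp": "src_ip"},
    "firewall": {"ts": "time", "account": "user", "source_address": "src_ip"},
}


def normalize(source: str, record: dict[str, Any]) -> dict[str, Any]:
    """Rename a record's known fields to the shared names and tag its source."""
    mapping = FIELD_MAPS[source]
    normalized = {shared: record[raw] for raw, shared in mapping.items() if raw in record}
    normalized["source"] = source
    return normalized


idp_event = {"timestamp": "2025-01-01T00:00:00Z", "userName": "alice", "clientIp": "10.0.0.5"}
fw_event = {"ts": "2025-01-01T00:00:03Z", "account": "alice", "source_address": "10.0.0.5"}

events = [normalize("idp", idp_event), normalize("firewall", fw_event)]

# One reusable check now works across both sources.
matches = [e for e in events if e["user"] == "alice" and e["src_ip"] == "10.0.0.5"]
print(len(matches))
```

Without the shared names, the same check would need one variant per source format, which is the correlation and reuse problem described above.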