Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

Datadog Bits AI SRE: Your new teammate for on-call shifts

Bits AI SRE is an always-on SRE agent built to handle complex troubleshooting and late-night alerts. Developed against thousands of real-world incidents and powered by Datadog’s platform, Bits AI SRE analyzes your entire stack, tests hypotheses, and identifies root causes in minutes. Resolve faster, get back to sleep sooner, and give your on-call team the confidence and capacity they need.

9 Monitoring Tools That Deliver AI-Native Anomaly Detection

The observability market has moved beyond manual threshold-setting. Modern platforms use statistical algorithms, machine learning, and causal AI to detect anomalies automatically. Some work immediately after deployment. Others train on your data for better accuracy. Each approach has technical trade-offs worth understanding. This guide compares how nine monitoring solutions handle automated anomaly detection and root cause analysis.

Instrument Jenkins With OpenTelemetry

You can instrument Jenkins with OpenTelemetry using the official plugin and an OpenTelemetry Collector, then send the data to a backend like Last9 to understand where pipeline latency and failures actually originate. Jenkins provides job status and console logs, but it doesn't show how time is distributed across stages, agents, plugins, and external systems. OpenTelemetry fills that gap by emitting traces, metrics, and logs in a standard format that any OTLP-compatible backend can process.

Pastries with SREs: Holding onto extra observability data and desserts

In this episode of Pastries with SREs, we dig into why you should keep all of your observability data, even if you don’t need it quite yet. We explore: With enriched logs and flexible, cost-effective storage, you can stop worrying about what you might need later and start answering questions with confidence, no matter when they arise. Additional resources.

How AI Agents Are Redefining the SRE Role

Even the best site reliability engineers (SREs) spend too much time doing reactive work—triaging incidents, gathering context, escalating to the right teams, and documenting what happened. That work is essential, but it’s not where an SRE’s highest value lies. These engineers are hired to build and maintain resilient systems, not play air-traffic control with every alert that hits their queue.

Lessons from KubeCon: What "Best-of-Breed" AI SRE Really Requires

This year’s KubeCon underscored a real shift: AI SRE has gone mainstream. Of course, it’s not a surprise. Teams from high-growth startups to Fortune 500s are running more complex, cloud-native systems, shipping more AI-generated code, and facing rising expectations. Downtime is absolutely not an option and the work for on-call SREs has become unsustainable. The question isn’t whether AI SRE helps. It’s which one you can trust in production.

7 Observability Solutions for Full-Fidelity Telemetry

You don’t have to choose between capturing every signal and keeping costs predictable. Modern observability stacks blend full-fidelity storage (time series or columnar systems like ClickHouse and Apache Druid), tail-based sampling for heavy traffic, and tiered storage (hot/warm/cold with S3-backed archives). This gives you full-fidelity incident forensics with the day-to-day cost profile of a sampled setup.

Mezmo + Catchpoint deliver observability SREs can rely on

For SREs juggling multiple services, third-party dependencies, and constant alerts, a critical service slowdown can quickly turn into chaos. APM Dashboards may show everything is fine, yet users are still experiencing problems. That gap—between application telemetry and real-world performance—can turn a five-minute fix into a two-hour war room. ‍

Introducing Bits AI SRE, your AI on-call teammate

Bits AI SRE is your AI on-call teammate, built to autonomously investigate alerts and coordinate incident response. Integrated with Datadog, Slack, GitHub, Confluence, and more, Bits analyzes telemetry, reads documentation, and reviews recent deployments to determine the root cause of alerts—often before you’ve even opened your laptop. In fact, if you're using Datadog On-Call, you can view Bits’s findings right from your phone—so you’re always one step ahead, no matter where you are.

Top 7 Observability Platforms That Auto-Discover Services

You can use an observability platform that automatically discovers your services and provides ready-to-use dashboards with minimal setup. If you're running a system where microservices come and go, containers shift around, or serverless functions scale up quickly, this kind of experience saves you a lot of time. You gain visibility as soon as something goes live, without requiring any additional steps on your part. In this blog, we talk about the top seven platforms that offer these capabilities.