Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

Pastries with SREs: No compromises on cost-effective observability or donuts.

In this episode of Pastries and SREs, we dig into how vendor lock-in and sky-high observability costs are forcing teams to choose between coverage and budget, AND why you shouldn’t have to settle. With donuts in hand, we explore how to take back control of your observability strategy by making it cost-effective, comprehensive, and flexible.

How to Choose an AI SRE Solution

The AI SRE landscape has exploded over the past year, with vendors racing to add artificial intelligence capabilities to their platforms. For engineering leaders evaluating these solutions, the sheer number of options can feel overwhelming. Some vendors are building AI-native solutions from scratch, while others are retrofitting AI onto existing workflows. Cloud providers are embedding agents into their ecosystems, and observability platforms are adding intelligence layers to their telemetry data.

OpenTelemetry Metrics in Quarkus Explained

When you run services on Quarkus, you need a steady stream of signals to understand how the application behaves—CPU trends, request timings, memory patterns, and how each endpoint responds under load. Metrics give you that visibility. They help answer questions like: OpenTelemetry fits well here because it gives Quarkus a common way to generate and export metrics without locking you into a specific monitoring tool.

How Rootly works with Slack | An end-to-end demo.

Rootly is the AI-native on-call and incident management platform that helps you resolve incidents faster, improve system resilience, and streamline on-call operations. It’s your always-on SRE copilot that automates root cause analysis and identifies patterns that drive continuous improvement—trusted by thousands of companies like LinkedIn, NVIDIA, Replit, Elastic, Canva, Clay, Tripadvisor, and Grammarly.

How Prometheus Exporters Work With OpenTelemetry

Running distributed systems means you need clear visibility into how your services behave. Prometheus has been the standard for metrics for a long time, and OpenTelemetry is now giving teams a more consistent way to collect telemetry across their stack. In many setups, you'll have both: existing Prometheus instrumentation that's already in place, and new components instrumented with OpenTelemetry.

Bits AI SRE, Flex Frozen, and GPU Monitoring | DASH 2025

Get a first look at Datadog’s biggest product reveals from DASH 2025. Meet Bits AI SRE, your 24/7 autonomous AI Site Reliability Engineer, Flex Frozen for up to 7 years of managed log retention, and GPU Monitoring for full visibility into your AI workloads. Experience the future of observability in action.

What Are AI Guardrails

When you're shipping LLM features, a lot of the work goes into keeping the model's behavior predictable. You deal with questions like: These are everyday concerns when you integrate LLMs into production systems. Guardrails AI provides a Python framework that helps you enforce those expectations. You define the schema or constraints you need, and the framework validates both the inputs going into the model and the outputs coming back.

Pastries with SREs: From AIOps to GenAI and LLMs (lactose-free latte making)

In this episode of Pastries with SREs, we look at AIOps, where it fell short, where it worked, and how generative AI (GenAI) is reshaping what’s possible in observability today. We explore: If you’re wondering whether generative AI is different this time, this episode offers a grounded, practical look at how it’s evolving observability workflows.

You Can't Fix What You Don't Measure: Observability in the Age of AI with Conor Bronsdon

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Brondsdon (Galileo) to unpack why observability, monitoring, and human feedback are the missing links to make large language model (LLM) reliable in production.

Grafana Tempo: Setup, Configuration, and Best Practices

As systems grow, understanding how a request moves across multiple services becomes harder. Traces help bring this picture together by showing the exact path a request takes, along with the timings that matter. Grafana Tempo is built for this kind of workload. It stores traces efficiently, works well with OpenTelemetry, and keeps the operational overhead low.