
The latest News and Information on Service Reliability Engineering and related technologies.

How Prometheus 3.0 Fixes Resource Attributes for OTel Metrics

Jul 28, 2025 By Anjali Udasi In Last9

When you export OpenTelemetry metrics to Prometheus, resource fields like service.name or deployment.environment don’t show up as metric labels. Prometheus drops them. To use them in queries, you’d have to join with target_info, which makes filtering and grouping more difficult than necessary. Prometheus 3.0 changes that. It supports resource attribute promotion, which automatically converts OpenTelemetry resource fields into Prometheus labels.
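The difference between the two approaches can be sketched in a few lines of Python. This is only an illustrative in-memory model, not Prometheus code: the function names, the dictionary layout, and the dot-to-underscore renaming are assumptions made for the example.

```python
# Hypothetical model of series and target_info entries as plain dicts.

def join_with_target_info(series, target_info, wanted):
    """Query-time approach: copy the wanted labels from the target_info
    entry that shares the same job and instance labels."""
    index = {(t["job"], t["instance"]): t for t in target_info}
    joined = []
    for s in series:
        key = (s["labels"]["job"], s["labels"]["instance"])
        info = index.get(key)
        if info is None:
            continue  # no matching target_info entry: the series is left out
        extra = {k: info[k] for k in wanted if k in info}
        joined.append({"labels": {**s["labels"], **extra}, "value": s["value"]})
    return joined


def promote_resource_attributes(resource, labels, promote):
    """Ingestion-time approach: copy selected resource attributes onto the
    metric's own labels (dots replaced with underscores, an assumption here)."""
    promoted = {k.replace(".", "_"): v for k, v in resource.items() if k in promote}
    return {**labels, **promoted}
```

With promotion, the attribute is already a label on every series, so a query can filter or group on it directly instead of repeating the join each time.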


OTel Weaver: Consistent Observability with Semantic Conventions

Jul 28, 2025 By Anjali Udasi In Last9

Deploying a new service shouldn’t break dashboards. But it happens, usually because metric names or labels aren’t consistent across teams. You end up with traces that don’t link, metrics that don’t align, and queries that take hours to debug, not because the system is complex, but because the telemetry is fragmented. OTel Weaver addresses this by enforcing OpenTelemetry semantic conventions at the source.


How sum_over_time Works in Prometheus

Jul 25, 2025 By Faiz Shaikh In Last9

The sum_over_time() function in Prometheus gives you a way to aggregate counter resets, gauge fluctuations, and histogram samples across specific time windows. Instead of seeing point-in-time values, you get the cumulative total of all data points within your chosen range—useful for calculating totals from rate data, tracking accumulated errors, or understanding resource consumption patterns over custom intervals.
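As a rough illustration of the idea, the following Python sketch sums the sample values that fall inside a lookback window. It is a simplified model, not Prometheus's implementation: the function signature, the `(timestamp, value)` tuples, and the choice of a left-open, right-closed window are assumptions made for the example.

```python
def sum_over_time(samples, eval_time, window):
    """Sum the values of samples whose timestamp t satisfies
    eval_time - window < t <= eval_time.

    samples is a list of (timestamp_seconds, value) tuples.
    Returns None when no sample falls inside the window.
    """
    start = eval_time - window
    in_range = [value for t, value in samples if start < t <= eval_time]
    return sum(in_range) if in_range else None


# Four samples scraped every 15 seconds (made-up values).
samples = [(0, 1.0), (15, 2.0), (30, 4.0), (45, 8.0)]

# A 30-second window evaluated at t=45 covers the samples at t=30 and t=45.
total = sum_over_time(samples, eval_time=45, window=30)
```

The result depends on how many samples land in the window, so the same series scraped at a different interval would give a different total.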


Use Telegraf Without the Prometheus Complexity

Jul 24, 2025 By Anjali Udasi In Last9

Every system needs observability. You need to know what your CPU, memory, disk, and network are doing, and maybe keep an eye on database query latency or Redis connection counts. But setting that up isn’t always simple. You start with a couple of shell scripts. Then come exporters. Then Prometheus. Before long, you’re managing scrape configs, tuning retention, and watching dashboards fail under load after two days of data.


Introducing Bits AI SRE, your AI on-call teammate

Jul 23, 2025 By Datadog In Datadog

Bits AI SRE is your AI on-call teammate, built to autonomously investigate alerts and coordinate incident response. Integrated with Datadog, Slack, GitHub, Confluence, and more, Bits analyzes telemetry, reads documentation, and reviews recent deployments to determine the root cause of alerts—often before you’ve even opened your laptop. In fact, if you're using Datadog On-Call, you can view Bits’s findings right from your phone—so you’re always one step ahead, no matter where you are.


Ship Confluent Cloud Observability in Minutes

Jul 22, 2025 By Anjali Udasi In Last9

You're running Kafka on Confluent Cloud. You care about lag, throughput, retries, and replication. But where do you see those metrics? Confluent gives you metrics, sure, but not all in one place. Some live behind a metrics API, others behind Connect clusters or Schema Registries. You either wire them manually or give up. What if you could stream those metrics to a platform built for high-frequency, high-cardinality time series, and do it in minutes?


Latest research from Meta AI, MedRAX, and Rootly AI

Jul 22, 2025 By Rootly In Rootly


Monitor Nginx with OpenTelemetry Tracing

Jul 21, 2025 By Prathamesh Sonpatki In Last9

At 3:47 AM, your NGINX logs show a 500 error. Around the same time, your APM flags a spike in API latency. But what's the root cause, and why is it so hard to correlate logs, traces, and metrics? When API response times cross 3 seconds, identifying whether the slowdown is at the NGINX layer, the application, or the database shouldn't require guesswork. That's where OpenTelemetry instrumentation for NGINX becomes essential.


How to Set Up Real User Monitoring

Jul 21, 2025 By Anjali Udasi In Last9

Synthetic monitoring provides consistent, repeatable results: 2.1s load times, passing Lighthouse scores, and minimal variability. But those numbers reflect lab conditions. On slower networks, like 3G in Southeast Asia, real users may see much higher load times of 5.8s or more. This isn’t a fault of the tools. It’s a difference in testing context. Synthetic tests run on fast machines, stable connections, and clean environments.


Risk Register for SREs: A Practical Guide to Proactive Incident Prevention

Jul 18, 2025 By Nuno Tomas In isDown

A risk register is one of the most powerful tools in an SRE's arsenal for maintaining system reliability. By systematically documenting potential threats to your infrastructure and services, you can shift from reactive firefighting to proactive risk management.
