Monthly Archive

How to Test SQS Workflows Locally with LocalStack and OpenTelemetry

Apr 30, 2026 By Prathamesh Sonpatki In Last9

LocalStack lets you run SQS, Lambda, and S3 locally in Docker — but there's a hidden trap: OpenTelemetry's default AWS propagator doesn't work with free LocalStack. Here's how to set up end-to-end local testing with working trace propagation. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

How to use an SRE agent to reduce downtime

Apr 30, 2026 By Sam Chun In PagerDuty

An alert in the middle of the night warns of a potential business failure. Manual incident response becomes more complex due to the overwhelming data from distributed and dynamic digital services. With an SRE agent, your engineering team can cut through alert clutter. They can sort through various signals quicker, decreasing burnout and achieving faster, more affordable resolutions. Operational resilience will see its next evolution with Agentic AI.

End-to-End Trace Propagation Across SQS and Lambda with OpenTelemetry

Apr 29, 2026 By Prathamesh Sonpatki In Last9

SQS doesn't propagate trace context automatically. You instrument both sides, deploy, and get two disconnected traces. This post shows how to wire them into one waterfall — and the ESM format gotcha that silently breaks it every time.

last9-genai: Closing the Conversation Gap in LLM Observability

Apr 28, 2026 By Prathamesh Sonpatki In Last9

OpenTelemetry's GenAI instrumentation gives you spans and token counts. It does not give you conversations, workflow cost rollups, or prompts visible in your dashboard. last9-genai is an OTel extension that fills those three gaps — without replacing your existing observability stack.

How to Exclude Health Check Endpoints from Python OTel Traces

Apr 28, 2026 By Prathamesh Sonpatki In Last9

Health check endpoints generate thousands of identical, useless spans per day. Here are two production-ready approaches to filter them from your Python OTel traces — and the correctness trap most implementations miss.

Argo Rollouts Canary Monitoring: Metrics, Gotchas, and Automated Gates with Last9

Apr 27, 2026 By Prathamesh Sonpatki In Last9

Argo Rollouts exposes Prometheus metrics on port 8090 — but the docs lie about which labels exist. Here's how to scrape them into Last9, build a canary dashboard, and use Last9 as an automated AnalysisTemplate gate, including the auth and base64 gotchas.

SRE agent vs. traditional engineer: 7 key differences

Apr 27, 2026 By Sam Chun In PagerDuty

The role of a Site Reliability Engineer (SRE) is evolving. The focus has shifted away from simply working harder during an outage. A new kind of teammate is here to help: the SRE agent. But what are the key differences when you compare an SRE agent versus a traditional site reliability engineer? This isn’t just a superficial change. It signifies a fundamental alteration in how teams construct and sustain dependable services.

What is AI SRE? The Complete Guide to AI-Assisted Site Reliability Engineering

Apr 26, 2026 By Prathamesh Sonpatki In Last9

It's 2:47 AM. PagerDuty fires. You open a Slack alert and see: p99 latency spike on checkout-service. You SSH into the host, check dashboards in four tabs, grep logs for the last 20 minutes, and eventually find a slow query introduced in a deploy six hours ago. It took 34 minutes. You resolved it, w

Capturing HTTP Request and Response Bodies in .NET Traces with PHI Redaction

Apr 25, 2026 By Prathamesh Sonpatki In Last9

Standard OTel .NET instrumentation captures headers, status codes, and timing — not request or response bodies. Here's how to add body capture to your traces while keeping PHI out of your observability backend.

Fixing Broken Traces in GCP Cloud Run: A Custom OpenTelemetry Propagator

Apr 24, 2026 By Prathamesh Sonpatki In Last9

GCP's load balancer silently rewrites your traceparent header, orphaning spans in any OTLP backend. Here's the custom propagator that fixes it.

How it feels to run an incident with AI SRE

Apr 23, 2026 By Article In Incident.io

We've been building the broader incident.io platform for several years now, and one thing we've learned is that UX matters more here than almost anywhere else. When an incident fires, there's no room for poorly designed interfaces or fumbling through features you haven't touched in a while. The product has to be ergonomic: easy to pick up, easy to navigate, with the right things at your fingertips at exactly the right moment. We've put a lot of effort into this over the last 5 years.

Why Your PromQL Availability Query Returns Nothing When Services Are Healthy

Apr 23, 2026 By Prathamesh Sonpatki In Last9

Your SLI query shows 100% availability as No Data. Here's why PromQL returns empty results instead of zero — and the label-preserving fix.

From GPU Silicon to Business Metrics: The 8 Layers of GPU Observability

Apr 21, 2026 By Shekhar In Last9

GPU observability isn't one thing - it's eight connected layers from silicon to cost. See why correlation across layers is what cuts debugging from 2 hours to 2 minutes, and why most teams instrument only one or two.

Instrumenting WordPress with OpenTelemetry: PHP Tracing, Browser RUM, and Error Capture in Production

Apr 21, 2026 By Prathamesh Sonpatki In Last9

WordPress powers 40% of the web but has no native observability story. Here's how to instrument it end-to-end with OpenTelemetry - PHP, browser RUM, and errors.

10,000 GPUs, One TSDB: Cardinality at GPU Scale

Apr 21, 2026 By Shekhar In Last9

1,000 nodes × 8 GPUs × 60 metrics = 1.4M time series - before you add pod names or Slurm job IDs. GPU monitoring is a cardinality problem disguised as a metrics problem. How to design for it before production OOMs your Prometheus.

How to solve key site reliability engineering challenges

Apr 20, 2026 By Lightrun Team In Lightrun

Modern site reliability engineering challenges stem from the difficult requirement of confirming why complex systems fail in ways staging cannot replicate. While observability tools signal failures, and AI SREs reason over data, they leave observability gaps regarding the actual state of running code. By utilizing runtime context, teams capture live execution data to accelerate production debugging, resolving incidents in minutes without requiring manual redeploy cycles.

The GPU Metrics That Actually Matter

Apr 20, 2026 By Shekhar In Last9

Most teams monitor three GPU metrics - utilization, temperature, memory. There are 50+ that matter, and the ones you skip cause your worst outages. A vendor-neutral guide across NVIDIA, AMD, and Intel Gaudi.

Your LLM Is Slower Than You Think

Apr 19, 2026 By Shekhar In Last9

60% GPU utilization and 3-second response times? GPU utilization is the wrong signal for LLM inference. Here's why TTFT, KV-cache pressure, and queue depth - not utilization - predict user-facing latency.

Predicting GPU Failures Before They Cost You

Apr 18, 2026 By Shekhar In Last9

Predict GPU hardware failures 48–72 hours in advance. A guide to the five rate-based signals — ECC error trends, XID events, thermal ramp, row remap exhaustion, PCIe downtraining — and how to combine them into a composite health score.

Every Token Has a Price: Per-Request GPU Cost Attribution

Apr 17, 2026 By Shekhar In Last9

Flat per-token pricing is wrong by 10–50× per request. Prefill vs decode, batch sharing, and cache effects break the math. How to attribute real GPU cost - compute, energy, and dollars - to each inference request.

In the Age of AI, Taste Isn't About Aesthetics

Apr 16, 2026 By Rootly In Rootly

AI can generate a UI in seconds. So what do designers actually bring to the table? Marcela, Principal Product Designer at Rootly and former Founding Designer at Ramp, has spent 20 years in design. Her answer: taste isn't about aesthetics or crafting pleasant interactions. It's about asking the uncomfortable questions, and choosing the right problem, not the easiest one.

What Is an AI SRE? And Why Do They Need Live Runtime Evidence?

Apr 15, 2026 By Lightrun Team In Lightrun

AI SREs are autonomous systems that handle incident triage, root cause analysis, and remediation by correlating logs, metrics, traces, and code signals. However, as they rely on pre-configured telemetry, the critical execution details of a specific failure, such as variable state and code paths, can often be missed. As a result, they either force users into manual redeploy loops or make inferences from partial data, diagnosing issues using probability rather than proof.

Site Reliability Engineering (SRE) 101: Everything You Need to Know

Apr 15, 2026 By Eric Minick In Harness

A single second of latency can cost e-commerce sites millions in revenue, while just minutes of downtime trigger customer churn that takes months to recover. Modern users expect instant responses and seamless experiences, making reliability a competitive feature that directly impacts business outcomes. Site Reliability Engineering treats operations as a software problem rather than a manual discipline. SRE applies engineering principles to achieve measurable reliability through automation.

Top 6 AI SRE Tools and Why Runtime-Grounded Reliability Is the New Standard

Apr 13, 2026 By Lightrun Team In Lightrun

AI SRE tools accelerate incident detection, root cause analysis, and remediation across distributed production systems. They ingest telemetry signals, including logs, metrics, traces, alerts, and deployment history, to correlate anomalies, narrow fault domains, and reduce manual triage. This guide breaks down the top AI SRE tools in 2026 and helps you choose the right one based on your team’s biggest bottleneck, whether that is faster triage, deeper root cause analysis, or runtime-level validation.

Komodor Provides Autonomous AI SRE Troubleshooting for ClusterAPI

Apr 9, 2026 By Asaf Savich In Komodor

Cluster API (CAPI) is transforming how organizations deploy and manage fleets of Kubernetes clusters by introducing declarative, Kubernetes-style APIs to automate cluster provisioning and lifecycle management. While CAPI excels at creating consistent and repeatable cluster deployments across different infrastructure providers, operating it at a massive scale introduces unique day-to-day challenges.

AI Didn't Change the Game, It Just Exposed Your Bottlenecks w/ Ganesh Datta (CTO, Cortex)

Apr 9, 2026 By Rootly In Rootly

Every engineering org says they want to improve reliability — but most can't even agree on what "good" looks like. Ganesh Datta, Co-Founder and CTO of Cortex, has spent the better part of a decade helping companies confront that gap.

KubeCon + CloudNativeCon EU 2026: What We Learned About AI, Observability, and Fast Feedback Loops

Apr 3, 2026 By Abdullah Chowdhury In Honeycomb

Honeycomb was excited to attend KubeCon + CloudNativeCon Europe, where one theme stood out across sessions: as AI reshapes how software is built and run, teams are being pushed to rethink how they understand their systems. Without strong observability and feedback loops, AI can accelerate confusion, misalignment, and operational risk.
