Monthly Archive

Why High-Cardinality Metrics Break Everything

Dec 31, 2025 By Prathamesh Sonpatki In Last9

High-cardinality metrics are one of those ideas that sound obviously right - until you try to use them in production. In theory, they promise precision. Instead of averages and rollups, you get specificity: per-request, per-userid, per-container, per-feature insights. The kind of detail we all immediately want when something is on fire. And then things start breaking. Not immediately. Not loudly. But quietly.

7 Kubernetes Predictions for 2026 - AI Will Push SRE to its Limit

Dec 29, 2025 By Itiel Shwartz In Komodor

As AI workloads shift from training to massive-scale inference, SRE teams are about to feel even more pressure. GPU-heavy computing is breaking the assumptions today’s clusters were built on, while enterprises are beginning to trust autonomous operations and cost pressure is pushing consolidation across the cloud-infrastructure stack.

Blameless Postmortem: Foundation of Site Reliability

Dec 23, 2025 By Nuno Tomas In isDown

When systems fail, the instinct to find someone to blame runs deep. But what if assigning fault actually makes your systems less reliable? A blameless postmortem culture transforms how teams learn from incidents, creating stronger systems and more effective incident response processes.

Platform Engineering: Error Budgets Explained Simply #shorts

Dec 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

Platform engineering provides powerful tools that handle a lot under the hood. Learn how to calculate your remaining error budget with a simple formula using real numbers and objective statements.

Implementing SLOs: Our Scale Mistakes and Successes #shorts

Dec 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

30 minutes of eating crow! Learn from our SLO mistakes at Weave. Discover pitfalls and shortcuts to doing it right the first time. Avoid our wrong, wrong, wrong, wrongs!

OpenTelemetry Metrics: Traces, Logs & Prometheus Integration #shorts

Dec 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

OpenTelemetry aims to link metrics to traces and logs, offering OpenCensus users a seamless migration path. Work with existing protocols like Prometheus. Leverage existing tooling without learning something completely new.

OpenTelemetry: Components, SDKs, and Middleware Explained #shorts

Dec 23, 2025 By Last9 - Monitoring for AI Native SDLC In Last9

OpenTelemetry explained: standards, SDKs for various languages (Ruby, Python, Go), and middleware tools. Deploy these to pre-process data and send it to your destination.

OTel Updates: OpenTelemetry Deprecates Zipkin Exporters

Dec 22, 2025 By Anjali Udasi In Last9

OpenTelemetry is deprecating the Zipkin exporter specification. Zipkin now supports OTLP ingestion natively, so the custom exporter logic in OTel SDKs is no longer necessary.

99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Not Diamond

Dec 22, 2025 By Rootly In Rootly

Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated at a pace traditional software wouldn’t. All while teams are still expected to hit reliability targets that look a lot like traditional SLAs.

Last9 integration with TrueFoundry AI Gateway

Dec 18, 2025 By Sahil Khan In Last9

If you're using TrueFoundry to manage your LLM traffic, you can now send those traces directly to Last9 and view them alongside your existing infrastructure telemetry.

How agentic IT operations lay the foundations for SRE success at scale

Dec 15, 2025 By Manish Agarwal In BigPanda

When something breaks in a modern digital service, customers feel it instantly. Pages stall, requests time out, and carts are abandoned, while frustration grows long before a root cause is identified. What the world never sees is the engineering effort required to keep these systems healthy in the first place. Site Reliability Engineers (SREs) carry that responsibility every day.

How to Handle Cloud Monitoring Overload?

Dec 12, 2025 By Anjali Udasi In Last9

Reduce alert noise by 70% through intelligent aggregation, clear ownership boundaries, and filtering metrics that don't map to user-facing issues. Monitoring starts with a straightforward goal: understand your system's health and identify issues before users notice them. You set up metrics, create dashboards, and configure some alerts. At first, it works well. Over time, your stack gets bigger and more complicated. New services get added.

The Reality of GenAI in Production with Eduardo Ordax (AWS)

Dec 12, 2025 By Rootly In Rootly

GenAI demos are easy. Production is where everything breaks. In this episode, Eduardo Ordax, Principal GTM GenAI at AWS, breaks down what actually stops companies from shipping reliable AI systems, and why the real blockers have little to do with technology.

OTel Updates: OpenTelemetry Proposes Changes to Stability, Releases, and Semantic Conventions

Dec 12, 2025 By Anjali Udasi In Last9

Over the past year, the Governance Committee ran user interviews and surveys with organizations deploying OpenTelemetry at scale. A few patterns came up consistently: Stability levels aren't always obvious. When you install an OTel distribution, some components might be experimental or alpha without clear markers. This makes it harder to evaluate what's production-ready. Instrumentation libraries sometimes wait on semantic conventions.

The War Room of AI Agents: Why the Future of AI SRE is Multi-Agent Orchestration

Dec 11, 2025 By Itiel Shwartz In Komodor

We’ve all been there. It’s 2 AM, your phone is buzzing with alerts, and you’re suddenly thrust into an incident war room with a dozen other bleary-eyed engineers. The production environment is on fire, customers are affected, and everyone’s trying to piece together what went wrong. But here’s what makes these moments fascinating from a systems perspective – it’s rarely just one person silently fixing the issue in isolation.

How to Track Down the Real Cause of Sudden Latency Spikes

Dec 9, 2025 By Anjali Udasi In Last9

Start with distributed tracing to find which service is slow, then use continuous profiling to see why the code is slow, and finally apply high-cardinality analysis to identify which users or conditions trigger the problem. It's 2 AM. Your phone buzzes. Users are reporting timeouts. The metrics dashboard shows p99 latency spiking from 200ms to 4 seconds, but everything looks normal—CPU at 60%, memory stable, no error spikes. A quick pod restart helps briefly, then latency climbs right back up.

New features: AI SRE, Merge alerts, and Status pages for thousands of services

Dec 8, 2025 By Daria Yankevich In iLert

As we head into the holiday season, the ilert team is doing the opposite of slowing down; we’re ramping up. Over the past weeks, we’ve shipped a wave of impactful improvements across alerting, AI-powered automation, mobile app, and status pages. From major upgrades that reshape how teams triage incidents to smaller refinements that remove daily friction, this release is packed with updates designed to make on-call and operations smoother, smarter, and faster. Let’s dive in.

Komodor - The Autonomous AI SRE Platform

Dec 8, 2025 By Komodor In Komodor

Komodor is the leading Autonomous AI SRE Platform for cloud native infrastructure and operations. Powered by Klaudia Agentic AI, Komodor automatically visualizes, troubleshoots, and optimizes Kubernetes-based platforms at scale.

Which Observability Tool Helps with Visibility Without Overspend

Dec 5, 2025 By Anjali Udasi In Last9

If you’re trying to control observability spend without cutting visibility, the platforms that usually offer the best cost balance at enterprise scale are Last9, Grafana Cloud, Elastic, and Chronosphere — depending on the shape of your telemetry and the level of operational ownership you want.

Bits AI SRE, our first AI agent, now generally available! #datadog

Dec 4, 2025 By Datadog In Datadog

We introduced Bits AI SRE, our first AI agent, now generally available. Across industries, customers of all sizes are already seeing faster resolution, stronger reliability, and a better on-call experience for their teams.

OTel Updates: Unroll Processor Now in Collector Contrib

Dec 4, 2025 By Anjali Udasi In Last9

Some log sources bundle multiple events into a single record before shipping them. This is common with VPC flow logs, CloudWatch exports, and certain Windows endpoint collectors. While this batching approach is efficient for transport, it creates challenges when you need to filter, search, or correlate individual events. When a log record contains an array of 47 events, your analytics tool sees one entry instead of 47 distinct records.

Cost Optimization Is Now Part of the SRE Playbook

Dec 4, 2025 By Itiel Shwartz In Komodor

In the era of cloud-native architectures, Site Reliability Engineering (SRE) has matured from a discipline focused purely on uptime to a sophisticated practice of efficient reliability. The key driver for this evolution is an undeniable truth: cloud spend has become intrinsically linked to system stability.

Gemini 3 breaks OpenAI's long-standing lead in SRE tasks

Dec 4, 2025 By Rootly In Rootly

A major shift just hit SRE-focused AI. Gemini 3 Pro edged out OpenAI’s models and outperformed them across every single SRE task we tested. In this Rootly AI Labs episode, Sylvain Kalache and Laurence Liang break down.

The hidden costs of immature incident management #sre #devops

Dec 3, 2025 By Rootly In Rootly

Learn more: https://rootly.com/blog/the-hidden-costs-of-immature-incident-management

Datadog Bits AI SRE: Your new teammate for on-call shifts

Dec 2, 2025 By Datadog In Datadog

Bits AI SRE is an always-on SRE agent built to handle complex troubleshooting and late-night alerts. Developed against thousands of real-world incidents and powered by Datadog’s platform, Bits AI SRE analyzes your entire stack, tests hypotheses, and identifies root causes in minutes. Resolve faster, get back to sleep sooner, and give your on-call team the confidence and capacity they need.

9 Monitoring Tools That Deliver AI-Native Anomaly Detection

Dec 1, 2025 By Anjali Udasi In Last9

The observability market has moved beyond manual threshold-setting. Modern platforms use statistical algorithms, machine learning, and causal AI to detect anomalies automatically. Some work immediately after deployment. Others train on your data for better accuracy. Each approach has technical trade-offs worth understanding. This guide compares how nine monitoring solutions handle automated anomaly detection and root cause analysis.
