Monthly Archive

Top 15 Application Performance Metrics for Developers and SREs in 2026

Jan 30, 2026 By Mohana Ayeswariya J In Atatus

Every application tells a story of user intent, system behavior, and business impact. To truly understand how your application performs, you need to go beyond logs and errors. You need metrics that provide actionable visibility across your stack. Application performance metrics are the foundation for delivering high-quality digital experiences, and they empower DevOps teams, developers, engineers, and site reliability engineers (SREs) to respond faster, scale smarter, and continuously improve.

Keeping Frontier Models Reliable at Mistral AI with Rootly

Jan 30, 2026 By Rootly In Rootly

AI SRE in Practice: Resolving Node Termination Events at Scale

Jan 25, 2026 By Itiel Shwartz In Komodor

When a node terminates unexpectedly in a Kubernetes cluster, the immediate symptoms are obvious. Workloads restart elsewhere, services experience partial outages, and alerts fire across multiple systems. The harder question is why it happened and how to prevent it from recurring. This scenario walks through a node termination event where the entire node pool was affected, requiring investigation across infrastructure layers to identify root cause and implement lasting remediation.

Stop Flying Blind: Synthetic Monitoring, Host heat-maps, and Process-Level Visibility

Jan 23, 2026 By Nishant Modak In Last9

January 2026 Release: Here's a dirty secret about observability: most teams find out about outages from their customers. Not from their dashboards. Not from their alerts. From angry tweets and support tickets. The excuse is always the same: "We have metrics! We have dashboards! We even have that AI thing now!" And yet, somehow, your checkout endpoint has been returning 502s for forty-five minutes and you're learning about it from the VP of Sales who just got off a call with your biggest customer.

The SRE Report 2026: Defensible Ns

Jan 22, 2026 By Leo Vasiliou In Catchpoint

You shouldn’t have to understand the care behind this report, unless it’s missing. For the past eight years, this research has focused on all things related to reliability and resilience. How systems behave under stress. How teams respond when things break. And how the practices continue to evolve. Reaching the eighth edition of The SRE Report attests to that and gives me pause. You can read the full report here and you can find a summary of the key findings here.

SRE Report 2026: What surprised us, what didn't, and why the gaps matter most

Jan 22, 2026 By Denton Chikura In Catchpoint

This is the eighth edition of the SRE Report. Eight years of tracing reliability's arc, from uptime obsession to experience, from toil to intelligence, from systems to people. This year's report is also the first since Catchpoint joined LogicMonitor. We want to acknowledge their support in keeping this work going. They get what this report means to the reliability community, and that matters. We made a deliberate choice this year to say less.

AI SRE Update: Your Feedback Shaped Our Latest Release

Jan 20, 2026 By Mezmo In Mezmo

A note from Lauren Nagel, Mezmo's VP of Product: At Mezmo, we believe the best observability tools aren't just built for users, they're built with them. Since the launch of Mezmo's AI SRE agent, we've listened and learned from our customers. The feedback and insights have been invaluable in helping our teams refine and enhance the experience. Today, we're excited to share our latest release, packed with improvements and powerful new capabilities that make our AI SRE even faster and more intuitive.

High Cardinality Metrics: How Prometheus and ClickHouse Handle Scale

Jan 19, 2026 By Aditya Godbole In Last9

TL;DR: Prometheus pays cardinality costs at write time (memory, index). ClickHouse pays at query time (aggregation memory). Neither is "better": they fail differently. Design your pipeline knowing which failure mode you're accepting. Every month, someone posts "just use ClickHouse for metrics" or "Prometheus can't handle scale." Both statements contain a kernel of truth wrapped in dangerous oversimplification.
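As a rough illustration of why cardinality becomes a write-time cost, the sketch below estimates the worst-case number of distinct series for a single metric as the product of the distinct values per label. The function name `series_count` and all label names and counts are hypothetical, chosen only for illustration; they do not come from the post, and real series counts are usually lower because not every label combination occurs.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of distinct series for one metric.

    Assumes every combination of label values can occur, so the
    bound is the product of the per-label distinct-value counts.
    """
    return prod(label_cardinalities.values())


# Hypothetical label set for a request-latency metric.
labels = {"service": 50, "endpoint": 20, "status_code": 5}
base = series_count(labels)

# Adding one unbounded label (here a hypothetical user_id with
# 10,000 distinct values) multiplies the worst case by that count.
with_user = series_count({**labels, "user_id": 10_000})

print(base)       # 5000
print(with_user)  # 50000000
```

The point of the sketch is the multiplication: a single high-cardinality label scales the series count by its number of distinct values, and that growth has to be absorbed somewhere in the pipeline.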

AI SRE in Practice: Diagnosing Configuration Drift in Deployment Failures

Jan 18, 2026 By Itiel Shwartz In Komodor

Deployments fail for dozens of reasons. Most of them are obvious from the error messages or pod events. But when a deployment rolls out successfully according to Kubernetes but your application starts experiencing latency spikes and error rate increases, the investigation becomes significantly harder. This scenario walks through a configuration drift incident where the deployment appeared healthy but available replicas were constantly flapping, creating cascading reliability issues.

Democratizing Reliability: Giving Non-Engineers Real Operational Power with Dileshni Jayasinghe

Jan 14, 2026 By Rootly In Rootly

Many companies don’t invest in incident management until something goes wrong. commonsku took a different path. In this episode of Humans of Reliability, Sylvain sits down with Dileshni Jayasinghe, VP of Technology at commonsku, to talk about what it really takes to introduce incident management in a mature, profitable SaaS that had never formalized it. From rolling out observability and incident tooling to practicing internal status updates before going public, Dileshni shares how her team built the right muscles before they were forced to.

How we built an AI SRE agent that investigates like a team of engineers

Jan 12, 2026 By Daniel Shan In Datadog

We built Bits AI SRE to help engineers investigate and solve production incidents, one of the most difficult aspects of operating distributed systems today. As environments grow more dynamic and complex, resolving issues becomes more challenging. Failures now span more services, involve noisier signals, and encompass larger volumes of telemetry data, making it hard for on-call engineers to find root causes quickly. Today, Bits AI SRE is already helping teams decrease time to resolution by up to 95%.

AI SRE in Practice: Resolving GPU Hardware Failures in Seconds

Jan 11, 2026 By Itiel Shwartz In Komodor

When a pod fails during a TensorFlow training job, the investigation usually starts with the obvious questions. The answers rarely come quickly, especially when the failure involves GPU hardware that most engineers don’t troubleshoot regularly. This scenario walks through an actual GPU hardware failure and shows how AI-augmented investigation changes both the time to resolution and the expertise required to handle it.

When is it ok or not ok to trust AI SRE with your production reliability?

Jan 8, 2026 By Ilan Adler In Komodor

There’s a moment every engineer knows. An AI suggests a fix, it looks reasonable, maybe even obvious, but production is on the line and you hesitate before clicking execute. There’s a big difference between an AI that can recommend an action and one you’re willing to let take that action. All it takes is one bad call, one kubectl command that makes things worse, and suddenly every automated suggestion is a potential liability instead of a help.

From Promise to Practice: What Real AI SRE Can Actually Do When Production Breaks

Jan 4, 2026 By Itiel Shwartz In Komodor

We’ve written before about the advantages of training an AI SRE on real telemetry data rather than generic Kubernetes documentation. We’ve explained why RAG augmentation based on actual high-scale workload patterns produces better results than LLMs trained on generic scenarios or forum threads. The theory makes sense, the architecture is sound, and the approach is defensible.

Podman vs Docker 2026: Security, Performance & Which to Choose

Jan 2, 2026 By Anjali Udasi In Last9

When it comes to containerization technologies, Podman and Docker are the two giants that often come up in conversation. Both have revolutionized how we build, deploy, and manage containers, but what sets them apart? In this blog, we'll dive deep into a side-by-side comparison of Podman and Docker. We'll cover everything from architecture to security, performance, and compatibility.

Datadog Pricing 2026: Full Cost Breakdown + How to Save 40-90%

Jan 2, 2026 By Anjali Udasi In Last9

When it comes to monitoring and observability tools, Datadog is often one of the first names that comes to mind. But while Datadog’s features are widely discussed, its pricing often remains a topic of confusion. How much does Datadog cost, and what factors influence your bill? This guide breaks down Datadog pricing to help you better understand its structure, hidden nuances, and whether it’s the right fit for your needs.
