The latest news and information on Site Reliability Engineering and related technologies.

How SRE Practices Improve Trust in Digital Finance and Healthcare Platforms

Jul 3, 2026 By OpsMatters In OpsMatters

Trust used to be a brand problem. Now it's an uptime problem, a latency problem, a data integrity problem, and sometimes a "why is the payment button spinning again?" problem. For digital finance and healthcare platforms, users don't separate the service from the system behind it. If the app fails, the business feels careless. If records lag, confidence drops. If a transaction disappears for even a few seconds, panic arrives fast.

Could vs. Should: The First Year Managing an SRE Team

Jul 2, 2026 By Reid Savage In Honeycomb

As of today, I’ve drafted this post upwards of 10 times – it’s old enough that the version I first started working on was called “Reflections on 1 Year of SRE Management” (I’m currently at 2.5 years). But everything I learned during that first year became critical for the next.

How QA engineers use AI to keep up with agentic development

Jun 26, 2026 By Rootly In Rootly

QA Lead at Rootly explains how she's embraced AI to keep up with the pace of AI-driven feature development.

It's always DNS, even at Cisco: behind a weeks-long incident

Jun 26, 2026 By Rootly In Rootly

SRE Lead Ricard Bejarano (Cisco) and Jorge Lainfiesta (Rootly) sit down to talk about a recent intermittent incident that had the team scratching their heads.

High Cardinality in ClickHouse at Scale: What Actually Breaks

Jun 25, 2026 By Prathamesh Sonpatki In Last9

ClickHouse swallows high-cardinality telemetry at ingest, then breaks at query time weeks later. Here is what fails, and how we keep it fast in production. Prathamesh works as an evangelist at Last9, runs SRE Stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of terms related to observability.

Klaudia Under the Hood: How We Built an AI SRE That Actually Earns Trust

Jun 18, 2026 By Asaf Savich In Komodor

In reliability engineering, being ‘mostly right’ is a liability. An AI SRE that sometimes misses the root cause or gives a confident, wrong answer at 2:17 AM has no place in an enterprise cloud environment. In this context, silence is better than noise. That’s the bar Klaudia is built to clear: genuine reliability that you can trust in production. The kind of reliability that earns a place alongside your best engineers. Getting there requires more than just a capable model.

Why API Reliability Is Critical to Modern Finance

Jun 17, 2026 By OpsMatters In OpsMatters

Financial APIs power payments, compliance, and customer services. Learn why observability, monitoring, and API reliability are vital to resilience.

ClickHouse LowCardinality: When It Helps and When It Hurts

Jun 15, 2026 By Prathamesh Sonpatki In Last9

ClickHouse LowCardinality cuts storage and speeds up queries on low-cardinality columns, but backfires on trace IDs. How to tell the difference.

Introducing the Rootly Agent

Jun 11, 2026 By Rootly In Rootly

During an incident, ask the Rootly Agent anything and it'll respond (and act) based on context and your data. The Rootly Agent performs actions on your behalf, so it is bound by the permissions assigned to your user. It will also ask for confirmation before taking significant actions. Rootly admins can turn it on for their workplaces and start running incidents even more efficiently.

Should platform, SRE, and security merge into one function?

Jun 4, 2026 By Cristina Buenahora In Cortex

Platform, SRE, and security are three distinct functions in modern engineering orgs, each shaped by a different problem. SRE was the operations function's answer to scale: how to keep systems reliable when the systems get big. Platform answered a different problem: how to let developers ship without becoming infrastructure experts. Security drew the line on what could safely reach production.