The latest news and information on Site Reliability Engineering and related technologies.

Database Sharding: How It Works and When You Actually Need It

Feb 21, 2026 By Prathamesh Sonpatki In Last9

How database sharding works, common strategies (hash, range, directory), shard key selection, and the operational cost of running a sharded database in production. Prathamesh Sonpatki works as an evangelist at Last9, runs SRE stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of terms related to observability.

Database Performance Tuning: A Practical Guide for Production Systems

Feb 20, 2026 By Preeti Dewani In Last9

Tune PostgreSQL and MySQL for production with connection pooling, memory configuration, write path optimization, vacuum management, and lock contention fixes. Preeti Dewani is a Technical Product Manager at Last9.

Traces Are Not Your Business Logic

Feb 19, 2026 By Mukta Aphale In Last9

Distributed traces track how your system processed a single request — not what your customers did over time. Confusing the two leads to poorly instrumented systems.

SQL Query Optimization: Techniques That Actually Improve Performance

Feb 19, 2026 By Sahil Khan In Last9

Find and fix slow SQL queries using execution plans, missing index detection, N+1 pattern fixes, and pagination strategies for PostgreSQL and MySQL. Sahil Khan is a Product Marketing Manager.

Database Indexing: How It Works, Types, and When to Use It

Feb 18, 2026 By Faiz Shaikh In Last9

How database indexes work, when to use B-tree vs hash indexes, clustered vs non-clustered indexes, and how to tell if your indexes are actually helping.

Code Is Cheap, Reliability Isn't: Owning Production in the AI era w/ Swizec Teller

Feb 16, 2026 By Rootly In Rootly

In this episode, Swizec Teller, author of the bestselling Scaling Fast, makes a bold claim: code is cheap, reliability is not. As AI coding tools accelerate feature development, the real competitive advantage shifts to operating systems reliably in production. We explore the hidden complexity of SRE work, the addictive nature of agentic coding, and why ownership — not automation — remains at the core of modern software engineering.

SRE Report: AI optimism and the economics of effort

Feb 10, 2026 By Denton Chikura In Catchpoint

For eight years, the survey behind the SRE Report has used a consistent methodology. That consistency allows us to track how reliability work evolves over time, rather than relying on snapshots. One of the most stable questions in the survey asks respondents to estimate how much of their work, on average, is spent on toil. Between 2020 and 2024, responses showed a gradual decline in reported toil.

Reference architecture: The blueprint for safe and scalable autonomy in SRE and DevOps

Feb 9, 2026 By Leah Wessels In iLert

Everyone wants autonomous incident response. Most teams are building it wrong. The ultimate goal of autonomy in SRE and DevOps is the capacity of a system to not only detect incidents but to resolve them independently through intelligent self-regulation. However, true autonomy isn't born from automating random, isolated tasks. It requires a stable foundation: a Reference Architecture.

AI SRE in Practice: Tracing Policy Changes to Widespread Pod Failures

Feb 9, 2026 By Itiel Shwartz In Komodor

Policy changes in Kubernetes are supposed to improve security, enforce standards, or optimize resource usage. But when a policy change triggers cascading pod failures across multiple namespaces, the investigation becomes a race to identify what changed before more workloads are affected.

The AI-Empowered Site Reliability Engineer: Automating the Balance of Risk and Velocity

Feb 5, 2026 By Udi Hofesh In Komodor

You might expect an AI-SRE agent to target 100% reliable services, ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a non-linear cost: maximizing stability limits how fast new features can be developed, dramatically increases the operational cost, and reduces the features a team can afford to offer.
