Monthly Archive

Lightrun Launches Industry's First AI SRE With Live Dynamic Runtime Context

Feb 25, 2026 By Lightrun In Lightrun

Autonomously Remediates Software Issues, Generates Missing Runtime Evidence on Demand, and Validates Hypotheses Against Live Execution from Code to Production.

Best Incident Management Software for Engineering Teams (2026)

Feb 23, 2026 By Sahil Khan In Last9

Compare 9 incident management tools: PagerDuty, Opsgenie, Incident.io, Rootly, FireHydrant, BetterStack, Grafana OnCall, Squadcast, and Last9. Features, pricing, and which fits your team.

Sahil Khan is a Product Marketing Manager.

AI SRE in Practice: Accelerating Engineer Onboarding with Contextual Expertise

Feb 22, 2026 By Itiel Shwartz In Komodor

Onboarding new engineers to complex Kubernetes environments is expensive. Junior engineers need to learn cluster architecture, understand organizational conventions, navigate internal documentation, and build relationships with senior team members who can answer questions. The process takes weeks or months, and during that time, senior engineers spend significant time mentoring instead of working on complex problems.

Database Partitioning: Types, Strategies, and When to Use Each

Feb 22, 2026 By Prathamesh Sonpatki In Last9

How database partitioning works in PostgreSQL and MySQL. Range, list, and hash partitioning with SQL examples and guidance on when to partition vs shard.

Prathamesh works as an evangelist at Last9, runs SRE stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of terms related to observability.

Database Sharding: How It Works and When You Actually Need It

Feb 21, 2026 By Prathamesh Sonpatki In Last9

How database sharding works, common strategies (hash, range, directory), shard key selection, and the operational cost of running a sharded database in production.

Database Performance Tuning: A Practical Guide for Production Systems

Feb 20, 2026 By Preeti Dewani In Last9

Tune PostgreSQL and MySQL for production with connection pooling, memory configuration, write path optimization, vacuum management, and lock contention fixes.

Preeti Dewani is a Technical Product Manager at Last9.

Traces Are Not Your Business Logic

Feb 19, 2026 By Mukta Aphale In Last9

Distributed traces track how your system processed a single request — not what your customers did over time. Confusing the two leads to poorly instrumented systems.

SQL Query Optimization: Techniques That Actually Improve Performance

Feb 19, 2026 By Sahil Khan In Last9

Find and fix slow SQL queries using execution plans, missing index detection, N+1 pattern fixes, and pagination strategies for PostgreSQL and MySQL.

Database Indexing: How It Works, Types, and When to Use It

Feb 18, 2026 By Faiz Shaikh In Last9

How database indexes work, when to use B-tree vs hash indexes, clustered vs non-clustered indexes, and how to tell if your indexes are actually helping.

Code Is Cheap, Reliability Isn't: Owning Production in the AI era w/ Swizec Teller

Feb 16, 2026 By Rootly In Rootly

In this episode, Swizec Teller, author of the bestselling Scaling Fast, makes a bold claim: code is cheap, reliability is not. As AI coding tools accelerate feature development, the real competitive advantage shifts to operating systems reliably in production. We explore the hidden complexity of SRE work, the addictive nature of agentic coding, and why ownership — not automation — remains at the core of modern software engineering.

SRE Report: AI optimism and the economics of effort

Feb 10, 2026 By Denton Chikura In Catchpoint

For eight years, the survey behind the SRE Report has used a consistent methodology. That consistency allows us to track how reliability work evolves over time, rather than relying on snapshots. One of the most stable questions in the survey asks respondents to estimate how much of their work, on average, is spent on toil. Between 2020 and 2024, responses showed a gradual decline in reported toil.

Reference architecture: The blueprint for safe and scalable autonomy in SRE and DevOps

Feb 9, 2026 By Leah Wessels In iLert

Everyone wants autonomous incident response. Most teams are building it wrong. The ultimate goal of autonomy in SRE and DevOps is the capacity of a system to not only detect incidents but to resolve them independently through intelligent self-regulation. However, true autonomy isn't born from automating random, isolated tasks. It requires a stable foundation: a Reference Architecture.

AI SRE in Practice: Tracing Policy Changes to Widespread Pod Failures

Feb 9, 2026 By Itiel Shwartz In Komodor

Policy changes in Kubernetes are supposed to improve security, enforce standards, or optimize resource usage. But when a policy change triggers cascading pod failures across multiple namespaces, the investigation becomes a race to identify what changed before more workloads are affected.

The AI-Empowered Site Reliability Engineer: Automating the Balance of Risk and Velocity

Feb 5, 2026 By Udi Hofesh In Komodor

You might expect an AI-SRE agent to target 100% reliable services, ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a non-linear cost: maximizing stability limits how fast new features can be developed, dramatically increases the operational cost, and reduces the features a team can afford to offer.

Building Trust in the Machine: A Guide to Architecting Agentic AI for SRE

Feb 4, 2026 By Itiel Shwartz In Komodor

The promise of Artificial Intelligence in Site Reliability Engineering (SRE) is seductive: an autonomous system that never sleeps, instantly detects anomalies, and fixes broken infrastructure while humans focus on high-value work. However, the gap between a demo-ready chatbot and a production-grade Autonomous AI SRE is vast. In complex, noisy environments like Kubernetes, a “naive” implementation of Large Language Models (LLMs) is not just ineffective, it can be dangerous.

Komodor AI SRE vs. OSS AI Agent: A Technical Comparison of Agentic AI for Kubernetes Troubleshooting

Feb 2, 2026 By Nir Adler In Komodor

Gartner predicts that AI agents will be implemented in 60% of all IT operations tools by 2028, up from fewer than 5% at the end of 2024. This acceleration has sparked an explosion of AI SRE solutions, from enterprise platforms to open-source alternatives, all promising faster root cause analysis and reduced MTTR.
