Monthly Archive

[Webinar] Conquering the Complexity of Self-Hosted Apps with Agentic AI SRE

Feb 26, 2026 By Komodor In Komodor

Most enterprise SaaS products, like Komodor’s Autonomous AI SRE Platform, require installing a remote agent on the customer’s infrastructure, which varies significantly from one organization to another in architecture, configurations, permissions, processes, and more. This “unmanaged” model creates major blind spots, making daily operations, observability, debugging, and incident response challenging. When failures occur, limited visibility and bespoke systems make root-cause analysis slow, incomplete, or impossible.

When AI Writes the Code, Who Keeps Production Running?

Feb 23, 2026 By Ilan Adler In Komodor

The production environment has become a minefield of code nobody really understands. Here’s what’s happening: Development teams are using Claude Code, Cursor, and GitHub Copilot to ship features at 10x their previous velocity. Product managers are ecstatic. Business stakeholders are thrilled. And somewhere in a war room at 2:17 AM, an SRE is staring at a stack trace for code that was AI-generated three weeks ago, trying to figure out why the payment service just fell over.

AI SRE in Practice: Accelerating Engineer Onboarding with Contextual Expertise

Feb 22, 2026 By Itiel Shwartz In Komodor

Onboarding new engineers to complex Kubernetes environments is expensive. Junior engineers need to learn cluster architecture, understand organizational conventions, navigate internal documentation, and build relationships with senior team members who can answer questions. The process takes weeks or months, and during that time, senior engineers spend significant time mentoring instead of working on complex problems.

AI SRE in Practice: Diagnosing AWS CNI IP Exhaustion Before Widespread Outage

Feb 16, 2026 By Itiel Shwartz In Komodor

IP address exhaustion in Kubernetes doesn’t announce itself with clear error messages. Pods fail to schedule, services degrade unpredictably, and the symptoms look like a dozen different problems before anyone realizes the cluster has run out of available IP addresses. By the time the root cause becomes clear, multiple services are affected and recovery requires coordination across infrastructure layers.

#053 - The Road to Distributed AI and Kubernetes Infrastructure with Matt Butcher (Fermyon) & Ari...

Feb 13, 2026 By Komodor In Komodor

They share their professional origins, highlighting how Kubernetes transitioned from a complex tool for experts to a foundational technology for global enterprises. Part of the conversation focuses on the history of Helm, explaining its growth from a simple hackathon project into a standard package manager. Another part takes on the future of distributed computing, specifically how Akamai is integrating infrastructure as a service to support modern workloads.

AI SRE in Practice: Tracing Policy Changes to Widespread Pod Failures

Feb 9, 2026 By Itiel Shwartz In Komodor

Policy changes in Kubernetes are supposed to improve security, enforce standards, or optimize resource usage. But when a policy change triggers cascading pod failures across multiple namespaces, the investigation becomes a race to identify what changed before more workloads are affected.

The AI-Empowered Site Reliability Engineer: Automating the Balance of Risk and Velocity

Feb 5, 2026 By Udi Hofesh In Komodor

You might expect an AI-SRE agent to target 100% reliable services, ones that never fail. It turns out that past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a non-linear cost: maximizing stability limits how fast new features can be developed, dramatically increases the operational cost, and reduces the features a team can afford to offer.

From Blueprint to Production: Building a Kubernetes MCP Server

Feb 5, 2026 By Nir Adler In Komodor

As Large Language Models (LLMs) evolve from simple chatbots into agentic workflows, the need for a standardized way to connect them to external data and infrastructure has become critical. In a recent workshop hosted by Nir Adler, Innovation Engineer at Komodor, we explored how to bridge this gap using the Model Context Protocol (MCP).

#052 - The "Short Long Path": Mastering Abstraction, Culture, and Kubernetes Scale with Shemer M...

Feb 4, 2026 By Komodor In Komodor

In this episode, Itiel joins forces with Shemer, Director of Platform Solutions at the gaming giant Playtika, and Scott Rosenberg, Lead Architect at TeraSky, to discuss the realities of platform engineering at massive scale. The trio dissects Playtika’s multi-year journey from a legacy, homegrown Kubespray infrastructure to a modern, holistic platform built on Spectro Cloud, all while running strictly on-premises to support 25+ games and high-volume traffic.

Building Trust in the Machine: A Guide to Architecting Agentic AI for SRE

Feb 4, 2026 By Itiel Shwartz In Komodor

The promise of Artificial Intelligence in Site Reliability Engineering (SRE) is seductive: an autonomous system that never sleeps, instantly detects anomalies, and fixes broken infrastructure while humans focus on high-value work. However, the gap between a demo-ready chatbot and a production-grade Autonomous AI SRE is vast. In complex, noisy environments like Kubernetes, a “naive” implementation of Large Language Models (LLMs) is not just ineffective; it can be dangerous.

Komodor AI SRE vs. OSS AI Agent: A Technical Comparison of Agentic AI for Kubernetes Troubleshooting

Feb 2, 2026 By Nir Adler In Komodor

Gartner predicts that AI agents will be implemented in 60% of all IT operations tools by 2028, up from fewer than 5% at the end of 2024. This acceleration has sparked an explosion of AI SRE solutions, from enterprise platforms to open-source alternatives, all promising faster root cause analysis and reduced MTTR.
