
Building Trust in the Machine: A Guide to Architecting Agentic AI for SRE

The promise of Artificial Intelligence in Site Reliability Engineering (SRE) is seductive: an autonomous system that never sleeps, instantly detects anomalies, and fixes broken infrastructure while humans focus on high-value work. However, the gap between a demo-ready chatbot and a production-grade Autonomous AI SRE is vast. In complex, noisy environments like Kubernetes, a “naive” implementation of Large Language Models (LLMs) is not just ineffective; it can be dangerous.

Komodor AI SRE vs. OSS AI Agent: A Technical Comparison of Agentic AI for Kubernetes Troubleshooting

Gartner predicts that AI agents will be implemented in 60% of all IT operations tools by 2028, up from fewer than 5% at the end of 2024. This acceleration has sparked an explosion of AI SRE solutions, from enterprise platforms to open-source alternatives, all promising faster root cause analysis and reduced MTTR.

How Cisco Revolutionized Platform Engineering with Komodor's Agentic AI

In the world of cloud-native infrastructure, complexity is the silent killer of innovation. For Cisco Outshift, the company’s incubation engine, managing a sprawling environment of AWS EKS clusters and edge-based MicroK8s workloads created a classic bottleneck: the Platform Engineering team was drowning in toil. Facing SRE burnout and the limits of human scaling, Cisco embarked on an ambitious journey to evolve its internal operations from standard DevOps to Agentic AI.

#051 - Surviving the Shift: From Legacy Monoliths to Day 2 Chaos with Hayato Shimizu (Digitalis)

From the early days of "neural nets" and WebSphere to the modern complexities of Kubernetes, Hayato Shimizu has seen the evolution of infrastructure firsthand. In this episode of Kubernetes for Humans, the co-founder of Digitalis joins the show to discuss the harsh realities of enterprise platform engineering and his personal journey from corporate employee to consultancy owner.

AI SRE in Practice: Resolving Node Termination Events at Scale

When a node terminates unexpectedly in a Kubernetes cluster, the immediate symptoms are obvious. Workloads restart elsewhere, services experience partial outages, and alerts fire across multiple systems. The harder question is why it happened and how to prevent it from recurring. This scenario walks through a node termination event where the entire node pool was affected, requiring investigation across infrastructure layers to identify root cause and implement lasting remediation.
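Investigations like this usually begin with the node's conditions. A minimal sketch of that first triage step, assuming condition data shaped like the `status.conditions` block of `kubectl get node -o json` (the sample data below is invented, not from a real incident):

```python
# Classify a likely node-termination cause from Kubernetes node conditions.
# Real data would come from `kubectl get node <name> -o json` or the API;
# the sample below is hypothetical.

def likely_termination_cause(conditions):
    """Return a human-readable hypothesis from a list of node conditions."""
    by_type = {c["type"]: c for c in conditions}
    ready = by_type.get("Ready", {})
    if ready.get("status") == "Unknown":
        # Kubelet stopped posting status: node was likely preempted or lost.
        return "node unreachable (possible spot preemption or hardware loss)"
    for cond in ("MemoryPressure", "DiskPressure", "PIDPressure"):
        if by_type.get(cond, {}).get("status") == "True":
            return f"resource exhaustion: {cond}"
    return "no obvious cause in node conditions; check cloud-provider events"

sample = [
    {"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown"},
    {"type": "MemoryPressure", "status": "False"},
]
print(likely_termination_cause(sample))
```

When a whole node pool is affected, the same check across every node quickly separates an infrastructure-layer cause from an isolated hardware fault.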

[Webinar] Building Quality-Driven Agentic AI in Noisy Big Data Environments

Watch Itiel Shwartz, Komodor CTO and Co-Founder, share hard-won lessons from developing an AI SRE agent that processes millions of K8s events daily to deliver autonomous troubleshooting that reached 95%+ accuracy in benchmarking. This webinar covers building production-ready systems that maintain reliability when 90% of your data is noise, and how Komodor developed the agent behind those results.

AI SRE in Practice: Diagnosing Configuration Drift in Deployment Failures

Deployments fail for dozens of reasons. Most of them are obvious from the error messages or pod events. But when a deployment rolls out successfully according to Kubernetes but your application starts experiencing latency spikes and error rate increases, the investigation becomes significantly harder. This scenario walks through a configuration drift incident where the deployment appeared healthy but available replicas were constantly flapping, creating cascading reliability issues.
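Flapping availability is exactly what a point-in-time `kubectl get deploy` check misses: each sample may look healthy even while the replica count oscillates. A hedged sketch of surfacing it by counting state transitions over a window (the replica series below is invented):

```python
# Detect "flapping" available-replica counts that a single point-in-time
# status check would miss. The sample series is invented for illustration.

def count_flaps(available_replicas, desired):
    """Count transitions between fully-available and degraded states."""
    states = [n >= desired for n in available_replicas]
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)

series = [3, 3, 2, 3, 1, 3, 3, 2, 3]  # availableReplicas sampled over time
flaps = count_flaps(series, desired=3)
if flaps > 2:
    print(f"deployment reports healthy now, but availability flapped {flaps} times")
```

A high transition count over a short window is a useful trigger to start comparing the live configuration against what was actually deployed.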

AI SRE in Practice: Resolving GPU Hardware Failures in Seconds

When a pod fails during a TensorFlow training job, the investigation usually starts with the obvious questions. The answers rarely come quickly, especially when the failure involves GPU hardware that most engineers don’t troubleshoot regularly. This scenario walks through an actual GPU hardware failure and shows how AI-augmented investigation changes both the time to resolution and the expertise required to handle it.
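One signal most engineers don't look for in this situation is an NVIDIA Xid error in the node's kernel log, which typically indicates a GPU driver or hardware fault. A small sketch of scanning for them (the log lines here are invented examples; real ones would come from `dmesg` on the node or a tool like DCGM):

```python
import re

# Scan log lines for NVIDIA Xid errors, a common signal of GPU hardware
# faults. The sample log lines are invented; real ones would come from
# the node's kernel log (dmesg) or NVIDIA DCGM.

XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:[0-9a-f:.]+\): (\d+)")

def find_xid_errors(log_lines):
    """Return the list of Xid error codes found in the given log lines."""
    codes = []
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes

logs = [
    "NVRM: Xid (PCI:0000:3b:00): 79, GPU has fallen off the bus.",
    "kubelet: pod trainer-0 container terminated",
]
print(find_xid_errors(logs))
```

Xid 79 in particular means the GPU dropped off the PCIe bus, which points the investigation at hardware rather than at the TensorFlow job itself.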

When is it ok or not ok to trust AI SRE with your production reliability?

There’s a moment every engineer knows. An AI suggests a fix, it looks reasonable, maybe even obvious, but production is on the line and you hesitate before clicking execute. There’s a big difference between an AI that can recommend an action and one you’re willing to let take that action. All it takes is one bad call, one kubectl command that makes things worse, and suddenly every automated suggestion is a potential liability instead of a help.
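One common way to bridge the gap between recommending and acting is an explicit approval gate: the agent can always propose, but only pre-approved, low-blast-radius actions execute without a human. A minimal sketch of that pattern (the action catalog and risk tiers are illustrative assumptions, not any particular product's implementation):

```python
# Approval-gated action execution: low-risk actions run automatically,
# everything else is returned as a recommendation awaiting a human.
# The action names and risk tiers below are illustrative assumptions.

SAFE_ACTIONS = {"restart_pod", "cordon_node"}        # low blast radius
APPROVAL_REQUIRED = {"delete_pvc", "scale_to_zero"}  # needs a human

def handle_suggestion(action, execute_fn, approved=False):
    """Execute only safe or explicitly approved actions; otherwise recommend."""
    if action in SAFE_ACTIONS or (action in APPROVAL_REQUIRED and approved):
        execute_fn(action)
        return "executed"
    return "recommended: awaiting human approval"

executed = []
print(handle_suggestion("restart_pod", executed.append))
print(handle_suggestion("delete_pvc", executed.append))
print(handle_suggestion("delete_pvc", executed.append, approved=True))
```

Note the deny-by-default shape: an action the catalog has never seen is never executed, even with approval, which is the property that keeps one bad call from becoming a pattern.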

From Promise to Practice: What Real AI SRE Can Actually Do When Production Breaks

We’ve written before about the advantages of training an AI SRE on real telemetry data rather than generic Kubernetes documentation. We’ve explained why RAG augmentation based on actual high-scale workload patterns produces better results than LLMs trained on generic scenarios or forum threads. The theory makes sense, the architecture is sound, and the approach is defensible.