The latest News and Information on Containers, Kubernetes, Docker and related technologies.

How to build sustainable AI infrastructure on GPU cloud

AI's environmental cost is real, and it's growing. Training a large language model can consume the electricity of hundreds of households for weeks. Inference at production scale runs continuously, with GPU clusters drawing power around the clock. The data centers that house all of this are some of the most concentrated energy consumers in the modern technology stack.

Platform engineering unplugged: What nobody tells you about platform engineering at scale

Most platform engineering stories are told in hindsight, with the rough edges smoothed out. On June 17th, we are doing it differently. Join us for Platform Engineering Unplugged, a frank conversation with a practitioner who has navigated the real challenges of building and scaling platform engineering: what worked, what didn't, and what they would do differently. If you lead engineering teams and are thinking seriously about platform engineering, this is the session for you.

How to build a secure AI agent sandbox with relaxAI and Claude Code

AI agents are powerful. They're also unpredictable, non-deterministic, and capable of doing things you didn't ask them to do, as the Rome Alibaba and Claude Mythos case studies make very clear. The answer isn't to avoid agentic AI. It's to run it properly. In this demo, Ben Norris, founding engineer at relaxAI, shows how to build a fully sandboxed AI agent environment from scratch: an ephemeral Civo VM provisioned via Terraform and GitHub Actions, locked down with egress policies, an unprivileged Linux user, and hard resource caps, and running a Claude Code session pointed at the relaxAI API.

Klaudia Under the Hood: How We Built an AI SRE That Actually Earns Trust

In reliability engineering, being ‘mostly right’ is a liability. An AI SRE that sometimes misses the root cause or gives a confident, wrong answer at 2:17 AM has no place in an enterprise cloud environment. In this context, silence is better than noise. That’s the bar Klaudia is built to clear: genuine reliability that you can trust in production. The kind of reliability that earns a place alongside your best engineers. Getting there requires more than just a capable model.

Lock-in is not theoretical: What UK organizations told us about cloud exit barriers

For years, vendor lock-in has been discussed as a theoretical risk. A concern to acknowledge in architecture reviews. A box to tick in compliance frameworks. A future problem that might need addressing. Our latest research reveals something more urgent. For UK organizations, lock-in isn't theoretical anymore. It's structural. It's measurable. And it's preventing organizations from acting on their own strategic priorities.

Why We Built Lynx: Bringing Control to the Age of AI Agents

For a decade, one idea has guided everything we’ve built at Tigera: how do you use a programmatic approach to secure a dynamic, rapidly changing system with many moving parts? Calico has applied that idea for Global 2000 companies running the largest Kubernetes platforms in the world, securing tens of millions of mission-critical transactions every day. Today I’m excited to announce the next chapter of that work: Lynx, a unified control plane for Kubernetes-native AI agents.

Kubernetes Monitoring: Datadog Alert to Lightrun Root Cause

Datadog Kubernetes monitoring tells an SRE team what failed, which pod failed, and when. It does so within seconds of the alert firing. The investigation then stalls at the same point every time: nothing in the dashboard layer can prove why a specific request behaved the way it did inside a running JVM at the moment of failure. Variable values, feature flag evaluations, and code branches are never captured.