The latest News and Information on Containers, Kubernetes, Docker and related technologies.

Kubernetes Monitoring: Datadog Alert to Lightrun Root Cause

Datadog Kubernetes monitoring tells an SRE team what failed, which pod failed, and when. It does so within seconds of the alert firing. The investigation then stalls at the same point every time: nothing in the dashboard layer can prove why a specific request behaved the way it did inside a running JVM at the moment of failure. Variable values, feature flag evaluations, and code branches are never captured.

The cloud bill explained: A guide for finance and engineering

The cloud bill arrives at the end of every month, and somewhere in it sits a line item that nobody outside the infrastructure team really understands. It might be called "data transfer," "egress," or "outbound bandwidth," and it might be 5% of the total or even 25%. Whatever it is, it tends to be the line that finance asks engineering about, and engineering struggles to explain in a way that finance can act on. The problem is that egress is a fee that hides in plain sight. It's not on the marketing page.

Why developer teams are rethinking their cloud provider this year

The default cloud choice for technically literate teams has shifted. It hasn't shifted dramatically; the major hyperscalers aren't going anywhere, and their enterprise position is still strong, but the conversation that used to start with "which hyperscaler" now genuinely starts with "what do we actually need." That's new.

How to monitor and optimize GPU utilization in the cloud

GPU utilization is one of the most expensive metrics in cloud infrastructure to get wrong. A GPU running at 30% utilization costs the same as one running at 90%, but it's doing a third of the useful work. For workloads measured in tens of thousands of GPU-hours, the difference between average utilization in the 30s and average utilization in the 70s is hundreds of thousands of dollars across the life of the workload.
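The arithmetic behind that comparison can be sketched in a few lines. This is a rough illustration only: the hourly rate and workload size below are assumed placeholder figures, not numbers from the article, and the helper names are hypothetical.

```python
def cost_per_useful_gpu_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of one fully utilized GPU-hour at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return hourly_rate / utilization


def billed_hours_for_work(useful_gpu_hours: float, utilization: float) -> float:
    """Billed GPU-hours needed to deliver a fixed amount of useful work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return useful_gpu_hours / utilization


if __name__ == "__main__":
    HOURLY_RATE = 2.50          # assumed on-demand price per GPU-hour, in dollars
    USEFUL_GPU_HOURS = 20_000   # assumed amount of useful work the job requires

    for utilization in (0.30, 0.70, 0.90):
        billed = billed_hours_for_work(USEFUL_GPU_HOURS, utilization)
        total = billed * HOURLY_RATE
        effective = cost_per_useful_gpu_hour(HOURLY_RATE, utilization)
        print(
            f"utilization {utilization:.0%}: {billed:,.0f} billed hours, "
            f"${total:,.0f} total, ${effective:.2f} per useful GPU-hour"
        )
```

Because the bill scales with billed hours rather than useful hours, the effective price per useful GPU-hour is the list price divided by average utilization, which is why low utilization gets expensive quickly at scale.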

A New Console for Qovery

We rebuilt large parts of the Qovery Console: new navigation, overviews at every level, dark mode, and a modernized frontend architecture with TanStack Router and React Suspense. Rémi is a staff frontend engineer at Qovery. He writes about frontend architecture, developer experience, and building scalable UI systems for platform engineering tools. Théo is a senior product designer at Qovery.

What is Cloud Security - Explained in 5 minutes

Cloud security isn't just about locking things down; it's about staying ahead of threats in fast-moving, dynamic environments. In this video, Kat breaks down what cloud security actually means in 2024 and why traditional approaches don't cut it anymore. Whether you're securing containers, Kubernetes workloads, or multi-cloud infrastructure, this is your foundation. Subscribe for more cloud security explainers, tutorials, and best practices from Sysdig.

The next era of telco clouds: get open infrastructure choice with Sylva and Canonical Kubernetes

The telco industry is undergoing a fundamental change. Over the past few years, the increasing maturity of cloud-native infrastructure has accelerated the movement from manually operated and hardware-centric systems to automated, software-defined platforms. Underpinning this change are open source initiatives such as the Sylva project. Sylva is hosted by Linux Foundation Europe and heavily backed by major telecom operators and vendors.

#060 - Beyond ELK: Elastic's 10-Year Evolution, Open-Source Licensing, and the AI Frontier with P...

In this episode of the Kubernetes for Humans podcast, Philipp shares his 10-year journey at Elastic, during which the company grew from 300 to 4,000 employees. He also tells the origin story of how Elastic evolved from a simple recipe search project into a global powerhouse for observability, security, and vector databases.

How to run self-hosted AI on your own infrastructure with Konstruct

Civo Platform Engineer M R Rishi demonstrates how to go from zero to self-hosted AI in minutes using Konstruct. While most teams are stuck managing thousands of configuration values across multiple models and tools, Rishi shows how Konstruct eliminates that complexity with GPU cluster provisioning, GitOps catalog deployments, and production-ready infrastructure on day zero.