
Silent Failures: Why AI Code Breaks in Production

You ship a small “safe” change on Friday. The diff is tiny, the tests are green, and the AI assistant was confident. An hour after deploy, your on-call channel lights up. A downstream service is rejecting responses that look fine in code review. Now you’re rolling back and rewriting a fix that should have been obvious if you had real traffic in the loop. This isn’t a hypothetical.

AWS CloudFront Outage (Feb 2026): Timeline, Cascade, and Lessons

At approximately 9:15 PM UTC on February 10, 2026, Amazon CloudFront began returning NXDOMAIN responses for DNS queries against specific distributions. In practical terms: DNS was telling users that services behind those distributions simply didn't exist. The root cause was a DNS resolution failure within CloudFront's infrastructure that quickly spread to eight interconnected AWS services.

Sponsored Post

From cloud costs to cloud value: The role of performance analytics in increasing ROI

Many cloud providers offer services that scale with usage. However, unanticipated overutilization of compute instances, serverless functions, or managed databases can quickly drive up costs. Managing these resources effectively is crucial for keeping cloud spending predictable.

Sponsored Post

Kubernetes Load Testing Made Easy with Speedscale

Everybody knows that working with Kubernetes is hard. It's complicated, you have to know your way around YAML files, and there's a lot to deal with: the classic developer experience with YAML. But what if you could get complete visibility into your Kubernetes workloads and run realistic load tests without touching a single YAML file or running kubectl commands? In this walkthrough, I'll show you how Speedscale makes Kubernetes observability and performance testing as simple as point-and-click.

Kubernetes Network Observability: Comparing Calico, Cilium, Retina, and Netobserv

Calico, Cilium, Retina, and Netobserv: which observability tool is right for your Kubernetes cluster? Network observability is a tale as old as the OSI model itself. Anyone who has managed a network, or even a Kubernetes cluster, knows the feeling: a service suddenly can’t reach its dependency, a pod is mysteriously offline, and the Slack alerts start rolling in. Investigating network connectivity issues in these complex, distributed environments can be incredibly time-consuming.

The Future of StatusPal: Classic and Next

Over the past few years, StatusPal has been the product that teams rely on to communicate clearly during incidents and maintenance. It’s the product our customers use today, and it remains central to how we support critical communication. We want to share how we’re thinking about the future of StatusPal, what this means for the product you’re using today, and how a newer version we’re building fits into the picture.

VictoriaMetrics at FOSDEM, Cloud Native Days France, and CfgMgmtCamp Ghent

Last week, members of the VictoriaMetrics team, including myself, spoke at three very different but equally important community events: FOSDEM in Brussels, Cloud Native Days France in Paris, and CfgMgmtCamp in Ghent. Each event drew a different crowd with its own expectations, making them a good way to see where open source observability stands today and how VictoriaMetrics is adapting to real-world needs. The talks we gave were snapshots of the problems we are actively working on.

How to run checks on internal services with Grafana Cloud Synthetic Monitoring

Many critical services run inside private networks, where traditional monitoring tools and practices can’t offer full visibility. This makes it difficult to validate service availability and performance before problems impact your users. Synthetic Monitoring — a Grafana Cloud solution that helps you proactively monitor the performance of your applications and services — addresses this gap with a feature known as private probes.