
AWS CloudFront Outage (Feb 2026): Timeline, Cascade, and Lessons

At approximately 9:15 PM UTC on February 10, 2026, Amazon CloudFront began returning NXDOMAIN responses for DNS queries against specific distributions. In practical terms: DNS was telling users that services behind those distributions simply didn't exist. The root cause was a DNS resolution failure within CloudFront's infrastructure that quickly spread to eight interconnected AWS services.

Silent Failures: Why AI Code Breaks in Production

You ship a small “safe” change on Friday. The diff is tiny, the tests are green, and the AI assistant was confident. An hour after deploy, your on-call channel lights up. A downstream service is rejecting responses that look fine in code review. Now you’re rolling back and rewriting a fix that should have been obvious if you had real traffic in the loop. This isn’t a hypothetical.

Top Kubernetes interview questions of 2026: A beginner's guide

Having been around for a decade, the world's most popular container orchestrator has set a standard for how we run containers at scale. According to the CNCF, cloud-native adoption has reached 98% across organizations, showing that Kubernetes adoption is not slowing down. Whether you are looking to land your first Kubernetes role or you are experienced and want to brush up on your knowledge, we’ve put together the top questions to help you learn more about Kubernetes.

A new perspective on dashboard sprawl

Dashboards are supposed to answer questions, not create more of them. But investigations don't stop at a single view. The moment you want to understand one specific thing in detail, such as a failing VM, a degraded service, or a slow pipeline, dashboards start to break down. You end up either building yet another dashboard or searching through many different ones. SquaredUp's Perspectives changes this.

A Notification List Is Not a Team

In the previous post, we looked at how alert noise is rarely accidental. It’s usually the result of sensible decisions layered over time, until responsibility becomes diffuse and response slows. One of the most persistent assumptions behind this pattern is simple. If enough people are notified, someone will take responsibility. After more than fourteen years of working with engineering teams of every size and shape, we’ve seen this assumption fail repeatedly.

Happy Birthday to Us: Honeycomb 10 Year Manifesto, Part 1

Christine and I started Honeycomb in 2016, which means it’s been ten years. Christine, a developer, and I, an operations engineer, were both profoundly unhappy with the state of the art in monitoring and logging tools. The tools we had used at Facebook didn’t spray our signals around to a bunch of siloed-off pillars. They consolidated as much context as possible so we could properly explore it, the way every other non-software engineering team already takes for granted.

Agent vs Assistant: The key distinction between Olly and the competition

The market is saturated with agents and assistants, making it difficult to tell them apart. However, the difference between these two approaches is significant. They offer radically distinct levels of impact, reflecting major differences in both their technical complexity and the quality of their inferences. Let’s figure out the distinction.

OpenTelemetry in Production: Design for Order, High Signal, Low Noise, and Survival

A lot of the talk around OpenTelemetry has to do with instrumentation, especially auto-instrumentation, and with OTel being vendor neutral, open, and a de facto standard. But how you use the final output of OTel is what makes the business difference. In other words, how do you use it to make your life as an SRE/DevOps/biz person easier? How do you have to set things up to truly solve production issues faster?

Why Monitoring Matters for Modern Hosting Platforms

With all the discussion in the dev community lately about changes made at Heroku, we wanted to use this moment to talk about PaaS (Platform as a Service) providers and how AppSignal can be a vital tool for getting the most out of your app's hosts, from optimal performance to lower usage bills.