
Whose Fault Is It When the Cloud Fails? Does It Matter?

On Monday, October 20th, a significant portion of the digital services we use every day became inaccessible. For hours, banking, communication, and entertainment applications were unavailable. The root cause was later identified as a major outage within Amazon Web Services (AWS), the infrastructure that powers a vast number of online services. For any business affected by such an event, the initial response is a frantic effort to diagnose the problem. Is it our application? Is our network down?

Your Root Cause Analysis is Flawed by Design

There’s a nagging feeling of déjà vu that haunts every network operations leader. You invest significant time and resources to resolve a major performance issue. Your best engineers isolate a culprit—a misbehaving load balancer, perhaps—and after a frantic effort, service is restored. You close the ticket, confident the problem is solved. Then, two weeks later, it’s back.

5 Best Practices for Incorporating AI Into Your Team

Honeycomb’s Jessica Kerr and Fred Hebert recently hosted a webinar with Courtney Nash of The VOID where they dug into one of the biggest questions in tech right now: How do we build systems (and teams) that actually learn with AI, not just use it? The conversation was surprisingly optimistic about what happens when we stop treating AI as a productivity tool and start seeing it as a teammate. You can watch the full webinar here, or read on below for a quick recap.

APM in 2026: The New Standard for Business Reliability and Growth

Global IT spending is expected to reach a record $6.08 trillion by 2026, with software investments growing by 15.2%. This shows how critical application performance has become for businesses today. For almost 80% of companies, even one hour of downtime can cost more than $300,000. In a world where every digital experience affects your revenue and brand reputation, keeping your applications performing well is no longer optional.

Sidecar or Agent for OpenTelemetry: How to Decide

Getting telemetry out of a distributed system isn’t the hard part. Getting it out cleanly, without noise, drop-offs, or odd performance side-effects — that’s where things get interesting. Before you worry about processors or storage costs, you need a clear plan for where the OTel Collector should run. Most teams narrow this down to two options: a sidecar that sits next to each service, or a node-level agent that handles data for everything running on the node. Both patterns are solid.
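In practice, the two patterns often differ from the application's point of view only in where it sends its OTLP data: a sidecar Collector shares the pod's network namespace, so localhost always works, while a node-level agent listens on the node's IP, which Kubernetes can inject into the pod via the Downward API. The sketch below illustrates that difference; the NODE_IP variable name is an assumption (you would wire it up yourself from status.hostIP in your pod spec), not a standard OpenTelemetry convention.

```python
import os

def collector_endpoint() -> str:
    """Pick an OTLP/gRPC endpoint based on how the Collector is deployed.

    Node agent: NODE_IP is assumed to be injected via the Kubernetes
    Downward API (fieldRef: status.hostIP), so we target the node.
    Sidecar: no NODE_IP is set, and localhost works because the
    Collector container shares the pod's network namespace.
    """
    node_ip = os.environ.get("NODE_IP")
    host = node_ip if node_ip else "localhost"
    # 4317 is the default OTLP/gRPC port the Collector listens on.
    return f"http://{host}:4317"
```

The resulting value would typically be handed to the SDK via the standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable, which keeps the application code itself identical across both deployment patterns.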

From Error to Insight: Our Brand Refresh

Software teams do their best work when they can move quickly without losing control. That reality has shaped how our product has evolved, and it needed to shape how our brand shows up too. Our refresh is not a new coat of paint. It is an honest reflection of what Rollbar is today and where we are going: a code-first observability platform that helps builders understand what is happening in their code and why, so every release is better than the last.

Sovereignty over silence: Why Microsoft's data opacity is the real lock-in

The refusal by Microsoft to detail data flows to Police Scotland confirms the real price of hyperscale: control is an illusion. This incident isn't the problem. It's the proof. It proves the need for a new standard in cloud computing, one that prioritizes true digital sovereignty and architectural transparency. Sovereignty, after all, means that the customer can exercise control over the IT resources they use.

Playwright Check Suites Are Now GA - But What Does That Mean For You?

There are only a few companies that successfully invest in actively monitoring real user flows in production. I've been puzzled by the state of the art for many years, because I'm an anxious developer who always needs to know that production is "all right". How can it be okay for all of us to wait for error logs, thrown exceptions, or customer complaints to learn about production issues?