
How Datadog Feature Flags is resilient to cloud provider failures

As major incidents like AWS’s October 2025 outage illustrate, modern systems are immensely interconnected. A failure in one can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions with other AWS services and, subsequently, thousands of applications and services that rely on that infrastructure.

When to Move From Public Internet to Private Connectivity

Struggling with latency, congestion, or compliance issues? Discover when it’s time to move from public internet to private connectivity. Network operations have never demanded more than they do now, leading many network managers to question whether the public internet is enough. While many organizations begin their network journey with VPNs over the public internet, they often bump into limitations quickly and begin exploring the natural next step – private connectivity.

AI as Monitive's CEO

I recently attended Lisbon's Web Summit conference, a 3-day event with 70,000 participants, 15 stages, and 800+ speakers. Even though there was a dedicated track called "AI Summit", all the talks were about AI and AI agents: how the future of the web, business, and the economy is increasingly AI-driven, and how businesses and people should adapt as soon as possible to an online world managed and operated by artificial intelligence.

The "Meh-trics" Reloaded: Why I Was 100% Wrong About Metrics (and Also 100% Right)

Okay, I'm going to say something that would make 2016 Charity want to throw her laptop across the room: we're making a major investment in metrics at Honeycomb. I know, I know. "But Charity, you literally called them ‘shit salad!’" I did. Also "nerfed dimensions." I said they would "fucking kneecap you." For most of the past decade, I've been social media’s most reliable anti-metrics evangelist. Have I repented? No.

Enhancements to Honeycomb Telemetry Pipeline Deliver Greater Visibility, Smarter Control, and Lower Costs

In July, we introduced powerful new Honeycomb Telemetry Pipeline features that helped teams take control of their observability data with safe sampling, flexible rehydration, and a visual pipeline builder. Since then, we’ve built on that foundation. Today, we’re introducing the latest enhancements to Honeycomb Telemetry Pipeline, which give teams deeper visibility into pipeline health, more efficient access to archived telemetry data, and reduced operational complexity.

Get more from your AI chief of staff with these prompts for engineering leaders

Engineering leaders face a constant barrage of questions that pull them away from strategic work. A team lead asks about scorecard compliance. A PM wants a status update on a migration. Someone needs incident trend data for a quarterly review. Each question is reasonable. Each requires context switching, digging through dashboards, or pinging someone on your team for a report. What if you could just ask?

KubeCon North America 2025: OpenTelemetry Recap from Atlanta

KubeCon + CloudNativeCon North America 2025 wrapped up in Atlanta last week, and it sure did feel like a big one for OpenTelemetry. Between Observability Day, the project updates, and the activity around the OpenTelemetry Observatory booth, you could feel how quickly the ecosystem is maturing.

AWS Cost Categories Explained (How To Allocate AWS Spend Accurately)

If you’ve ever tried to make sense of your AWS bill, you know how fast things get messy. Different accounts, hundreds of services, random tags, and suddenly, no one can say for sure who’s spending what or why the total looks so high. It’s not that teams don’t care about costs — it’s that AWS billing data isn’t always easy to interpret. Finance wants accountability. Engineering wants visibility. And somewhere between the two, ownership disappears.

The metrics product we built worked, but we killed it and started over anyway

Two years ago, Sentry built a metrics product that worked great on paper. But when we dogfooded it, we realized it was not what our customers really needed. Two weeks before launch, we killed the whole thing. Here’s what we learned, why classical time-series metrics break down for debugging modern applications, and how we rebuilt the system from scratch.

Making Your Business Resilient Against Cloudflare-Like Outages

Cloudflare-like outages can cost your business a significant amount of money. This week’s Cloudflare global outage is a wake-up call for business resilience. You can stay resilient against such outages by regularly performing resilience testing and updating your application or infrastructure configurations.