
The "Meh-trics" Reloaded: Why I Was 100% Wrong About Metrics (and Also 100% Right)

Okay, I'm going to say something that would make 2016 Charity want to throw her laptop across the room: we're making a major investment in metrics at Honeycomb. I know, I know. "But Charity, you literally called them ‘shit salad!’" I did. Also "nerfed dimensions." I said they would "fucking kneecap you." For most of the past decade, I've been social media’s most reliable anti-metrics evangelist. Have I repented? No.

Canvas Is Now GA: AI-Guided Observability for Modern Teams

When we introduced Canvas in beta, our goal was to reimagine how teams explore and collaborate around their observability data without requiring manual querying. Canvas has quickly become the AI-guided workspace that helps teams transform raw telemetry into meaningful, shared understanding faster than ever before. And today, we’re thrilled to announce that Canvas is now Generally Available (GA) for all Honeycomb users.

AI as Monitive's CEO

I recently attended Web Summit in Lisbon, a 3-day event with 70,000 participants, 15 stages, and 800+ speakers. Even though there was a track called "AI Summit", all the talks were about AI and AI agents: how the future of the web, business, and the economy is increasingly AI-driven, and how businesses and people should adapt as soon as possible to an online world managed and operated by artificial intelligence.

How Datadog Feature Flags is resilient to cloud provider failures

As major incidents like AWS’s October 2025 outage illustrate, modern systems are immensely interconnected. A failure in one can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions with other AWS services and, subsequently, thousands of applications and services that rely on that infrastructure.

What is AWS Fargate for Amazon ECS?

As cloud applications moved from VMs to containers and then to microservices, the amount of background work needed to keep everything running grew just as quickly. You gain speed and flexibility, but you also end up managing clusters, scaling rules, and capacity choices that don’t really add to the product you’re building. AWS Fargate steps in right there. It lets you run your ECS tasks without looking after any servers at all.

OTel Updates: Complex Attributes Now Supported Across All Signals

OpenTelemetry now supports maps, heterogeneous arrays, and byte arrays across all signals. Here’s where these new types shine — and where simple primitives still fit naturally. If you’ve been working with OpenTelemetry for a while, you’re likely familiar with the straightforward key-value approach to attributes. It’s simple, fast, and works well with how most telemetry backends store, index, and query data.
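
To make the contrast concrete, here is a minimal sketch in plain Python (not the OpenTelemetry SDK) of what teams have typically done when only primitive key-value attributes were available: flatten nested maps and arrays into dotted keys. The function name `flatten_attributes`, the dotted-key convention, and the hex encoding of bytes are all assumptions made for illustration, not part of any OpenTelemetry API.

```python
from typing import Any, Dict


def _join(prefix: str, part: Any) -> str:
    """Build a dotted key, skipping the dot when there is no prefix yet."""
    return f"{prefix}.{part}" if prefix else str(part)


def flatten_attributes(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Flatten nested maps, arrays, and bytes into primitive key-value pairs.

    Hypothetical helper: the key format and hex encoding are illustrative
    choices, not an OpenTelemetry convention.
    """
    flat: Dict[str, Any] = {}
    if isinstance(value, dict):
        # Maps become one dotted key per leaf value.
        for key, item in value.items():
            flat.update(flatten_attributes(item, _join(prefix, key)))
    elif isinstance(value, (list, tuple)):
        # Arrays (including mixed-type ones) become indexed keys.
        for index, item in enumerate(value):
            flat.update(flatten_attributes(item, _join(prefix, index)))
    elif isinstance(value, (bytes, bytearray)):
        # Byte arrays are stored as a hex string so they stay primitive.
        flat[prefix] = bytes(value).hex()
    else:
        flat[prefix] = value
    return flat


# A nested attribute set of the kind complex attribute types can carry directly.
nested = {
    "http": {"method": "GET", "headers": {"accept": "text/html"}},
    "retries": [1, "timeout", 2.5],
    "payload": b"\x00\xff",
}

flat = flatten_attributes(nested)
for key, val in flat.items():
    print(f"{key} = {val!r}")
```

The flattened form is easy for a backend to index and query, but the original structure (which keys belonged to one map, which values formed one array) has to be reconstructed from key names. That trade-off is the one the new complex types are meant to address.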

Navigating External Outages: How Selector Cuts Through the Cloudflare Noise

Yesterday’s widespread Cloudflare outage reminds us how crucial external dependencies are to the stability of our own applications. When a key edge provider like Cloudflare goes down, the impact on your internal monitoring systems can look like a catastrophic internal system failure, triggering a massive storm of alerts and sending engineering teams into frantic, misdirected debugging sessions.

OnlineOrNot's lessons from Cloudflare's outage on 2025-11-18

On 2025-11-18 at 11:48 UTC, Cloudflare declared an incident affecting the global network (that also affected OnlineOrNot). OnlineOrNot monitors websites, APIs, web apps, and cron jobs, while providing status pages as well. While we partially mitigated the issue by enabling a fallback to AWS-based monitoring, between 13:00 UTC and 14:33 UTC failing checks went unreported, heartbeat checks over-reported, and status pages were unavailable.

AI-Suggested Alert Thresholds for Mobile Telemetry

Life is pretty good. I’ve shipped a mobile app and I’m (happily) drowning in telemetry. Battery impact, time in foreground/background per screen, crash rates, slow frames, network retries – the works. The data is brilliant; the challenge is turning signals into reliable alerts that catch real issues which are relevant to my app’s functions. So… what should I actually listen for, and where should I set the thresholds?

Outage map now available in your StatusGator board

We’re excited to introduce a helpful new update to your StatusGator experience – the service outage map is now built directly into your StatusGator account. StatusGator has displayed outage heatmaps on our public website’s service landing pages. These maps helped users understand where issues were being reported across the globe. Now, we’ve taken that same valuable visibility and placed it inside your board.