The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Improvements to our status pages as we tackle a DDoS

The uptime & availability of our status pages hasn't been great these past few days. The root cause is a persistent and pretty aggressive DDoS attack targeted at our own status page, status.ohdear.app. As a result, the overload on our systems also affected all other status pages we host for clients. We're not yet at GitHub or Claude levels of uptime sadness, but this isn't acceptable to us. In this post, I'll share what's happening and what steps we've already taken.

You Are Building With AI. Who Is Watching What It Ships?

AI coding assistants have made it possible for a single developer to build and ship a production application in a weekend. Claude Code, Cursor, GitHub Copilot, and similar tools can scaffold a Rails app, write the models, generate the views, wire up the API, and push to production before Monday. This is genuinely exciting. It is also genuinely dangerous if you do not have monitoring in place before you ship.

Best APM for Small Development Teams in 2026

Last updated: May 2026

If your team is 2 to 20 developers and you do not have dedicated DevOps, SRE, or platform engineering, most APM tools were not built for you. They were built for the team that has them: a team with specialists who can tune dashboards, configure alerting pipelines, manage data retention policies, and explain the monitoring system to everyone else. You do not have that team. You have developers who also handle deploys, on-call, and debugging production issues between writing features.

Get deeper insights with historical outage reports

StatusGator now includes a new Outage Reports tab on the service monitor detail page, giving users more visibility into recent service disruptions directly where they monitor services. Users can now quickly review recent outage activity for a specific monitored service without leaving the detail page.

Cloud Outage History: Six Years of Recurring Failures

Cloud infrastructure has never been more reliable in theory. In practice, the last six years of cloud outage history have delivered some of the most disruptive incidents on record. Not because cloud providers got worse, but because the systems built on top of them got larger, more interconnected, and more brittle in ways that don't show up until everything breaks at once.

What Is an Incident Commander? Role, Skills, and Best Practices

The fastest incident response teams treat coordination as a craft. Someone owns the call, drives the decisions, and keeps everyone moving in the same direction while the team puts the system back together. That person is the incident commander (IC), and getting the role right is what separates your 15-minute fix from a four-hour war room where nobody’s sure who’s making the call.

What Is APM? A Guide to Application Performance Monitoring

A well-instrumented service tells your on-call engineer which deploy broke checkout, which span ate the latency budget, and which line to revert before the support queue fills up. Getting there depends on how cleanly your application performance monitoring layer turns telemetry into answers. The sections ahead walk through how APM works, the metrics and components worth tracking, the cloud-native challenges at scale, and how to evaluate APM tooling against your real workload.

From Monitoring to Observability: How DEX Integrations Strengthen IT Visibility and User Productivity

When I started working in IT in the late '90s, IT performance was always measured by the health of infrastructure: CPU utilization, network latency, server uptime. For many organizations, little has changed in the 30+ years since. We became very good at keeping systems alive, yet users still struggled to get work done. That disconnect is exactly why Digital Employee Experience (DEX) has emerged as a critical discipline. But DEX on its own is not the end goal.

Honeycomb Innovation Week: Debugging Agentic Workflows with Ken Rimple

Canvas skills are how your team's runbooks and tribal knowledge become an active part of the investigation instead of a document someone has to remember to open. Pre-built skills cover the most common investigation patterns out of the box. Custom skills let you encode the specific context, thresholds, and decision logic your team has accumulated, so every auto-investigation starts with your best thinking already applied.