The 9 Application Performance Metrics You Need to Measure and Why

The tension between shipping speed and application performance has not changed much since this post was first published in 2020. What has changed is how quickly a team can detect, diagnose, and fix a problem. That difference is significant enough to warrant a revisit. The scenario from the original still plays out every week. Sales brings a priority feature that might degrade performance for some customers. The developer ships it and watches what happens.

AppSignal MCP Now Supports OAuth - and GitHub Copilot

When we launched AppSignal MCP in beta, OAuth was on the roadmap but not yet shipped. We were issuing static bearer tokens — enough to connect Claude Desktop, Cursor, and Windsurf, but not the one-click install path in the MCP Registry, and not GitHub Copilot's recommended setup. That's fixed.

Why IncidentHub's Alerting is Better than Other Status Page Aggregators'

IncidentHub tracked 48,000 SaaS and cloud outages in 2025. The average organization depends on 100+ SaaS apps, making third-party vendor monitoring a crucial aspect of risk management and business continuity for almost all modern organizations. Better SaaS outage alerting is about monitoring the right parts of your third-party services, and routing alerts to the right people at the right time.

When AWS us-east-1 Fails, Much of the Internet Fails With It

There are cloud outages, and then there are us-east-1 outages. That distinction matters because failures in AWS’s Northern Virginia region rarely feel like ordinary regional incidents. They tend instead to expose something larger and more uncomfortable: too much of the modern internet still behaves as though one place is an acceptable concentration point for infrastructure, control, recovery, and communication. When us-east-1 goes wrong, the problem is not only that workloads fail.

The Shift Toward Autonomous Enterprises

In our previous post, Navigating the Complexities of Scaling AI in Enterprise Operations, we explored the “cost–human conundrum”: balancing the promise of automation against the realities of economics, skills, and governance. That discussion highlighted a critical inflection point: scaling AI is not just a technical challenge, but an organizational one.

What's New in InfluxDB 3 Explorer 1.7: Table Management, Data Import, Transforms, and More

InfluxDB 3 Explorer 1.7 is a step forward for anyone who wants to manage their time series data without constantly switching between the UI and a terminal. This release adds table-level schema management, the ability to import data from other InfluxDB instances, and a new Transform Data section to reshape your data, all within the Explorer UI.

Ephemeral Leaks and Automated BGP Route Leak Detection

Many BGP route leaks reported by automated detection systems are actually brief, low-impact artifacts of normal BGP convergence. Doug Madory examines examples from Cloudflare Radar, Routeviews, and Jared Mauch’s long-running leak detector to show how these “ephemeral leaks” arise, why they usually don’t disrupt traffic, and why they still matter for routing security.

N+1 Detection in AppSignal's OpenTelemetry Trace Timeline

N+1 query problems are one of the most common, and quietly damaging, performance issues in production applications. One extra query per record feels harmless in development. At scale, it becomes the reason your response times degrade and your database buckles under load. Today, AppSignal adds N+1 detection to its OpenTelemetry support. When we identify the pattern in a trace, we collapse the repetitive spans directly in the timeline, making the problem immediately visible in the trace itself.
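
To make the pattern concrete, here is a minimal sketch of an N+1 access path next to a single-query equivalent. It uses Python's built-in `sqlite3` module with hypothetical `authors` and `posts` tables, and it counts queries by hand. It illustrates the problem only; it is not AppSignal's detection logic, which the excerpt does not describe.

```python
import sqlite3


def build_db():
    """Create an in-memory database with hypothetical authors/posts tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
        INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO posts VALUES (1, 1, 'First'), (2, 2, 'Second'), (3, 1, 'Third');
        """
    )
    return conn


def titles_n_plus_one(conn):
    """One query for the posts, then one extra query per post (the N+1 pattern)."""
    queries = 0
    posts = conn.execute(
        "SELECT id, author_id, title FROM posts ORDER BY id"
    ).fetchall()
    queries += 1
    result = []
    for _post_id, author_id, title in posts:
        # This lookup runs once per row, so the query count grows with the data.
        name = conn.execute(
            "SELECT name FROM authors WHERE id = ?", (author_id,)
        ).fetchone()[0]
        queries += 1
        result.append((title, name))
    return result, queries


def titles_joined(conn):
    """Fetch the same data with a single JOIN, regardless of row count."""
    rows = conn.execute(
        "SELECT p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    ).fetchall()
    return rows, 1


if __name__ == "__main__":
    conn = build_db()
    print(titles_n_plus_one(conn))
    print(titles_joined(conn))
```

With three posts, the first function issues four queries and the second issues one, while both return the same rows. In a trace, the per-row lookups are the kind of repetitive spans that the excerpt describes being collapsed in the timeline.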

Grave improvements: Native crash postmortems via Android tombstones

Native crashes on Android have always been harder to debug than they should be. The platform has its own crash reporter (debuggerd) that captures the crashing thread, every other running thread, register state, and memory maps into a file called a tombstone. Tombstones have been a part of Android for a long time; in fact, they’ve been there in one form or another since Android's first commit.

What Is an AI SRE? And Why Do They Need Live Runtime Evidence?

AI SREs are autonomous systems that handle incident triage, root cause analysis, and remediation by correlating logs, metrics, traces, and code signals. Because they rely on pre-configured telemetry, however, they often miss the critical execution details of a specific failure, such as variable state and code paths. As a result, they either force users into manual redeploy loops or make inferences from partial data, diagnosing issues using probability rather than proof.