A slow checkout request. A background job stuck waiting on another service. A log message that looks fine — until performance drops. In a Node.js microservices setup, these are the moments that test your observability. You know something's wrong, but tracing the request across dozens of services feels impossible. Distributed tracing changes that. It connects every span in the request's journey, showing exactly where time is spent and where things start to break down.
Most developers use auto-instrumentation as it’s meant to be used — run the Java agent, add NODE_OPTIONS, and telemetry starts flowing. When it stops, though, figuring out why can be tricky. Maybe the agent didn’t load, maybe there’s a framework version mismatch, or something else entirely. Understanding how auto-instrumentation works makes it easier to spot and fix these issues.
Browser tracing has always been one of those things that feels invisible until it isn’t. When it works well, you get clear, actionable insights into how your app is performing in the wild. When it doesn’t, you’re left staring at noisy data, gaps in traces, and spans that don’t quite tell the story. Over the last few months, we’ve been chipping away at that problem.
Distributed tracing is an essential technique for monitoring modern, cloud-native applications. It provides a holistic view of a request's entire journey as it propagates through a multi-service architecture, making it invaluable for performance optimization and root cause analysis. But how do you generate and collect this trace data in a standardized, vendor-agnostic way? That's where OpenTelemetry comes in.
In the previous post, What is OpenTelemetry?, we went over the What, Why, and the How of OpenTelemetry. We also went over the telemetry data lifecycle (data generation à collection à storage à usage) and how telemetry data (MELT) could be put to use to troubleshoot a representative web application scenario.
Your checkout flow starts in an AWS Lambda function, calls a payment service running on EKS, then triggers notifications through Azure Functions. Three different compute platforms, two cloud providers, one distributed trace that you can't see. Cloud providers want you to use their native monitoring tools. AWS pushes X-Ray and CloudWatch. Azure promotes Application Insights and Azure Monitor. These tools work well within their ecosystems but lock you into vendor-specific implementations.
Your production checkout flow just started returning 500 errors. Six microservices handle checkout. Logs show errors in three of them. Which service broke? Which error happened first? What caused the cascade? Traditional debugging doesn't work. You can't attach a debugger to production. Searching logs across six services gives thousands of lines with no obvious connection. By the time you correlate timestamps and trace IDs manually, customers have abandoned their carts.
OpenMetrics and OpenTelemetry are popular standards for instrumenting cloud-native applications. Both projects are part of the Cloud Native Computing Foundation (CNCF) and aim to simplify how we generate, collect and monitor services in a modern cloud-native distributed application environment. Let's have a look at how both the standards are aiming to help solve the observability conundrum.
It’s 3 AM. API latency just spiked from 200ms to 2s. Alerts are firing, and users are frustrated. You SSH into the first server: top, free -h, iostat — nothing unusual. On to the next host. And the next. That’s how most of us learned to debug. The tools worked, and we got good at using them. But as infrastructure became distributed and dynamic, this approach started to break down. Modern monitoring needs more than SSH and top. It needs unified telemetry.