
Debugging Microservices in Production with Distributed Tracing

Your production checkout flow just started returning 500 errors. Six microservices handle checkout. Logs show errors in three of them. Which service broke? Which error happened first? What caused the cascade? Traditional debugging doesn't work. You can't attach a debugger to production. Searching logs across six services gives thousands of lines with no obvious connection. By the time you correlate timestamps and trace IDs manually, customers have abandoned their carts.
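
To make the manual correlation step concrete, here is a minimal sketch of what it amounts to: group log records by trace ID, then pick the earliest error in each trace. The record layout (`ts`, `service`, `trace_id`, `level`, `msg`), the service names, and the sample data are all hypothetical, chosen only for illustration. A real tracing backend would do this from span data rather than from log lines.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, hand-written log records; field names are assumptions.
LOGS = [
    {"ts": "2025-01-01T12:00:00.100", "service": "cart", "trace_id": "t1", "level": "INFO", "msg": "checkout started"},
    {"ts": "2025-01-01T12:00:00.350", "service": "cart", "trace_id": "t1", "level": "ERROR", "msg": "checkout failed"},
    {"ts": "2025-01-01T12:00:00.300", "service": "payment", "trace_id": "t1", "level": "ERROR", "msg": "upstream error"},
    {"ts": "2025-01-01T12:00:00.250", "service": "inventory", "trace_id": "t1", "level": "ERROR", "msg": "db timeout"},
    {"ts": "2025-01-01T12:00:01.000", "service": "cart", "trace_id": "t2", "level": "INFO", "msg": "checkout started"},
]


def first_error_per_trace(records):
    """Return {trace_id: earliest ERROR record} for traces that contain errors."""
    by_trace = defaultdict(list)
    for record in records:
        by_trace[record["trace_id"]].append(record)

    first_errors = {}
    for trace_id, trace_records in by_trace.items():
        errors = [r for r in trace_records if r["level"] == "ERROR"]
        if errors:
            # The earliest error is the most likely starting point of the cascade.
            first_errors[trace_id] = min(
                errors, key=lambda r: datetime.fromisoformat(r["ts"])
            )
    return first_errors


if __name__ == "__main__":
    for trace_id, record in first_error_per_trace(LOGS).items():
        print(trace_id, record["service"], record["msg"])
```

In this sample, three services log errors for the same trace, but the earliest one comes from the hypothetical `inventory` service, which is where an investigation would start.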

Cloud Microservices Monitoring on AWS and Azure with OpenTelemetry

Your checkout flow starts in an AWS Lambda function, calls a payment service running on EKS, then triggers notifications through Azure Functions. Three different compute platforms, two cloud providers, one distributed trace that you can't see. Cloud providers want you to use their native monitoring tools. AWS pushes X-Ray and CloudWatch. Azure promotes Application Insights and Azure Monitor. These tools work well within their ecosystems but lock you into vendor-specific implementations.

OpsHelm goes multi-cloud with Aiven Diskless BYOC, cuts costs by 78% over MSK

In under a month, OpsHelm, the continuous, enriched changelog for cloud infrastructure, migrated its streaming backbone from MSK and NATS to Aiven Diskless Kafka (BYOC on AWS). The switch eliminated cross-cloud networking fees, collapsed multiple storage layers into one, and cut total streaming costs by 5x (from >$50,000/year to <$10,000/year) while giving the team a single logical event bus that stretches across multiple regions and accounts.

Sending beers all across Belgium, a throwback to how we named Oh Dear

We're obviously a little biased, but we believe we have one of the best website monitoring tools on the market today, leading in features compared to our competitors. We've already tried a variety of marketing techniques to promote our service, but none really had the impact we were looking for. Maybe we're better at actually building good software than we are at marketing it? Or are we trying what everyone else is also doing, thus making it all harder?

3 things you can do to get closer to five nines

5 minutes a year. That’s how much downtime some of the world’s largest enterprises will tolerate. For most organizations, five nines (99.999%) of availability sounds like a pipe dream. But the trick to increasing availability isn’t massive infrastructure spending or complex system redesigns. All it takes are three key practices that any team can adopt and implement. In this post, we’ll present these practices and how we implement them at Gremlin.
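
The "5 minutes" figure follows directly from the availability percentage: five nines leaves 0.001% of the year as the downtime budget. The short sketch below works through that arithmetic, assuming a 365-day year; the function name is illustrative and not taken from any tool mentioned here.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the allowed downtime, in minutes, for a given availability target."""
    return period_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(target)
        print(f"{target}% availability -> {budget:.2f} minutes of downtime per year")
```

For 99.999% this comes to roughly 5.26 minutes per year, and each additional nine cuts the budget by a factor of ten.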

Agentic AIOps in Action: LogicMonitor, IBM, and Red Hat Deliver Self-Healing IT

Your most skilled engineers shouldn’t be spending nights and weekends piecing together root causes of outages. Yet many organizations still rely on manual incident response across sprawling hybrid and multi-cloud environments. The result: slower resolution times, frustrated customers and lost revenue that can reach up to $1 million per hour according to IDC. At LogicMonitor, we believe the answer isn’t just better monitoring. It is systems that can heal themselves.

A closer look at Grafana k6 browser: alignment with Playwright, modern features for frontend testing, and what's next

Over the years, we’ve seen our community embrace Grafana k6 browser as a key component of their frontend testing strategies. By helping collect frontend web vitals, capture custom metrics, and simulate user actions like clicking buttons or completing forms, the module offers teams a deeper understanding of performance and availability from their end users’ point of view.

Inside the InfluxDB 3 Plugin Ecosystem

Companies today face growing pressure to manage and analyze massive flows of time series data, from IoT sensors to cloud-native infrastructure. Storing this information is relatively straightforward. The greater obstacle is keeping it useful and consistent while balancing a wide range of tools and modern technology platforms that continue to evolve.

What the 2025 DORA Report Teaches Us About Observability and Platform Quality

The 2025 DORA State of AI-Assisted Software Development Report delivers a critical insight for technology leaders: AI is fundamentally an amplifier, not a solution. It magnifies the strengths of high-performing organizations with robust observability while exposing the dysfunctions of struggling ones. For organizations that have rushed to adopt AI coding assistants all while expecting immediate productivity gains, this finding demands a strategic pivot.