Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Working as a remote engineer at Cribl | Building the AI Platform for Telemetry

Learn what it’s like to work as an engineer at Cribl, a remote-first company building the AI platform for IT and security data. In this recruiting video, Cribl’s engineering and support leaders share how fully distributed teams collaborate, solve hard data problems, and grow their careers while working from around the world. You’ll hear from managers and leaders in site reliability engineering, security incubation, and technical support.

KWhy? MSP Webinar

Most MSPs are sitting on a goldmine of data across their tools. The problem isn’t access, it’s knowing what *actually* matters… and how to use it to drive better outcomes. Join Amanda Doucette-Lachapelle and Kyle Christensen (Empath) as they walk through how to use KPIs to make smarter, more confident decisions, with real examples you can apply right away.

The Data Plane Reality: OTel Scales, While Topology UX Lags

OpenTelemetry won the architectural standards battle. At scale, though, telemetry breaks more like plumbing than code. It breaks quietly, across a graph, with a blast radius you don’t understand until it’s expensive. With over 65% of organizations now running more than 10 collectors in production, hybrid deployments across Kubernetes and VMs are accelerating fast. Telemetry standardization is no longer a project milestone. It is a baseline expectation.

Service Level Agreement (SLA) Templates: Examples, Metrics, and Best Practices

How quickly should your team resolve a critical ticket, and what are the consequences when it misses the target? That is exactly where Service Level Agreements (SLAs) come into play. An SLA turns service expectations into measurable commitments by defining clear response and resolution targets. Rather than starting from scratch, an SLA template provides a structured foundation for establishing those commitments and tracking performance against agreed standards. Why does that matter?
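As a rough illustration of what "measurable commitments" can look like in practice, here is a minimal Python sketch that checks one ticket against response and resolution targets. The class names, the `sla_breaches` helper, and the 15-minute and 4-hour figures are all hypothetical and chosen for the example; they are not taken from any particular SLA template. The sketch also assumes plain calendar time rather than business hours.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class SlaTarget:
    """Hypothetical SLA targets for one priority level."""
    response: timedelta    # time allowed until the first reply
    resolution: timedelta  # time allowed until the ticket is resolved


@dataclass
class Ticket:
    opened: datetime
    first_response: Optional[datetime] = None  # None means no reply yet
    resolved: Optional[datetime] = None        # None means still open


def sla_breaches(ticket: Ticket, target: SlaTarget, now: datetime) -> List[str]:
    """Return the names of the targets this ticket has missed so far."""
    breaches = []
    # A ticket with no reply yet is measured against the current time.
    responded_at = ticket.first_response or now
    if responded_at - ticket.opened > target.response:
        breaches.append("response")
    # A ticket that is still open is measured against the current time.
    resolved_at = ticket.resolved or now
    if resolved_at - ticket.opened > target.resolution:
        breaches.append("resolution")
    return breaches


# Assumed example values: 15 minutes to respond, 4 hours to resolve.
CRITICAL = SlaTarget(response=timedelta(minutes=15), resolution=timedelta(hours=4))

ticket = Ticket(
    opened=datetime(2026, 1, 5, 9, 0),
    first_response=datetime(2026, 1, 5, 9, 10),
    resolved=datetime(2026, 1, 5, 14, 0),
)
missed = sla_breaches(ticket, CRITICAL, now=datetime(2026, 1, 5, 15, 0))
print(missed)
```

In this example the first reply arrives within the response target, but the ticket takes five hours to resolve, so only the resolution target is reported as missed.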

Agent Timeline Is Now Generally Available

A few weeks ago I wrote about a customer’s refund request that stopped halfway through at 11:47 p.m. on a Tuesday night. That post walked through the 40 minutes it took to work out what happened when an agentic application had a problem: a tool retried against a rate-limited payments API, the error responses filled up the context window, and the agent gave up. The whole reason we built Agent Timeline was to turn that 40 minutes into five. To reduce MTTR. To solve the problem and get back to sleep.

The Second Edition of Observability Engineering Is Here

IT’S HERE it’s here it’s here it’s here!!!! The second edition of Observability Engineering is available for download, and since Honeycomb is the sponsor, you can now download it from our website (the dead tree version will take another month). This is a strange time to be writing a book.

Troubleshooting ActiveMQ Producer Flow Control Blocks

The alert comes in at 2 AM: your order processing service is unresponsive. The application has not crashed, threads are running, the JVM is healthy, but no messages are being sent. Your operations team traces it to a blocked send() call on an ActiveMQ connection. Hours later, after restarting the application, someone finds this line in the broker log from 11 PM the previous day.

5 Alternatives to Prometheus in 2026

Prometheus is a battle-tested, flexible and, most importantly, free tool that has long been the go-to open-source monitoring solution. Much of its popularity came down to its simplicity. A few years have gone by, though, and the APM space has gotten pretty crowded. Developers are now starting to move away from the complexity of self-hosting, and OpenTelemetry stands out as one of the CNCF’s fastest-expanding projects. In fact, it’s now among the most adopted telemetry frameworks out there.

Monitoring website that redirects to a different URL

Is it necessary to monitor a website that redirects to a different URL? Imagine a user visits a URL and is automatically redirected to a new main URL without taking any action. This process is called URL redirection. It typically occurs when a web server sends a 3xx HTTP status code and a Location header with the new URL. Sometimes there is only one redirect, but in other cases, the request passes through several URLs before reaching the final page.
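To make the redirect chain concrete, here is a minimal Python sketch that follows 3xx responses one hop at a time and records every URL along the way. The `trace_redirects` function, the `Fetcher` type, and the `FAKE_SITE` table are hypothetical names introduced for this example. The fetcher is passed in so the logic can run without a network; a real monitor would supply a function that issues an HTTP request with automatic redirect following turned off.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# A fetcher returns (status_code, Location header value or None) for a URL.
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def trace_redirects(url: str, fetch: Fetcher, max_hops: int = 10) -> List[Tuple[str, int]]:
    """Follow 3xx responses by hand and return every (url, status) hop."""
    hops: List[Tuple[str, int]] = []
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            raise RuntimeError(f"redirect loop at {url}")
        seen.add(url)
        status, location = fetch(url)
        hops.append((url, status))
        # Stop at the first response that is not a redirect with a target.
        if not (300 <= status < 400 and location):
            return hops
        # The Location value may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError(f"more than {max_hops} redirects")


# Assumed stand-in for a real site: two redirects, then the final page.
FAKE_SITE = {
    "http://example.com/": (301, "https://example.com/"),
    "https://example.com/": (302, "/home"),
    "https://example.com/home": (200, None),
}


def fake_fetch(url: str) -> Tuple[int, Optional[str]]:
    return FAKE_SITE[url]


chain = trace_redirects("http://example.com/", fake_fetch)
for hop_url, hop_status in chain:
    print(hop_status, hop_url)
```

Recording each hop, rather than only the final page, lets a check flag an unexpected intermediate URL, a chain that has grown longer than intended, or a loop.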