
The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Top 7 Microservices Monitoring Tools to Consider in 2025

Let's talk about keeping those microservices in check. If you're running a distributed system (and who isn't these days?), you know the drill – more services mean more potential failure points. We've got the lowdown on the best microservices monitoring tools that'll have your back in 2025.

Dynatrace vs Elastic stack - A Detailed Comparison for 2025

Organizations looking for monitoring and observability solutions often compare ELK (Elasticsearch, Logstash, and Kibana) and Dynatrace. While both tools serve the purpose of log management and monitoring, their approaches, features, and use cases differ significantly. This article provides an in-depth ELK Stack vs Dynatrace comparison, helping users understand which tool best suits their needs.

Utilizing browser emulation and automation languages in digital experience monitoring

With multiple factors affecting the performance of online businesses, offering glitch-free transactions has become a necessity. A key component of delivering a great user experience is effective digital experience monitoring (DEM), which involves closely tracking performance across different devices, browsers, and locations.

Debugging performance issues in Azure Service Bus

Azure Service Bus is a critical messaging service for building scalable cloud applications, but performance bottlenecks can lead to delayed message processing, throttling, or even dropped messages. It is essential to identify and resolve these issues to maintain smooth application workflows and prevent downtime. This blog explores common Azure Service Bus performance problems, provides step-by-step debugging strategies, and highlights how proactive monitoring can prevent recurring issues.

Top 10 Changes and Key Improvements in Apache Kafka 4.0.0

In this post, we summarize the major changes in the recently released Apache Kafka 4.0.0. We look at its most notable features compared to previous versions and explain what these changes mean in real production environments and what improvements they can bring to your streaming infrastructure.

Top 6 EC2 rightsizing recommendations that you can't ignore

Imagine a day at work where you realize that your team’s youngest developer has failed to kill a compute instance; the bill spikes and the budget is breached. Rightsizing recommendations would come to the rescue and play a crucial role in such situations by identifying underutilized, overutilized, or mismanaged resources and suggesting corrective actions.

Better CloudWatch Metrics in Honeycomb with the OpenTelemetry Collector

CloudWatch metrics can be a very useful source of information for a number of AWS services that don’t produce telemetry as well as instrumented code. There are also a number of useful metrics for non-web-request based functions, like metrics on concurrent database requests. We use them at Honeycomb to get statistics on load balancers and RDS instances. The Amazon Data Firehose is able to export directly to Honeycomb as well, which makes getting the data into Honeycomb straightforward.

Preventing Alert Storms with InfluxDB 3's Processing Engine Cache

A common problem in monitoring and alerting systems is not just alerting on what you’re seeing but preventing alert storms from overwhelming operators. When a system generates multiple notifications for the same incident, it leads to alert fatigue and can mask other important issues. For time series data, alert fatigue can result in missed anomalies, delayed responses to critical trends, and difficulty distinguishing real performance degradations from noise.
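
As a rough illustration of the idea of suppressing repeated notifications for the same incident, here is a minimal Python sketch of a cooldown cache. It is not the InfluxDB 3 Processing Engine API; the class name `AlertSuppressor`, the `incident_key` format, and the 300-second cooldown are all hypothetical choices made for this example.

```python
import time
from typing import Callable, Dict


class AlertSuppressor:
    """Hypothetical in-memory cooldown cache, keyed by incident.

    An alert for a given key is allowed through at most once per
    cooldown window; repeats inside the window are suppressed.
    """

    def __init__(
        self,
        cooldown_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._last_sent: Dict[str, float] = {}

    def should_notify(self, incident_key: str) -> bool:
        """Return True if a notification for this incident should be sent."""
        now = self._clock()
        last = self._last_sent.get(incident_key)
        if last is not None and now - last < self.cooldown_seconds:
            # Same incident fired recently: suppress to avoid an alert storm.
            return False
        # First alert, or the cooldown has expired: record it and notify.
        self._last_sent[incident_key] = now
        return True


if __name__ == "__main__":
    suppressor = AlertSuppressor(cooldown_seconds=300)
    for _ in range(3):
        if suppressor.should_notify("high_cpu:host-a"):
            print("notify operators: high_cpu on host-a")
```

Keying the cache per incident (rather than globally) means a burst of duplicates for one host does not hide a first alert from another host, which is the masking problem described above.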

Dashboard updates: Fewer clicks, more control, faster widget building

You're reviewing your production metrics when suddenly an error spike appears on your dashboard. Your immediate thought isn't "how do I build a new view to investigate this?" but rather "how do I find out the cause quickly?" This is exactly what happened to one of our engineering teams last month when they spotted an unusual pattern in their API response times. Instead of running ad-hoc queries from scratch, they turned to a custom dashboard they had built after a past incident.