
Interactive Dashboards - Click Any Panel to Start Debugging

Your dashboard shows a latency spike. To investigate it, you copy the query, open logs in a new tab, paste and modify the query, lose your dashboard filters, and repeat for traces. By the time you find the issue, you have 15 tabs open. Starting today, you can click any panel and investigate right there. All your filters and variables carry over. No more tab juggling.

Measuring service response time and latency: How to perform a TCP check in Grafana Cloud Synthetic Monitoring

When your database stops accepting connections or your mail server becomes unreachable during business hours, the impact is immediate and costly. Fortunately, the right monitoring strategy can help you detect these TCP connection failures early and prevent them from impacting the user experience.
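
As a rough illustration of what a TCP check measures, here is a minimal Python sketch that opens a connection to a host and port and records how long the connect took. It is not how Grafana Cloud Synthetic Monitoring implements its checks; the names `tcp_check` and `TcpCheckResult` and the default timeout are assumptions made up for this example.

```python
import socket
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class TcpCheckResult:
    """Outcome of one connection attempt (hypothetical type for this sketch)."""
    ok: bool
    latency_ms: Optional[float]  # time to complete the connect, if it succeeded
    error: Optional[str]         # error text, if it failed


def tcp_check(host: str, port: int, timeout: float = 3.0) -> TcpCheckResult:
    """Try to open a TCP connection and time how long the connect takes."""
    start = time.perf_counter()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            latency_ms = (time.perf_counter() - start) * 1000.0
            return TcpCheckResult(ok=True, latency_ms=latency_ms, error=None)
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return TcpCheckResult(ok=False, latency_ms=None, error=str(exc))


# Example call (the host name below is a placeholder, not a real service):
# result = tcp_check("db.internal.example", 5432)
# print(result.ok, result.latency_ms, result.error)
```

Running a probe like this on a schedule and alerting on failed connects or rising latency is the general idea behind a TCP check; a hosted service adds multiple probe locations, retries, and alert routing on top.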

Honeycomb MCP Is Now In GA With Support for BubbleUp, Heatmaps, and Histograms

If you’ve been following my public journey with LLMs this year, it probably won’t surprise you to learn that this blog post is an announcement about the general availability of Honeycomb’s hosted MCP server. I want to share a few updates about what’s new in the GA release, discuss some interesting learnings from building it, and share examples of how we’re using MCP internally. First: if you're still in the dark about MCP and AI agents, go read the earlier blogs I linked.

The Answer to SRE Agent Failures: Context Engineering

AI agents for SREs were supposed to slash mean time to resolution and eliminate alert fatigue. Instead, most teams got expensive, unreliable tools that burn through tokens without delivering insights. But what if the problem isn't the AI models themselves? Recent benchmarking reveals the real bottleneck: context engineering. When we tested our context engineering approach against conventional methods, the results were dramatic. Scroll down to our benchmark results for the full comparison.

Capacity Planning Still a Major Issue for Data Center Managers

Uptime Institute’s 2025 Global Data Center Survey shows that capacity planning remains a top challenge for operators. Nearly one-third of vendors identify forecasting future capacity requirements as their customers’ single biggest issue, more than any other concern. Modern data centers face new complexities as digital services expand and hybrid IT architectures shift workloads across on-premises, colocation, and cloud environments.

Why it's time to move beyond APM: Monitoring from the user's perspective

For years, organizations have relied on Application Performance Monitoring (APM) as the backbone of their observability strategy. The idea was simple: collect as many logs, metrics, and traces as possible, then sift through the data to uncover insights. But as applications have shifted to the cloud and become increasingly API-driven, that model has broken down.

The Enterprise Automation Platform Driving the Zero-Ticket Future

The surge of interest in artificial intelligence has opened exciting new doors, but many CIOs are finding themselves in the same bind: lots of promising pilots, but very few at-scale results. Intelligent agents can interpret requests, classify tickets, and even recommend fixes, but unless they are connected into broader workflows, these efforts remain isolated experiments.

Subsea Cables Parted in Red Sea Again

This past weekend saw the latest round of submarine cable cuts to impact internet connectivity between Europe and Asia. And once again they took place in the Red Sea, a historically problematic area for subsea cables. In this post, I review some of the impacts that we observed, both in the loss of transit in affected countries and in increased latencies between public cloud regions, using Kentik’s Cloud Latency Map.