Operations | Monitoring | ITSM | DevOps | Cloud

Monitor the Health, Performance, and Security of Your AI Application Stack with AI Agent and AI Infrastructure Monitoring

At this year’s.conf25, we introduced an exciting new chapter in observability at Splunk — one that is unified, AI-powered, and agentic — to ensure ITOps and engineering teams are digitally resilient in the AI era.

Introducing Event iQ: Smarter Event Correlation in Splunk IT Service Intelligence (ITSI)

Every day, IT teams are flooded with alerts—thousands of messages about performance issues, service outages, or suspicious activity. With so many notifications, it’s easy to get overwhelmed, miss critical problems, or waste time chasing false alarms. Correlating related alerts into groups can help reduce the noise and make sense of everything, but setting up those correlations takes time, experience, and a lot of both system and historic knowledge.

DX UIM Hub Interconnectivity and the Benefits of Static Hubs

In DX Unified Infrastructure Management (DX UIM), there are multiple elements that need to work in harmony to achieve a high level of observability. Understanding the architecture of DX UIM can help you make configuration decisions that minimize resource consumption, without sacrificing the volume and granularity of observability data collected. In addition, using static hubs is a simple and particularly powerful option for specific situations.

4 Ways AppNeta Enhances Cost-Focused Cloud Planning

Enterprises are hemorrhaging their cloud budget: respondents in 49% of organizations estimate their cloud spending is wasted due to unchecked provisioning and lack of predictive cost governance. This cost inefficiency stems not just from financial blind spots, but also from operational gaps: poor visibility into network reliability and user experience. Real-time, end-to-end visibility is the foundation of cloud optimization.

Interactive Dashboards - Click Any Panel to Start Debugging

Your dashboard shows a latency spike. To investigate it, you copy the query, open logs in a new tab, paste and modify the query, lose your dashboard filters, and repeat for traces. By the time you find the issue, you have 15 tabs open. Starting today, you can click any panel and investigate right there. All your filters and variables carry over. No more tab juggling.

Measuring service response time and latency: How to perform a TCP check in Grafana Cloud Synthetic Monitoring

When your database stops accepting connections or your mail server becomes unreachable during business hours, the impact is immediate and costly. Fortunately, the right monitoring strategy can help you detect these TCP connection failures early on, and prevent them from impacting the user experience.

Honeycomb MCP Is Now In GA With Support for BubbleUp, Heatmaps, and Histograms

If you’ve been following my public journey with LLMs this year, it probably won’t surprise you to learn that this blog post is an announcement about the general availability of Honeycomb’s hosted MCP server. I want to share a few updates about what’s new in the GA release, discuss some interesting learnings from building it, and share examples of how we’re using MCP internally. First: if you're still in the dark about MCP and AI agents, go read the earlier blogs I linked.

The Answer to SRE Agent Failures: Context Engineering

AI agents for SREs were supposed to slash mean time to resolution and eliminate alert fatigue. Instead, most teams got expensive, unreliable tools that burn through tokens without delivering insights. But what if the problem isn't the AI models themselves? Recent benchmarking reveals the real bottleneck: context engineering. When we tested our context engineering approach against conventional methods, the results were dramatic: Scroll down for our benchmark results to see the full comparison.

Why it's time to move beyond APM: Monitoring from the user's perspective

For years, organizations have relied on Application Performance Monitoring (APM) as the backbone of their observability strategy. The idea was simple: collect as many logs, metrics, and traces as possible, then sift through the data to uncover insights. But as applications have shifted to the cloud and become increasingly API-driven, that model has broken down.

Subsea Cables Parted in Red Sea Again

This past weekend saw the latest round of submarine cable cuts to impact internet connectivity between Europe and Asia. And once again they took place in the Red Sea, an historic problem area for subsea cables. In this post, I review some of the impacts that we observed in both the loss of transit in affected countries as well as increased latencies between public cloud regions using Kentik’s Cloud Latency Map.