Full-stack observability in Grafana Cloud: How to investigate issues across services and infrastructure

Many times, the hardest part of troubleshooting isn’t fixing the actual problem. It’s figuring out where to start. As engineers, it’s easy to lose count of how many times we’ve opened logs, then 10 metrics tabs, and another 10 tabs with trace queries, only to end up back in the logs trying to find a root cause.
Sponsored Post

From Dashboards to Conversational AI: The Evolution of UI in IT Products

The way IT teams interact with technology has changed dramatically over the years. From early text-based interfaces to today's dashboards and now conversational AI, each stage has reshaped how we monitor, diagnose, and understand complex IT environments. But while dashboards gave us visibility, they often led to more questions than answers. In this post, we briefly explore the evolution of UI in IT products and how conversational AI is bridging the gap between data and understanding.

Getting started with Microsoft Defender dashboards

Microsoft Defender does a great job protecting you and your organization from online threats. It is constantly working to detect and collect security data so you don’t have to worry about falling behind on incidents and vulnerabilities. The Defender portal can also provide great insights into that data, but connecting it to the rest of your stack is difficult.

Chart Your Team's Analytics Journey with Customizable Dashboards in DX NetOps

DX NetOps now features customizable dashboards that give all users some important new features and capabilities. In addition, with the solution’s new integration capabilities, DX NetOps enables users of current analytics and reporting tools to add standardized dashboards over time.

Overview of AI Evaluation (The Context Window #05)

Can you actually trust an AI agent? In this pre-recorded episode of The Context Window, Nicole van der Hoeven sits down with Yas Ekinci, an engineer on the Grafana AI team, to talk about evals: how Grafana measures the quality and reliability of the AI it ships. They get into the difference between online and offline evals, why reviewing AI-generated code has become the real bottleneck, the "final answer problem" of plausible-but-wrong outputs, and o11y-bench, Grafana's open benchmark for observability agents.

How Grafana Cloud Ingests Your Data | Data Sources, Alloy & OTel Explained

Learn the two main ways to get data into Grafana Cloud. In this video, we break down how Grafana Cloud connects to over 150 external data sources (like Salesforce, Postgres, and CloudWatch) where your data stays in place, and how you can send raw telemetry into Grafana’s fully managed databases for logs, metrics, traces, and profiles.

Grafana 13.1 release: observability as code updates, extending Grafana Assistant across more data sources, and more

Earlier this year, Grafana 13 laid the groundwork for making it easier and faster than ever to turn your data into actionable insights. With our latest minor release, Grafana 13.1, we're building on that foundation, expanding observability as code, bringing Grafana Assistant to more data sources, and streamlining the everyday workflows teams rely on to visualize, analyze, and act on their data. Below are just some of the highlights from Grafana 13.1.

Vendor Outage Monitoring for MSPs: Per-Client Status Pages and Custom Dashboards

Handling client calls when a third-party vendor has an outage - this will sound familiar if you are a managed service provider (MSP). Your first instinct would be to check if the vendor's status page or social media handle shows anything, or check crowdsourced websites like Downdetector. Or even ask your client to check themselves. These approaches do not scale when you have more than a few clients, many vendor status pages to check, and clients with different stacks.

Observability for a Privacy-first AI Wearable | Grafana Everywhere

Trust is everything when AI gets personal. Golden Grot Award winner and NeoSapien co-founder and CEO Dhananjay Yadav shares how his team uses Grafana Assistant to ensure the privacy-first AI wearable delivers a seamless, reliable experience without compromising its mission. Because when AI moves closer to our everyday lives, teams need to know what’s happening — and users need to trust that it’s working as intended.

Inside the AI Team Weekly: AI Observability workflows and Prometheus exemplars (May 19th, 2026)

The Grafana AI team (Engineers Ivana Huckova and Sonia Aguilar) share what's new in AI Observability this week: a new way to instrument and visualize agent workflows, plus a neat trick for jumping straight from a metric spike to the exact conversation that caused it using Prometheus exemplars. We're showing parts of our team meetings to build in public in some small way and give you a sneak preview of what's to come. But not all features we show may make it to production! You've been warned. :)

Analysing Claude Code telemetry with SquaredUp - diving deeper

In our previous article we looked at the basics. In this article, we are going to take a deeper dive into some of the complexities of configuration as well as some of the nuances of analysing Claude telemetry. Before we dive into the code, let us remind ourselves of our telemetry pipeline: we are emitting Claude Code telemetry to an OpenTelemetry Collector. The telemetry is then exported to an Application Insights endpoint and stored in Log Analytics tables.

How to use Postman Visualizer: a step-by-step guide

API responses are often easier to understand when they are displayed visually instead of as raw JSON. While Postman is widely used for testing APIs, many developers overlook one of its most useful features: the Postman Visualizer. Although it is not as fully featured as a dedicated dashboarding platform like SquaredUp, it is a great way to quickly visualize API responses during development and debugging.

Visualising Claude Code telemetry in SquaredUp

Engineering teams are shipping more AI-generated code than ever, but at what cost? Learn how to build a telemetry pipeline to monitor Claude Code usage and costs directly in SquaredUp. It is estimated that 85-90% of engineering teams are now using AI coding assistants such as Claude, Codex and Cursor. This is not just for small-scale pilot projects: around 40% of all code now being shipped is AI-generated, and in start-ups the figure is around 95%. This can result in incredible productivity gains.

Getting started with Prometheus dashboards

Prometheus is a wildly popular open source monitoring tool typically used for monitoring Kubernetes environments and containerized workloads. But how do you turn the mountains of metrics into a clear picture of health and performance? SquaredUp plugs directly into your Prometheus database to visualize and monitor your data. What sets SquaredUp apart from other Prometheus visualization options like Grafana and Perses is just how easy it is to visualize, monitor and share Prometheus dashboards.

Grafana Tempo: The distributed tracing journey to 3.0 (June 2026 Community Call)

Our distributed tracing journey from the inception of Tempo to 3.0. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, traces, and profiles.

From API to live dashboard - building a SquaredUp plugin with AI

No matter how fast we build, we'll never integrate with every tool. There are too many, new ones appear constantly, and some are too niche to ever reach the top of our roadmap. So if the tool you care about isn't supported yet, your options have been to wait for us to get to it, or build it yourself with our Web API plugin — a powerful, flexible option, though one that asks you to map out the endpoints, authentication and paging yourself.

Automatically discover and remediate root causes with Grafana Assistant Investigations

You can use Grafana Assistant Investigations to automatically discover incidents and help find root causes—and this AI-powered Grafana Cloud feature recently got a major upgrade to give you even more confidence in its findings. You can read more about the behind-the-scenes effort in our new engineering blog Unprompted, where we get into harness engineering, context compaction, benchmarking, and keeping agents alive and working well in long-running sessions.

Asimov's Zeroth Law of Robotics: testing and observing AI (ExpoQA 2026)

Asimov's Three Laws of Robotics are missing one — and when it comes to testing and observing AI, Nicole van der Hoeven argues that missing rule changes everything: before a robot can avoid harm, obey orders, or protect itself, there has to be a Zeroth Law: a robot must be observable. Because if you can't see what a system is doing, you have no way of knowing whether it's following any rule at all.

Why Engineers Don't Trust Autonomous AI - 4th Annual Observability Survey | Grafana Labs

The 2026 Observability Survey from Grafana Labs heard from over 1,300 engineers and leaders across 76 countries on the real-world role of AI in observability. The data reveals a sharp distinction between intelligence and autonomy — and a critical blind spot most teams have.

AI Observability Deep Dive Demo | Grafana Cloud

Grafana AI Observability is our new database and platform for observing AI Agents. Over the past year at Grafana Labs, we built Agents and needed a way to understand how they perform: what they cost, what their error rate and time to first token are, and how they behave. Grafana Staff Engineer Ivana Hučková provides a deep dive demo on how Grafana AI Observability connects our experience building Agents with our experience building observability systems.

Grafana Assistant Context Offloading

Context Offloading is a pipeline solution for managing observability with AI Agents. If you are building AI Agents that work with real data, the context window can very easily fill up with bloated context that the Agent does not really need. Sven demonstrates "Context Offloading", a solution that stores the JSON result and sends only a summary of the JSON blob to the model, making the LLM loop much quicker and keeping your context window small.
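
As a rough illustration of the idea described above (store the full JSON result, pass only a summary to the model), here is a minimal Python sketch. It is not Grafana Assistant's implementation: the in-memory store, the `offload` and `fetch` function names, and the summary fields are all hypothetical choices made for this example.

```python
import json
import uuid

# Hypothetical in-memory store (an assumption for this sketch); a real
# system might use a file, an object store, or a database instead.
_OFFLOADED = {}


def offload(result, max_keys=5):
    """Store the full JSON-serializable result and return a small summary.

    Only the summary would be placed in the LLM context; the full payload
    stays outside it and can be retrieved later by its handle.
    """
    handle = uuid.uuid4().hex
    payload = json.dumps(result)
    _OFFLOADED[handle] = payload

    # Describe the shape of the result without including its contents.
    if isinstance(result, dict):
        shape = {"type": "object", "keys": sorted(result)[:max_keys]}
    elif isinstance(result, list):
        shape = {"type": "array", "length": len(result)}
    else:
        shape = {"type": type(result).__name__}

    return {"handle": handle, "bytes": len(payload), **shape}


def fetch(handle):
    """Load the full result back when the agent actually needs it."""
    return json.loads(_OFFLOADED[handle])


if __name__ == "__main__":
    # A made-up tool result standing in for a large query response.
    big = {"series": [{"t": i, "v": i * 2} for i in range(1000)], "status": "ok"}
    summary = offload(big)
    print(summary)  # small dict: handle, size in bytes, type, top-level keys
```

The design choice here is that the summary carries a handle, so the agent can ask for the full data (or a slice of it) only when the summary is not enough.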

Observability for Healthcare Systems | Grafana Everywhere

Grafana Assistant is going places you might not expect — including healthcare. Golden Grot winner Oren Lion from TeleTracking reveals how Grafana Cloud supports their systems that help keep patient care moving — and how Assistant enables teams to get from “what happened?” to “here’s why” faster. From moon landings to patient care, Grafana is everywhere. Congratulations to Oren, Chris Johnson, Mark Munson, and the entire TeleTracking team on winning this year's Golden Grot Award for Pioneering AI in Observability!

Getting Started with NinjaOne dashboards

If you manage endpoints for a living, you'll know the problem isn't a lack of data. It's that there's too much of it, scattered across too many places. A modern IT team or MSP might be looking after thousands of devices spread across dozens of customer organizations, each generating a constant stream of alerts, patch results, antivirus events and disk warnings. NinjaOne does a great job of collecting all of that.

How to generate real-world load tests using Grafana Cloud k6 and production telemetry

For many development teams, a load test starts with a set of assumptions. You pick 100 virtual users because it sounds reasonable. You ramp for 30 seconds because that's what the tutorial showed. You set a 500ms threshold because it feels like a good target. The test passes, you ship the release, and production falls over at 6 p.m. on a Tuesday because your synthetic load never resembled how real users interact with your application.

Tempo 3.0 release: a new architecture for scale and lower TCO, TraceQL metrics GA, and more

Tempo started with a simple goal: make distributed tracing easier to run at scale. As tracing adoption has grown, however, so have the challenges, including higher data volumes, more complex architectures, and increasing demand for real-time insights directly from traces. Over the last year, we’ve been evolving Tempo’s architecture to meet that moment. And today, we’re sharing the results of those efforts with the release of Tempo 3.0.

Inside the Grafana AI Team Weekly: AI Observability for the OTel demo and LLMSpec (May 12, 2026)

This is an excerpt from a real AI team weekly meeting where we talk about the stuff we build and occasionally also demo it! In this one, Principal Software Engineer Sven Großmann demos how he integrated AI Observability into the OTel demo, complete with the guards feature he introduced last week, and Principal Software Engineer Yas Ekinci gives a rare glimpse of LLMSpec, the internal counterpart of the o11y-bench benchmark that we use to evaluate Assistant.

What's New in Tempo 3.0

Tempo 3.0 introduces a major architectural shift that decouples the read and write paths, with Kafka handling durability on the write side and a new live store serving recent traces on the read side. Blocks are now written at a replication factor of one instead of three, significantly reducing storage overhead. This release also brings TraceQL metrics to general availability, adds comparison operators for filtering metric results at query time, and introduces a new Tempo CLI redact command for removing sensitive trace data on demand without waiting for retention to expire.