NVIDIA's Jensen Huang just described your next big cost problem

On March 18, Jensen Huang took the stage at NVIDIA’s GTC conference in San Jose for a keynote that ran well over two hours — covering everything from CUDA’s 20-year history to humanoid robots that may one day wander Disneyland. But buried inside the spectacle was a remarkably clear-eyed articulation of the economic forces now bearing down on every enterprise that builds on cloud infrastructure.

Annotate traces to improve LLM quality with Datadog LLM Observability

LLM applications rarely crash. They degrade quietly. Once these applications are shipped to production, subtle quality failures become harder to catch with traditional signals. Tone shifts, hallucinated details, off-topic responses, and incomplete reasoning can emerge while latency and token usage look stable.

The Hidden AI Bill: Why Non-Prod LLM Costs Spiral

Most teams know they are spending money on AI in production. Far fewer realize how much they are spending outside production, where it’s easy to lose track while evaluating which model gives the best responses and is fast and cheap enough to run in production. That is because the AI bill usually shows up as one giant blob. It is easy to see the total.

Monitor Model Context Protocol (MCP) servers with OpenLIT and Grafana Cloud

Large language models don’t work in a vacuum. They often rely on Model Context Protocol (MCP) servers to fetch additional context from external tools or data sources. MCP provides a standard way for AI agents to talk to tool servers, but this extra layer introduces complexity. Without visibility, an MCP server becomes a black box: you send a request and hope a tool answers. When something breaks, it’s hard to tell if the agent, the server or the downstream API failed.

Observe your AI agents: End-to-end tracing with OpenLIT and Grafana Cloud

In another post in this series, we discussed how to instrument large language model (LLM) calls. This can be a good starting point, but generative AI workloads increasingly rely on agents, which are systems that plan, call tools, reason, and act autonomously. Their non‑deterministic behavior makes incidents harder to diagnose, in part because the same prompt can trigger different tool sequences and costs.

How to monitor LLMs in production with Grafana Cloud, OpenLIT, and OpenTelemetry

Moving a large language model (LLM) application from a demo to a production‑scale service raises very different questions than the ones you ask when playing with an API key in a notebook. In production, you have to answer: How much is each model costing us? Are we keeping latency within our service‑level objectives? Are we accidentally returning hallucinations or toxic content? Is the system vulnerable to prompt‑injection attacks?

Seer fixes Seer: How Seer pointed us toward a bug and helped fix an outage

Seer is our AI agent that takes bugs and uses all of the context Sentry has to find the root cause and suggest a fix. We use it all the time to help us improve Sentry. Seer fixes Sentry. More recently, Seer has been helping us fix itself — Seer fixing Seer. An upstream outage triggered a bit of an avalanche, revealing a bug that had been hiding away for months. When it came time to fix it, Seer pointed us exactly where we needed to look.

Harness AI for Argo CD

Managing GitOps at scale shouldn’t feel like an endless game of "Whac-A-Mole." In this 3-minute demo, we show how Harness AI moves beyond simple syncs to provide agentic troubleshooting and automated orchestration for your entire GitOps estate. Watch as we use the Harness DevOps Agent to: Identify Common Failure Patterns: Instead of clicking through individual clusters, we ask the AI to analyze 4 out-of-sync applications simultaneously.