AI Observability: How to Keep LLMs, RAG, and Agents Reliable in Production

AI observability closes the gap between “something’s wrong” and “here’s what to fix.” If you run AI in production, you have probably felt the whiplash. Yesterday, your LLM answered in 300 milliseconds (ms). Today, p99 latency crawls, costs spike, and nobody’s sure whether the culprit is model behavior, data freshness, or GPUs stuck at the ceiling. Dashboards light up, but they don’t tell you which issue puts customers at risk. That’s the gap AI observability closes.

What Are AI Workloads? Everything Ops Teams Need to Know

AI workloads break every assumption you have about infrastructure management. AI is everywhere. Machine learning-based tools are answering customer service questions, accelerating incident resolution, catching fraudulent transactions, spotting defects on production lines, and powering late-night searches that delve into the random topic that pops into your head right before bedtime. Behind every prediction, response, or generated sentence is massive computing power doing serious, continuous work.

AI Monitoring, Explained: Challenges, Core Components, and Why Observability Is the Next Step

Monitoring AI systems isn’t business as usual, and it isn’t like monitoring traditional systems. You can’t just track uptime or response times and call it a day. AI models evolve, data shifts, and behavior drifts over time, which means your monitoring has to evolve, too. If you’re running AI workloads in production, you already know this: your models can look healthy according to your infrastructure metrics while still making bad predictions.

AI for Good: Securing Networks in the Age of Autonomous Attacks

The rise of autonomous AI attacks operating at machine speed demands that network security evolve beyond human capacity and manual processes. Kentik AI Advisor counters this threat by using AI for good, reasoning across full network context to proactively eliminate vulnerabilities and guide immediate, confident defense.

Architecture for the agentic era: How AI will reshape data, security, and observability

As AI agents move from copilots to autonomous systems, they’re generating and consuming data at unprecedented scale. The result is a new kind of infrastructure pressure, one that’s quietly reshaping how organizations think about data, cost, and control. Across IT, Security, and Observability, leaders are realizing a hard truth: too much data is too costly.

AI Isn't Here to Replace Your Dashboard... Yet

Non-deterministic UIs are the future and will replace your dashboards, but they’re not here yet. Until then, we’re stuck with conversational interfaces. To describe what I think the future of UIs looks like, I wrote about how you (and I) have been designing dashboards wrong. The core insight was that we’ve been designing for static representations of data that sit on a TV in the office, when the actual use case is someone at a desk using them to debug an issue.

The Human Touch in AI Chatbots: Balancing Automation and Personalization

Artificial intelligence (AI) is transforming how companies engage with customers. Businesses are increasingly expected to provide instant, accurate, and personalized responses across multiple channels, from websites and apps to social media platforms. AI chatbots have emerged as essential tools in meeting these expectations, enabling businesses to streamline communication, reduce response times, and provide consistent support around the clock.

Audio to Text: Enhancing Collaboration and Documentation for Distributed Tech Teams

In the age of cloud computing, DevOps, and distributed IT operations, remote technology teams are now the norm. Global teams bring exceptional talent but also face unique challenges: language barriers, time zone hurdles, incomplete documentation, and gaps in institutional knowledge. As organizations increasingly rely on virtual meetings and asynchronous communication, the demand for reliable audio-to-text solutions is surging.

Better integration tests in Cursor using proxymock

Cursor is fantastic at cranking out code changes. I recently used it to splice a brand-new downstream API call into one of our Go microservices, and the diff looked great. The unit tests finished before I lifted my coffee mug, yet I still had zero certainty the change would survive contact with real traffic. That gap is all about integration tests, so I paired Cursor with proxymock and the outerspace-go demo service to prove the behavior end to end.