Automate your critical workflows with AI agents in 5 steps

Many teams remain bogged down by operational chaos and manual drudgery, even with access to a variety of automation solutions. These tools often operate in silos, creating disconnected islands of automation that require significant human effort to bridge. Agentic AI offers a path forward, creating a cohesive system that can intelligently and autonomously handle complex operational workflows.

Why Your Agentic AI Aspirations Need to Evolve from Models to a Workflow Data Fabric

Enterprise conversations today are dominated by one phrase: Agentic AI. Across boardrooms and innovation labs, organizations are experimenting with copilots, autonomous agents, and AI bots capable of resolving tickets, recommending actions, and orchestrating complex processes. The promise is real — AI that doesn't just generate insights, but takes meaningful action. Here's the uncomfortable truth: most enterprises are architecturally unprepared for the agentic future they're trying to build.

Understanding disaggregated GenAI model serving with llm-d

llm-d is an open source solution for managing high-scale, high-performance Large Language Model (LLM) deployments. LLMs are at the heart of generative AI – so when you chat with ChatGPT or Gemini, you’re talking to an LLM. Simple LLM deployments – where an LLM is deployed to a single server – can suffer from latency issues, even with just one user. The cause can be insufficient memory bandwidth on the server, or KV cache pressure on system memory.
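
To give a rough sense of what KV cache pressure means, here is a minimal back-of-the-envelope sketch. It uses the standard estimate for a transformer's KV cache (keys plus values, per layer, per KV head, per token). The model dimensions below are hypothetical and chosen only for illustration; they do not come from llm-d or from any particular model, and the function name is an assumption of this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes 16-bit (fp16/bf16) cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Hypothetical model shape (an assumption, not a real llm-d configuration):
# 32 layers, 8 KV heads, head dimension 128, 16-bit cache entries.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
full_context = kv_cache_bytes(32, 8, 128, seq_len=8192)

print(f"per token:    {per_token / 1024:.0f} KiB")
print(f"8192 tokens:  {full_context / 1024**3:.0f} GiB")
```

Under these assumed dimensions, a single 8192-token request already holds about 1 GiB of cache, which is why even one long conversation can strain a single server.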

SRE agent vs. traditional engineer: 7 key differences

The role of a Site Reliability Engineer (SRE) is evolving. The focus has shifted away from simply working harder during an outage. A new kind of teammate is here to help: the SRE agent. But what are the key differences when you compare an SRE agent with a traditional site reliability engineer? This isn’t just a superficial change. It signifies a fundamental shift in how teams build and sustain dependable services.

Live Runtime Investigation in Claude Code with Lightrun MCP

In this video, Lightrun’s Dan Putman demonstrates what happens when Lightrun MCP is integrated within Claude Code. See how, once activated, Claude can ask specific questions about which services it can see and instrument, then perform a deep investigation in production and reach a validated root cause analysis without the friction of redeploying or switching contexts.

Debug Live Production Apps in Codex with Lightrun MCP

Lightrun’s Dan Putman demonstrates the power of the latest Lightrun MCP skill. Watch how your AI code agent can now debug live applications directly in production. By connecting OpenAI's Codex to real-time runtime data via the Lightrun MCP, engineers can generate and validate hypotheses using live telemetry and snapshots, without breaking flow. Ready to bring runtime context to your AI agents?

90% AI Adoption. Still Failing. DORA Explains Why.

AI adoption is nearly universal. So why are most teams still struggling? In this session from GitKon, Nathen Harvey, head of DORA at Google Cloud, shares findings from the 2025 DORA State of AI-Assisted Software Development report, drawing on data from nearly 5,000 developers worldwide. The answer isn't more AI. It's what surrounds it.

That's Not a Job for an LLM: The Right Way to Apply AI to Network Operations

LLMs have sucked all the oxygen out of the AI conversation — but AI is much more than just LLMs, and network engineers have been using AI techniques (machine learning, statistics, fuzzy logic, expert systems, neural networks) for decades. So what should LLMs be doing in network operations, what shouldn't they be doing, and how do agentic AI architectures fit in?