
Observe your AI agents: End-to-end tracing with OpenLIT and Grafana Cloud

In another post in this series, we discussed how to instrument large language model (LLM) calls. That is a good starting point, but generative AI workloads increasingly rely on agents: systems that plan, call tools, reason, and act autonomously. Their non-deterministic behavior makes incidents harder to diagnose, in part because the same prompt can trigger different tool sequences and costs.

Monitor Model Context Protocol (MCP) servers with OpenLIT and Grafana Cloud

Large language models don’t work in a vacuum. They often rely on Model Context Protocol (MCP) servers to fetch additional context from external tools or data sources. MCP provides a standard way for AI agents to talk to tool servers, but this extra layer introduces complexity. Without visibility, an MCP server becomes a black box: you send a request and hope a tool answers. When something breaks, it’s hard to tell whether the agent, the server, or the downstream API failed.

Instrument zero-code observability for LLMs and agents on Kubernetes

Building AI services with large language models and agentic frameworks often means running complex microservices on Kubernetes. Observability is vital, but instrumenting every pod in a distributed system can quickly become a maintenance nightmare. OpenLIT Operator solves this problem by automatically injecting OpenTelemetry instrumentation into your AI workloads—no code changes or image rebuilds required.

How to migrate your paging tool without breaking your team

Most engineering teams don’t migrate their on-call and paging systems unless absolutely necessary. No matter how painful the current solution, it's one of those changes that people put off for as long as possible because the cost is real: the disruption, the retraining, the risk of missing a critical page during the transition. It's not something you do on a whim.

What is Kubernetes? Explained in 2 Minutes

What is Kubernetes, and how do companies like Netflix handle millions of users without crashing? In this quick guide, we break down Kubernetes in simple terms — from containers to pods, nodes, and the control plane — so you can understand how modern cloud applications stay reliable and scalable. Kubernetes acts like an air traffic controller for your apps, automatically managing where they run, restarting them if they fail, and balancing traffic across machines. Whether you're new to cloud computing or brushing up on DevOps basics, this video gives you a clear, beginner-friendly explanation.
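The "air traffic controller" idea from the video can be made concrete with a minimal Deployment manifest. This is an illustrative sketch (the app name and image are placeholders, not from the video): you declare a desired state, and the control plane continuously schedules pods onto nodes, restarts any that fail, and keeps the replica count steady.

```yaml
# Minimal Deployment: Kubernetes keeps 3 replicas of this pod running,
# restarting failed ones and spreading them across available nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web            # hypothetical app name
spec:
  replicas: 3          # desired state: three copies at all times
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.27   # any container image works here
        ports:
        - containerPort: 80
```

Applying this with `kubectl apply -f` hands the "where and how many" decisions to the control plane, which is the core of the reliability story described above.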

Margaret Hamilton Coined "Software Engineering" Because Code Deserves the Same Rigor as Bridges

During International Women’s Month, we celebrate women whose technical work changed entire industries. But the lessons from engineers like Margaret Hamilton aren’t seasonal; they’re fundamental to how we should approach software development every single day. Margaret coined the term “software engineering” and built the code that landed humans on the moon. Her approach to rigor is as relevant to your next Git commit as it was to Apollo 11’s descent engine.

Back to fundamentals: 7 insights from Kelsey Hightower at HAProxyConf

Early in his career, Kelsey Hightower made a bet. The load balancer his team was running was consuming too much memory, and he was convinced he knew the fix. He told his manager: “If it doesn’t work, fire me. But I think I can make it work.” The fix was HAProxy. It was a story he shared publicly for the first time at HAProxyConf 2025, where he delivered a keynote address, “The Fundamentals.”

Benchmarking Kubernetes Log Collectors: vlagent, Vector, Fluent Bit, OpenTelemetry Collector, and more

At VictoriaMetrics, we built vlagent as a high-performance log collector for VictoriaLogs. To validate its performance and correctness under a real production-like load, we developed a benchmark suite and ran it against 8 popular log collectors. This post covers the methodology, throughput results, resource usage, and delivery correctness for the collectors under test. We’ve made all benchmark configurations and source code public, so you can reproduce and verify the results independently.

Parallel Execution in Modern CI: Best Practices & Results | Harness Blog

Definition: Parallel execution in CI is the practice of running independent build, test, or deployment tasks concurrently to reduce feedback time, improve resource utilization, and control infrastructure costs. Developers often spend almost half their time waiting for builds that could be faster. Simply adding more resources is not enough. Real improvements come from planned parallelism, using concurrency together with test intelligence, caching, and strong governance.
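The core claim, that concurrency cuts feedback time toward the duration of the slowest independent task, can be sketched in a few lines of Python. This is a toy illustration, not Harness-specific: each "shard" is a stand-in for an independent test suite, and the shard names and timings are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_shard(name, seconds):
    """Stand-in for an independent test shard; sleeps to simulate work."""
    time.sleep(seconds)
    return name

# Hypothetical shards with their simulated runtimes in seconds.
shards = {"unit": 0.2, "integration": 0.3, "lint": 0.1}

start = time.monotonic()
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    # Run all shards concurrently; map preserves input order in results.
    results = list(pool.map(run_shard, shards.keys(), shards.values()))
elapsed = time.monotonic() - start

# Serial execution would take ~0.6s; parallel wall time approaches
# the slowest shard (~0.3s).
print(results, round(elapsed, 2))
```

The same principle applies whether the unit of parallelism is a test shard, a build stage, or a deployment target: total wall time is bounded by the longest dependency chain, which is why shard balancing and test intelligence matter alongside raw concurrency.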