Operations | Monitoring | ITSM | DevOps | Cloud

Debugging multi-agent AI: When the failure is in the space between agents

I've been building a multi-agent research system. The idea is simple: give it a controversial technical topic like "Should we rewrite our Python backend in Rust?", and three agents work on it. An Advocate argues for it, a Skeptic argues against, and a Synthesizer reads both briefs blind and produces a balanced analysis. Each agent has its own model, its own tools, its own system prompt. It worked great in testing. Then I noticed the Synthesizer kept producing analyses that leaned heavily toward one side.

The End of Manual Instrumentation: Scaling Observability with OTel OBI & Coralogix

Traditionally, achieving deep visibility into distributed systems required significant trade-offs in engineering time. Collecting meaningful application metrics and traces required teams to embed language-specific agents, modify source code, or manage complex library dependencies across every service.

VictoriaMetrics at KubeCon: Optimizing Tail Sampling in OpenTelemetry with Retroactive Sampling

Last month, the VictoriaMetrics team gave a talk on retroactive sampling at KubeCon Europe 2026. By writing this blog post, as a transcript of the session, we want to explain how retroactive sampling reduces outbound traffic, CPU, and memory usage in the data collection pipeline significantly compared to tail sampling in OpenTelemetry.

Smarter Alert Management: Test on Historical Data, Review Transitions, and Preview Silencing Schedules

Alert fatigue usually isn’t caused by one thing. It’s the accumulation of thresholds that are slightly too sensitive, alerts that fire during known maintenance windows, and historical patterns that nobody has the tools to review easily. Fixing it requires better visibility into how alerts actually behave over time, and a way to test changes before they hit production. We’ve shipped three improvements to alerting in Netdata that address different parts of this problem.

Faster code doesn't mean faster delivery

Software development has never moved this fast. JetBrains' 2026 AI Pulse Survey found that 90% of developers now use at least one AI tool at work. CircleCI's 2026 State of Software Delivery report, covering 28 million workflows across 22,000 organizations, found that daily CI workflow runs jumped 59% year over year, the largest single increase they've ever recorded. In that same period, CI success rates dropped to a five-year low.

Deployed Is Not the Same as Ready: How Mature Is Your Kubernetes Environment?

Kubernetes adoption is no longer the challenge it once was. More than 82% of enterprises run containers in production, most of them on multiple Kubernetes clusters. Adoption, however, does not mean operational maturity. These are two very different things. It is one thing to deploy workloads to a cluster or two and quite another to do it securely, efficiently and at scale. This distinction matters because the gap between adoption and Kubernetes operational maturity is where risk accumulates.

PagerDuty Invests in the AI-First Operations and Resilience of Healthcare and Crisis Response Organizations

At PagerDuty, we believe operational excellence and social impact are inseparable. As AI rapidly transforms how nonprofits operate, our AI and agentic technology empower mission-driven teams to automate complexity and focus their limited resources on what matters most: delivering reliable services that create meaningful impact at scale.

CircleCI is now available as a Codex plugin

CircleCI is part of the latest wave of Codex plugin integrations, joining the directory alongside other popular development tools like Vercel, Cloudflare, Figma, Notion, Sentry, Hugging Face, Linear, and more. If you’re using Codex, you already know that writing code is rarely the hardest part of your job. It’s the delays, interruptions, and context switching that start when that code breaks on its way to production. The CircleCI Codex plugin closes that gap.

The Edwin AI Agent Orchestrator: Coordinated Incident Investigation Across the Tools You Already Use

Edwin AI’s Agent Orchestrator keeps incident investigation, context, and response aligned as work moves across tools, eliminating the manual handoffs that slow resolution. Every major incident has two timelines running in parallel. The first is the incident itself—services degrading, users affected, business impact accumulating. The second is quieter and just as costly: engineers switching tabs, re-explaining context to new responders, moving notes from one tool to another by hand.