Debugging multi-agent AI: When the failure is in the space between agents
I've been building a multi-agent research system. The idea is simple: give it a controversial technical topic like "Should we rewrite our Python backend in Rust?", and three agents work on it. An Advocate argues for it, a Skeptic argues against it, and a Synthesizer reads both briefs blind and produces a balanced analysis. Each agent has its own model, its own tools, and its own system prompt. It worked great in testing. Then I noticed the Synthesizer kept producing analyses that leaned heavily toward one side.
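To make the setup concrete, here's a minimal sketch of the pipeline. The names (`Agent`, `run_debate`, `call_model`) are hypothetical, not the real code, and the actual system also wires in per-agent tools and a real LLM client; the point is only the shape: two briefs written independently, then a blind synthesis step.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    model: str
    system_prompt: str
    # Stand-in for whatever LLM client the system actually uses:
    # (model, system_prompt, user_prompt) -> reply text
    call_model: Callable[[str, str, str], str]

    def run(self, user_prompt: str) -> str:
        return self.call_model(self.model, self.system_prompt, user_prompt)

def run_debate(topic: str, advocate: Agent, skeptic: Agent, synthesizer: Agent) -> str:
    # The two sides write their briefs independently; neither sees the other's output.
    brief_for = advocate.run(f"Argue in favor of: {topic}")
    brief_against = skeptic.run(f"Argue against: {topic}")

    # The Synthesizer reads both briefs "blind": plain text only, with no hint of
    # which agent, model, or stance produced which brief.
    return synthesizer.run(
        "Weigh the two anonymous briefs below and write a balanced analysis.\n\n"
        f"Brief A:\n{brief_for}\n\nBrief B:\n{brief_against}"
    )
```

Given that structure, the Synthesizer shouldn't know which brief came from which agent, which is what made its consistent lean so puzzling.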