Can you actually trust an AI agent? In this pre-recorded episode of The Context Window, Nicole van der Hoeven sits down with Yas Ekinci, an engineer on the Grafana AI team, to talk about evals: how Grafana measures the quality and reliability of the AI it ships. They get into the difference between online and offline evals, why reviewing AI-generated code has become the real bottleneck, the "final answer problem" of plausible-but-wrong outputs, and o11y-bench, Grafana's open benchmark for observability agents.