Scaling AI Reliability: Real-World Lessons from Mistral AI
How does one of the world's leading AI companies keep its infrastructure reliable while shipping new models constantly? In this webinar, Devon Mizelle, Senior SRE at Mistral AI, shares the real story.
Devon walks through how Mistral built an automated system that generates synthetic checks for every model the moment it goes live—no manual configuration, no forgotten monitors, no inconsistent alerting. Using monitoring as code, his team eliminated the toil of maintaining hundreds of checks across a rapidly evolving model ecosystem.
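The pattern Devon describes can be sketched with the Checkly Terraform provider: iterate over a list of models and stamp out one synthetic check per model, so a new entry in the list automatically gets a monitor. The model names, endpoint URL, and check settings below are illustrative assumptions, not Mistral's actual configuration.

```hcl
# Hypothetical sketch of "checks as code": one synthetic API check
# per model, generated from a single variable. All names and the
# endpoint URL are placeholder assumptions for illustration.
variable "models" {
  type    = set(string)
  default = ["model-a", "model-b"] # adding a model here adds a check
}

resource "checkly_check" "model_health" {
  for_each  = var.models
  name      = "inference: ${each.key}"
  type      = "API"
  activated = true
  frequency = 5 # run every 5 minutes
  locations = ["eu-west-1"]

  request {
    url    = "https://api.example.com/v1/models/${each.key}"
    method = "GET"

    # Alert if the endpoint stops returning 200
    assertion {
      source     = "STATUS_CODE"
      comparison = "EQUALS"
      target     = "200"
    }
  }
}
```

Because the checks are derived from data rather than written by hand, there is no per-model setup step to forget, which is the "no manual configuration, no forgotten monitors" property described above.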
But this isn't just a technical deep dive. The conversation explores where observability is headed in the AI era: What happens when agents get paged before humans? How close are we to self-healing systems? And what does this mean for the future of SRE?
Featuring:
- Devon Mizelle, Senior Site Reliability Engineer at Mistral AI
- Sylvain Kalache, Head of AI Lab at Rootly
- Giovanni Rago, Head of Customer Solutions at Checkly
⏱️ Timestamps:
0:00 Introduction
2:58 Meet the speakers
5:39 How Mistral uses monitoring and incident management tools
8:39 The problem: what wasn't working before
12:54 The solution: infrastructure as code for monitoring
18:10 Walking through the Terraform implementation
23:30 Configuring alert routing automatically
28:14 Results: happier developers, fewer tools, less toil
33:10 The future of monitoring at scale with AI
36:09 Self-healing checks and automated remediation
43:02 AI SRE: when agents get paged first
50:11 Will AI replace incident management for SREs?
55:16 Q&A: Alert grouping, webhooks, and testing LLM outputs
1:05:37 Wrap-up