Scaling AI Reliability: Real-world lessons from Mistral AI

Jan 28, 2026

How does one of the world's leading AI companies keep its infrastructure reliable while shipping new models constantly? In this webinar, Devon Mizelle, Senior SRE at Mistral AI, shares the real story.

Devon walks through how Mistral built an automated system that generates synthetic checks for every model the moment it goes live—no manual configuration, no forgotten monitors, no inconsistent alerting. Using monitoring as code, his team eliminated the toil of maintaining hundreds of checks across a rapidly evolving model ecosystem.

But this isn't just a technical deep dive. The conversation explores where observability is headed in the AI era: What happens when agents get paged before humans? How close are we to self-healing systems? And what does this mean for the future of SRE?

Featuring:

  • Devon Mizelle, Senior Site Reliability Engineer at Mistral AI
  • Sylvain Kalache, Head of AI Lab at Rootly
  • Giovanni Rago, Head of Customer Solutions at Checkly

⏱️ Timestamps:

0:00 Introduction

2:58 Meet the speakers

5:39 How Mistral uses monitoring and incident management tools

8:39 The problem: what wasn't working before

12:54 The solution: infrastructure as code for monitoring

18:10 Walking through the Terraform implementation

23:30 Configuring alert routing automatically

28:14 Results: happier developers, fewer tools, less toil

33:10 The future of monitoring at scale with AI

36:09 Self-healing checks and automated remediation

43:02 AI SRE: when agents get paged first

50:11 Will AI replace incident management for SREs?

55:16 Q&A: Alert grouping, webhooks, and testing LLM outputs

1:05:37 Wrap-up