Fear, Identity & Flaky Tests: AI in Reliability w/ Dana Lawson (CTO, Netlify)

The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore: they're already being built, and the biggest remaining barrier is cultural. Dana Lawson, CTO at Netlify, has spent over 25 years in the trenches of developer infrastructure, from sysadmin roots to running the platform that powers 5% of the internet.

The Incident You Never Had: Deterministic Simulations w/ Will Wilson (Antithesis CEO)

Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis, the autonomous testing platform that just raised $105M in a Series A led by Jane Street, Will has spent years building the infrastructure to catch failure modes before they ever reach production. His starting point is uncomfortable: the testing practices most teams rely on are structurally incapable of finding the bugs that cause real incidents.

Burnout Doesn't Ask Permission: Recognizing, Recovering, and Rebuilding w/ Stephen Townsend

Burnout doesn't announce itself. For Stephen Townsend, SRE team lead and host of the Slight Reliability podcast, it crept in over months of mounting pressure on a massive transformation program, then surfaced overnight as an inability to sleep. In this episode, Stephen shares his personal burnout story with rare honesty: the physical symptoms he dismissed, the org structure that left him without autonomy, and the full year it took to recover.