Operations | Monitoring | ITSM | DevOps | Cloud

In the Age of AI, Taste Isn't About Aesthetics

AI can generate a UI in seconds. So what do designers actually bring to the table? Marcela, Principal Product Designer at Rootly and former Founding Designer at Ramp, has spent 20 years in design. Her answer: taste isn't about aesthetics or crafting pleasant interactions. It's about asking the uncomfortable questions, and choosing the right problem, not the easiest one.

AI Didn't Change the Game, It Just Exposed Your Bottlenecks w/ Ganesh Datta (CTO, Cortex)

Every engineering org says they want to improve reliability — but most can't even agree on what "good" looks like. Ganesh Datta, Co-Founder and CTO of Cortex, has spent the better part of a decade helping companies confront that gap.

Fear, Identity & Flaky Tests: AI in Reliability w/ Dana Lawson (CTO, Netlify)

The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore — they're already being built, and the biggest barrier left is cultural. Dana Lawson, CTO at Netlify, has spent over 25 years in the trenches of developer infrastructure, from sysadmin roots to running the platform that powers 5% of the internet.

The Incident You Never Had: Deterministic Simulations w/ Will Wilson (Antithesis CEO)

Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis, the autonomous testing platform that just raised $105M in a Series A led by Jane Street, Will has spent years building the infrastructure to catch failure modes before they ever reach production. His starting point is uncomfortable: the testing practices most teams rely on are structurally incapable of finding the bugs that cause real incidents.

Burnout Doesn't Ask Permission: Recognizing, Recovering, and Rebuilding w/ Stephen Townsend

Burnout doesn't announce itself. For Stephen Townsend, SRE team lead and host of the Slight Reliability podcast, it crept in over months of mounting pressure on a massive transformation program, and announced itself overnight with an inability to sleep. In this episode, Stephen shares his personal burnout story with rare honesty: the physical symptoms he dismissed, the org structure that left him without autonomy, and the full year it took to recover.

Code Is Cheap, Reliability Isn't: Owning Production in the AI era w/ Swizec Teller

In this episode, Swizec Teller, author of the bestselling Scaling Fast, makes a bold claim: code is cheap, reliability is not. As AI coding tools accelerate feature development, the real competitive advantage shifts to operating systems reliably in production. We explore the hidden complexity of SRE work, the addictive nature of agentic coding, and why ownership — not automation — remains at the core of modern software engineering.

Democratizing Reliability: Giving Non-Engineers Real Operational Power with Dileshni Jayasinghe

Many companies don’t invest in incident management until something goes wrong. commonsku took a different path. In this episode of Humans of Reliability, Sylvain sits down with Dileshni Jayasingha, VP of Technology at commonsku, to talk about what it really takes to introduce incident management in a mature, profitable SaaS that had never formalized it. From rolling out observability and incident tooling to practicing internal status updates before going public, Dileshni shares how her team built the right muscles before they were forced to.

99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Not Diamond

Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated at a pace traditional software wouldn’t. All while teams are still expected to hit reliability targets that look a lot like traditional SLAs.