Software failure will occur, but we can be ready

Dec 12, 2025

Failures will occur, but reliability testing helps us understand them instead of being surprised. Gremlin founder and CEO Kolton Andrus sat down with Stephen Townshend on the Slight Reliability podcast to talk about how!

FULL TRANSCRIPT:

STEPHEN-
In my career, I've worked for a lot of organizations where the thought of experimenting, especially in production, is terrifying. And the mindset is, shouldn't we just stop all bad things from happening?

Why would we voluntarily let something bad happen? Now I understand why it makes so much sense to experiment with things going wrong, because it's the only way we can practice, build this kinda muscle memory, and see the unpredictable ways in which things could go wrong. But have you got any advice on how you might change the mindset of someone in an organization like that, or someone who's terrified and says, "Oh no, why? Why would we do that? That's scary"?

KOLTON-
It's a bit like saying, "Hey, we shouldn't crash test cars because we really don't want cars to crash." And the truth is cars are gonna crash whether we like it or not, so why don't we invest the effort to make it as safe as possible when it does occur?

I think that's one of the things I learned early on in my engineering career. The chance of an individual component failing is relatively low, but when you look at all the components of a system, the chance of something failing becomes quite likely. And so something that could only happen once in ten years happens every day in a large enough data set. And so failure's a constant; it's gonna occur.

The question is, do we want to understand it and mitigate it where possible, or do we want to be surprised by it?
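Editor's note: the arithmetic behind Kolton's "once in ten years happens every day" point can be sketched in a few lines. This is a minimal illustration, not something from the conversation: it assumes components fail independently, that "once in ten years" means a 1-in-3650 chance on any given day, and the function name and fleet sizes are hypothetical.

```python
def prob_any_failure(p_single: float, n_components: int) -> float:
    """Probability that at least one of n independent components fails.

    Assumes every component fails independently with the same
    probability p_single over the period of interest.
    """
    # P(no failures) = (1 - p)^n, so P(at least one) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p_single) ** n_components


# Assumption: "once in ten years" is roughly a 1-in-3650 chance per day.
P_DAILY = 1.0 / 3650.0

if __name__ == "__main__":
    # Hypothetical fleet sizes, chosen only to show the trend.
    for n in (1, 100, 1_000, 10_000):
        chance = prob_any_failure(P_DAILY, n)
        print(f"{n:>6} components: {chance:.1%} chance of a failure today")
```

Under these assumptions a single component almost never fails on a given day, while a fleet of ten thousand has a better than 90% chance of at least one failure. Real components are rarely fully independent, so treat the numbers as an illustration of the trend rather than a prediction.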