Chaos Engineering: Be thoughtful about failure

Nov 18, 2025

Chaos Engineering isn’t about breaking things. It’s about being thoughtful about failure so that you learn something. Gremlin founder and CEO Kolton Andrus lays it out in this clip from his conversation with Stephen Townshend on the Slight Reliability podcast!

FULL TRANSCRIPT:

The purpose is not just to see if we can break things for the sake of breaking things; that could probably be done fairly easily. The purpose is to be thoughtful about the failure and learn something, or to validate a resilience mechanism.

And so, yeah, when you're thinking about this single host, what you should do is some failure mode analysis: "Hey, what do I expect to have happen?" And so this is another part of what I think is really the discipline that gets missed when we talk about just Chaos Engineering. And that's: "Hey, we wanna sit down. We want to plan out these experiments. We want to have a hypothesis: here's what we think will happen. We're gonna have ways to measure it. We might wanna have a fallback plan, if things go wrong, for how we're gonna clean it up and restore back to steady state. We probably want to notify some people if we're running in a shared environment, so it's not a total surprise."

And then, yeah, if we do the analysis and we say, "Hey, we're testing the redundancy of our server, and this application only runs on one server." Boom. You don't need to run the test. You know you're at risk. Go fix that. Go get it running on two or three hosts and then bring one of 'em down. Then you're validating this resilience mechanism you put in place.

But there's no need to put the system at risk if you already know the answer.
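
The planning discipline described in the transcript (hypothesis, measurement, fallback plan, notification) and the "don't run it if you already know the answer" check can be sketched in a few lines of Python. This is only an illustration, not Gremlin's API or tooling: the `ExperimentPlan` class, its field names, and the `should_run_host_failure_test` function are all hypothetical names invented for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExperimentPlan:
    """Hypothetical record of the planning steps named in the transcript."""
    target: str                 # what is being tested, e.g. "server redundancy"
    hypothesis: str             # what we expect to happen
    metrics: list[str]          # how we will measure the outcome
    rollback: str               # how we clean up and restore steady state
    notify: list[str] = field(default_factory=list)  # who to warn in a shared environment


def should_run_host_failure_test(plan: ExperimentPlan, host_count: int) -> tuple[bool, str]:
    """Decide whether taking one host down would teach us anything."""
    if host_count < 2:
        # The answer is already known: losing the only host is an outage.
        return False, "Single host: known risk. Add redundancy before testing."
    # Refuse to run without a hypothesis, a way to measure, and a fallback plan.
    missing = [name for name in ("hypothesis", "metrics", "rollback")
               if not getattr(plan, name)]
    if missing:
        return False, "Plan incomplete, missing: " + ", ".join(missing)
    return True, "Run it: take one host down and validate the redundancy."


if __name__ == "__main__":
    plan = ExperimentPlan(
        target="server redundancy",
        hypothesis="Traffic shifts to the remaining hosts with no user-facing errors",
        metrics=["error rate", "p99 latency"],
        rollback="Restart the stopped host and confirm it rejoins the pool",
        notify=["on-call engineer"],
    )
    for hosts in (1, 3):
        print(hosts, should_run_host_failure_test(plan, hosts))
```

With one host the function declines to run the experiment, mirroring the point that there is no need to put the system at risk when the analysis already gives the answer. With three hosts and a complete plan it gives the go-ahead.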