How we keep Gremlin at five 9s

Sep 17, 2025

Kolton Andrus, Founder and CEO of Gremlin, walks you through how we keep Gremlin at five 9s availability. Find out how to be reliable with Gremlin → https://www.gremlin.com/

FULL TRANSCRIPT:

At Gremlin, we kind of sucked at this five years ago. We had the tooling, we had the expertise, we had the intent, and we still struggled to get people to just run the tests on a regular basis. It was a bit spotty. It was a bit one-off. Over the last couple of years, I've focused a lot of my time and effort on product and engineering, and when we built this new Reliability Management and these reliability scores, I did a few things:

First of all, we have an on-call rotation at Gremlin. Every engineer takes a turn. Every engineer runs all the reliability tests the week they're on call. So we just made it clear this is every person on the team's responsibility. We're not pushing this off to one person.

The next thing I did is, in our product operations meetings and our engineering staff meetings, I pulled up the dashboard and I pulled up the score and I looked at it. And I asked questions about it and I said, "This is important to me." And the team actually started comparing the diffs of the scores in the on-call handoff so they could explain why they had changed.

Well, fast forward a couple of years, and every person on my team has run all the tests.

Almost every test passes, and our scores are in the upper nineties. I'll brag for a minute. We've been at four to five nines for the lifetime of the company. We know what we're doing here and we've practiced what we've preached, but the last couple of years have been exceptionally solid because of all the work we've done.

So it's hard to do, but if you bring in that accountability and those incentives, and you show your team that it's important, you can drive the right behavior.