Change engineering culture with Chaos Engineering
How do you spur an engineering cultural shift with Chaos Engineering? Gremlin founder and CEO Kolton Andrus explains how—and how it changed the Gremlin platform. Find out more → https://www.gremlin.com/
FULL TRANSCRIPT:
Chaos Engineering's a great technique, just like crash testing cars or vaccinating patients. It's good for us to go inject failure in order to build immunity, understand how a system behaves, and improve it to make it more reliable.
But what Chaos Engineering doesn't help us accomplish is the social side of the equation. How do we go get the people to do the right things? How do we know that the engineers have the right tools? How do we give them the feedback to let them know that they're making progress and improving? How do we help leadership have visibility into the great work that's being done?
Well, this is one of the lessons we've learned at Gremlin. In the beginning, we wanted to build the best Chaos Engineering tool and platform. You know, something that any expert-level user could use to do anything they could dream of. But what we learned is that we really needed to build a tool for your average engineer.
We needed to go build in a lot of defaults, a lot of good recommendations, a lot of things that help you analyze and understand the results, because people are busy. They don't necessarily have the time to become experts in every area. And leadership often cares, and they know it's important, but they're also tracking many things.
And so if we can't simplify the data they're collecting and the way in which they're able to process it, then a lot of the good work gets missed. A lot of people don't get promoted because leadership isn't able to recognize that we've ultimately made the system much more reliable. All they see is that they're not hearing about outages on a daily basis.
And so we want to replace that reactive, negative component of outages with a positive, measured approach to how we baseline and improve the system over time. And that's really reliability management in a nutshell.