Improving reliability starts with the 10 most common failures
Failures will occur, but reliability testing helps us understand them instead of being surprised. Gremlin founder and CEO Kolton Andrus sat down with Stephen Townshend on the Slight Reliability podcast to talk about how.
FULL TRANSCRIPT:
There's the same 10 things that go wrong in computers. Do I run outta CPU? Do I run outta memory? Do I run outta disk? How much IOPS capacity do I have? Time matters, certificates care about time, so what happens if time changes or we lose track of the NTP server?
And the network, the classic adage: the network is not reliable. The truth is we've all built these distributed systems. Everything has to function over the network and everything relies on a dependency on another server. And so the next thing we test is not just how do I handle losing the network, but how do I handle losing things over the network?
Okay. I want to test what happens when AWS S3 has an outage. Do I call up Amazon and ask them if they could take S3 down for an hour?
Nope, they're not gonna do it. That doesn't work. And that's not necessarily what the failure would look like.
So what they should focus on is what's in their control. So the way we would do that is we would drop all network traffic to S3 selectively. From our application's perspective, S3 just disappeared. That's what's gonna happen in the real world. What happens? How do we respond? How do we handle it?
And once we run that experiment, we can go learn, "Hey, are we comfortable with that? Did we find anything to fix?" And if things are going wrong, what we do is we just stop impacting the network, and now we've got this great rollback switch instead of waiting for some other server to come up or come down, or some other change to be made.
We just restore the network traffic. We're back up and running, and we've been able to mitigate the risk of the experiment quite quickly.
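To make the idea concrete, here is a minimal Python sketch of the experiment described above: make a dependency "disappear" from the application's perspective, watch how the code responds, then roll back by restoring traffic. It is an illustration only, not how Gremlin implements it. Real tooling drops packets at the network layer, whereas this sketch merely intercepts `socket.create_connection` inside one process. The names `blackhole` and `fetch_with_fallback` are hypothetical, and `s3.amazonaws.com` stands in for whatever endpoint your application actually calls.

```python
import socket
from contextlib import contextmanager


@contextmanager
def blackhole(blocked_suffixes):
    """Simulate losing a dependency: connections to matching hosts fail.

    Assumption: this only affects code in this process that looks up
    socket.create_connection at call time. It is not a real packet drop.
    """
    original = socket.create_connection

    def guarded(address, *args, **kwargs):
        host = address[0]
        if any(host.endswith(suffix) for suffix in blocked_suffixes):
            # From the caller's perspective the dependency just disappeared.
            raise TimeoutError(f"simulated blackhole: {host}")
        return original(address, *args, **kwargs)

    socket.create_connection = guarded
    try:
        yield
    finally:
        # The rollback switch: restore normal traffic immediately.
        socket.create_connection = original


def fetch_with_fallback(host, port=443):
    """Hypothetical application call that degrades instead of crashing."""
    try:
        conn = socket.create_connection((host, port), timeout=2)
        conn.close()
        return "ok"
    except OSError:
        # TimeoutError is a subclass of OSError, so the blackhole lands here.
        return "fallback"


if __name__ == "__main__":
    with blackhole([".amazonaws.com"]):
        print(fetch_with_fallback("s3.amazonaws.com"))
```

Leaving the `with` block is the equivalent of "we just restore the network traffic": nothing else has to come up or come down for the experiment to end, even if the code under test raises.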