How to make Netflix reliable: Address low-hanging fruit
Reliability doesn’t have to be fancy and dramatic. Kolton and his team dramatically improved Netflix reliability by focusing on low-hanging fruit. Find out how to be reliable at scale with Gremlin → https://www.gremlin.com/
FULL TRANSCRIPT:
During my first holiday peak at Netflix, my VP of engineering came to me and said, "Kolton, what do you think the chance is that we make it through the holiday peak without an outage?"
I thought about it for a minute and I said, "50/50."
Not horrible, but not great. Well, we did a lot of work over the next year. We put a lot of effort into testing our services and really understanding the failure modes, to build confidence that we knew how the system would behave when things went wrong.
And the next holiday peak, that same VP came to me and asked, "What's our confidence that our system is going to withstand an outage for this holiday peak?" My answer was, "90%, 95%." We had a great deal of confidence. And that allowed us to sleep well at night, knowing that if something went wrong, we'd be paged and we would deal with it, but we weren't gonna be woken up for something silly, because we'd addressed all the low-hanging fruit and we knew that things were, in general, running well.