Reliability is trust in your systems and your team
Reliability means being able to trust not only your systems, but also your team. Find out how to be reliable with Gremlin → https://www.gremlin.com/
Full transcript:
I think a big part of reliability is trust: trusting your systems to do what's expected, even when conditions are unexpected. Things like backing off from a dependency when it's overloaded, or scaling up to meet high demand and scaling back down to save money when demand drops. These are all things we trust our systems to do without us having to watch them 24/7.
There's a quote I really like that goes something like, "We don't rise to the level of our aspirations. We fall to the level of our training."
This is why reliability testing is so important, right? It goes beyond just sharpening your systems. It's also about making sure your team knows what to do in the face of failure: simple things like where to look for metrics, how to understand behavior patterns, when to expect monitors to go off, and how to correlate cause and effect. These are all things you can train for and learn to jump to more quickly.