Reliability is different for every business
Anish Behanan from @Capgemini talks about how reliability is different for every business, and you have to know what it means for your organization. Find out how to be more reliable with Gremlin → https://www.gremlin.com/
Full transcript:
You need to actually define what good looks like. It is not just about ensuring that there is 99% uptime.
What are the different factors that lead to that 99% uptime? All of them contribute to actually being reliable. That is important for us to define, and every business is different.
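To make the 99% figure concrete, here is a minimal Python sketch that converts an uptime target into the downtime it permits. The function name `allowed_downtime_hours` and the 30-day default period are assumptions chosen for illustration; they do not come from the talk.

```python
def allowed_downtime_hours(uptime_percent: float, period_days: float = 30) -> float:
    """Return the hours of downtime a given uptime target permits in a period.

    Hypothetical helper for illustration: the 30-day default is an assumption.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_hours = period_days * 24
    return total_hours * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Compare a few common targets over a 30-day period.
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime")
```

At 99% over 30 days the budget is roughly 7.2 hours. Whether that is acceptable depends on the business, which is the point being made here.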
A reliability factor for a bank could be different from that of an insurance company. So I think it is important to define what reliability is for your business and then work towards achieving that reliability quotient.
Say there is a data center with a thousand servers. Half of them get affected by a power outage and half of them get affected by, say, a cyber attack, right?
What is the time it takes for all of the servers to be back up and running as they were previously? We emulate, or simulate, these various failure scenarios to measure reliability: how long does a server take to come back up? How long does the observability platform take to pick it up? And how long does the observability platform take to translate the whole process into auto recovery?
All of these are measures that we can define so that we actually achieve the right reliability quotient.
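The measures described above can be sketched as timestamps recorded during a simulated failure. This is only an illustration under stated assumptions: the `DrillTimeline` class, its field names, and the example timestamps are all hypothetical and are not taken from the talk or from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillTimeline:
    """Timestamps from one simulated failure (hypothetical structure)."""

    failure_injected: datetime    # when the failure scenario started
    detected: datetime            # when the observability platform picked it up
    recovery_triggered: datetime  # when auto recovery was kicked off
    servers_restored: datetime    # when servers were back up and running

    def time_to_detect(self) -> timedelta:
        return self.detected - self.failure_injected

    def time_to_trigger_recovery(self) -> timedelta:
        return self.recovery_triggered - self.detected

    def time_to_restore(self) -> timedelta:
        return self.servers_restored - self.failure_injected


# Made-up example values, for illustration only.
start = datetime(2024, 1, 1, 12, 0, 0)
drill = DrillTimeline(
    failure_injected=start,
    detected=start + timedelta(seconds=45),
    recovery_triggered=start + timedelta(minutes=2),
    servers_restored=start + timedelta(minutes=10),
)

if __name__ == "__main__":
    print("Time to detect:", drill.time_to_detect())
    print("Detection to auto recovery:", drill.time_to_trigger_recovery())
    print("Time to restore:", drill.time_to_restore())
```

Tracking each interval separately shows where the recovery time goes: in detection, in triggering auto recovery, or in the servers coming back up.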