How Nagarro used Gremlin to prevent a cascading failure outage

May 28, 2024

Check out how Nagarro used Gremlin to help a client prevent a cascading failure before it caused an outage.

"Once we had tested a critical software that was doing millions of online transactions on a daily basis. The design was fail safe, providing redundancy on critical services by having multiple instances deployed on different VMs.

What we did was we ran a virtual machine terminate test to bring down an instance of that service with the hypothesis that it will recover automatically. Well, the service did recover automatically, but the system saw a cascading failure.

Now, what we learned is that the other services hosted on that same virtual machine didn't have enough redundancy to handle the transaction volume. We were proactively able to find a critical fault.

We avoided a production incident that could have potentially led to a bad experience and losses."
—Dushyant Sahni, Nagarro
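
The reasoning behind this finding can be sketched as a simple capacity check: when a VM is terminated, every service instance on it disappears at once, so each service needs enough capacity on the surviving VMs to carry its load. The sketch below is only an illustration, not Nagarro's tooling or Gremlin's API. The service names, VM names, and throughput figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    service: str       # hypothetical service name
    vm: str            # hypothetical VM hosting this instance
    capacity_tps: int  # assumed max transactions per second for this instance


def services_at_risk(instances, demand_tps, failed_vm):
    """Return the services whose surviving capacity falls below demand
    when every instance on failed_vm is terminated."""
    surviving = {}
    for inst in instances:
        surviving.setdefault(inst.service, 0)
        if inst.vm != failed_vm:
            surviving[inst.service] += inst.capacity_tps
    return sorted(s for s, cap in surviving.items() if cap < demand_tps[s])


# Hypothetical layout: two services, each with instances on two VMs.
instances = [
    Instance("payments", "vm-a", 500),
    Instance("payments", "vm-b", 500),
    Instance("ledger", "vm-a", 400),
    Instance("ledger", "vm-b", 200),
]
demand_tps = {"payments": 450, "ledger": 350}

# "payments" survives the loss of vm-a, but the co-hosted "ledger" does not.
print(services_at_risk(instances, demand_tps, failed_vm="vm-a"))
```

In this made-up layout the service under test recovers, while a neighbor on the same VM is left short of capacity. That is the kind of gap a VM termination experiment can surface before it shows up in production.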