How to test your systems for scalability and redundancy with Fault Injection

Gremlin

Apr 11, 2024

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Join us for the next one: https://www.gremlin.com/officehours

Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region?

Large-scale outages aren’t unheard of. When you’re running critical services, it’s vital that those services can keep running even if an availability zone (AZ) or region fails. In addition to failing over, these services also need to scale quickly so traffic shifts don’t overwhelm your systems. How do you prove that a service is both scalable and redundant? The answer is Fault Injection.

In this webinar, we’ll show you how to verify the scalability and redundancy of your systems by testing them directly. We’ll use Fault Injection to simulate large-scale failures, use observability tools to monitor the state of our systems, and discuss how to use our findings to make our systems more resilient.

You'll learn:

What is Fault Injection? Learn how simulating incidents is the first step towards resolving them.
How to run blackhole and shutdown experiments using Gremlin.
How to use observability to monitor your system's response, then use these insights to make reliability improvements.

About Gremlin Office Hours:
See Gremlin in action as one of our experts guides you through the platform in our monthly interactive session. You’ll be able to get your questions answered during the Q&A segment.

Check out previous Office Hours on-demand or sign up to join the next one at https://www.gremlin.com/officehours