When Disaster Strikes: Ensuring Your DRP Actually Works

Black swan events are inherently unpredictable, and you can’t prepare for every possible threat. Instead, identify the ways your systems can fail and develop strategies to restore them to full service when those failures happen. But a disaster recovery plan (DRP) can’t be relied on until it has been proven to work. Chaos Engineering lets you test your DRP much more safely and predictably than you otherwise could.

SRE's Guide to Chaos & Observability

Today’s distributed, cloud-based environments are incredibly complex. Not only does each component depend on many others, but modern systems are also highly dynamic—changing frequently as teams push new code or make updates to infrastructure. Taming this complexity to ensure reliability requires end-to-end observability to understand how components depend on each other. Additionally, proactive Chaos Engineering combined with AI-driven observability lets you uncover “unknown unknowns” that impact how your system will respond to different failure scenarios.

Building Reliable Applications Webinar - 6/17/21

Test-driven development (TDD) is a process that ensures quality in the applications we develop while guarding against feature creep/skew. But as our applications have become increasingly complex, traditional testing methods are not enough. Traditional testing only evaluates what we know, but complex systems often fail due to unknowns—the things that are almost impossible to test because we are unaware of them. Chaos Engineering is the exception that allows us to test for what we don’t know.

Gremlin ALFI Demo - AWS RDS Unavailable - Chaos Engineering

In this demo, we'll show how you can use ALFI (Application Level Failure Injection) to make AWS RDS unavailable to your application, so you can learn how the application handles different failure modes. We'll be using the ALFI Latency attack to perform this Chaos Engineering experiment.
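The idea behind an application-level latency experiment can be sketched in a few lines of Python. This is a minimal illustration only, not Gremlin's ALFI API: `LatencyInjector`, `query_database`, and `get_user` are hypothetical names, and the database call is a placeholder rather than a real RDS query. The sketch delays the call from inside the application and checks whether the caller's timeout and fallback path actually engage.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class LatencyInjector:
    """Hypothetical stand-in for an application-level latency attack."""

    def __init__(self, delay_seconds=0.0, enabled=False):
        self.delay_seconds = delay_seconds
        self.enabled = enabled

    def wrap(self, func):
        """Return a version of func that sleeps first while the attack is enabled."""
        def wrapped(*args, **kwargs):
            if self.enabled:
                time.sleep(self.delay_seconds)
            return func(*args, **kwargs)
        return wrapped


def query_database(user_id):
    # Placeholder for a real database query (assumption: returns a dict).
    return {"user_id": user_id, "source": "database"}


def get_user(user_id, db_call, timeout_seconds=0.2):
    """Call the database with a timeout; serve a fallback if it is too slow."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(db_call, user_id)
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return {"user_id": user_id, "source": "fallback"}
    finally:
        # Do not block the caller on the slow call still running in the worker.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    injector = LatencyInjector(delay_seconds=0.5, enabled=True)
    slow_db = injector.wrap(query_database)
    print(get_user(42, slow_db))
```

With the injector enabled and a delay longer than the timeout, the database looks unavailable to the caller, which is the failure mode the experiment is meant to expose. If `get_user` had no timeout or fallback, the same experiment would surface that gap as a hung request.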

Fireside Chat with Jesse Robbins and Kolton Andrus - Failover Conf 2021

Long before Chaos Engineering was even a phrase, Jesse Robbins was Amazon.com's "Master of Disaster," using intentional failure to help the company become more reliable. Kolton Andrus (CEO at Gremlin) sits down with Jesse to learn more about his early work with GameDays, the evolution of reliability, and where the future of SRE lies.