
Gremlin ALFI Demo - AWS RDS Unavailable - Chaos Engineering

In this demo, we'll share how you can use ALFI (Application Level Failure Injection) to make AWS RDS unavailable. This enables you to learn how your application handles different failure modes. We'll be using the ALFI Latency attack to perform this Chaos Engineering experiment.
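To make the idea concrete, here is a minimal sketch of application-level latency injection. It does not use Gremlin's ALFI library or its API. Every name in it (`with_injected_latency`, `DependencyTimeout`, `fetch_user`, `get_user_with_fallback`) is hypothetical, and `fetch_user` merely stands in for a real database query. The assumption is that a database call delayed past the caller's time budget looks the same to the application as an unavailable database.

```python
import time


class DependencyTimeout(Exception):
    """Raised when a wrapped call would exceed its time budget."""


def with_injected_latency(func, delay_s, timeout_s, sleep=time.sleep):
    """Wrap func so every call is preceded by an artificial delay.

    If the injected delay reaches the caller's timeout, the wrapper waits
    for the timeout and then raises, simulating an unreachable dependency.
    """
    def wrapper(*args, **kwargs):
        if delay_s >= timeout_s:
            sleep(timeout_s)  # the caller gives up after its time budget
            raise DependencyTimeout(f"no response within {timeout_s}s")
        sleep(delay_s)  # slow, but still within budget
        return func(*args, **kwargs)
    return wrapper


def fetch_user(user_id):
    # Hypothetical stand-in for a query against the database.
    return {"id": user_id, "name": "example"}


def get_user_with_fallback(query, user_id):
    """Application code under test: degrade gracefully instead of crashing."""
    try:
        return query(user_id)
    except DependencyTimeout:
        return {"id": user_id, "name": None, "degraded": True}
```

Wrapping `fetch_user` with a `delay_s` larger than `timeout_s` and passing the result to `get_user_with_fallback` exercises the fallback path, which is the behavior an experiment like this is meant to reveal. The `sleep` parameter exists only so the delay can be replaced with a no-op in tests.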

Announcing the Gremlin Chaos Engineering Practitioner Certificate Program

Get started with Gremlin's Chaos Engineering tools to safely, securely, and simply inject failure into your systems to find weaknesses before they cause customer-facing issues. Chaos Engineering continues to grow in popularity and is rapidly becoming a job requirement. To help Engineering and Testing teams meet the need, we’re launching our first-ever Gremlin Chaos Engineering Practitioner Certificate Program!

Podcast: Break Things on Purpose | Jose Nino, Staff Software Engineer at Lyft

Break Things on Purpose is a podcast for all things Chaos Engineering. Check out our latest episode below. You can subscribe to Break Things on Purpose wherever you get your podcasts. If you have feedback about the show, find us on Twitter at @BTOPpod or shoot us a note at podcast@gremlin.com!

Failover Conf follow-up: Your team and culture questions answered!

Thank you all for joining us last week for Failover Conf 2! We had a great turnout this year, with over 1,800 participants, 20 sponsors, and 9 amazing sessions. After more than a year of virtual events and video calls, we know that Zoom fatigue is real. We tried to make this event different by finding new ways to bring the community together and thinking of fun new ways to shake up the conference formula.

Fireside Chat with Jesse Robbins and Kolton Andrus | Failover Conf 2021

Long before Chaos Engineering was even a phrase, Jesse Robbins was Amazon.com's "Master of Disaster," using intentional failure to help the company become more reliable. Kolton Andrus, CEO at Gremlin, sits down with Jesse to learn more about his early work with GameDays, the evolution of reliability, and where the future of SRE lies.

Fireside Chat with Inés Sombra and Ana Medina | Failover Conf 2021

Reliability is a requirement for the modern internet. Ana Medina joins Inés Sombra, Sr. Director of Engineering at Fastly, to discuss their approach to resilience, how the past year has influenced the way they work, and what practices your engineering organization can adopt to become more reliable.

Leaving the Nest: Guidelines, guardrails, and human error by Laura Santamaria | Failover Conf 2021

When we talk about reliable systems, we talk a lot about human error. Human error in an incident or a bug report is often treated with a bit of a facepalm reaction. The term masks a lot of scenarios, from accidents to exhaustion and everything in between. However, human error helps us understand where our processes failed and how we can prevent the same error from happening again. In short, we need to think in terms of a framework of guidelines and guardrails. In this short talk, let’s discuss how guidelines like runbooks and guardrails like automation can help us address the fact that everyone will, at some point, make mistakes.