Latest Videos

How to fail with Serverless - Jeremy Daly - Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

Everything fails all the time. Knowing how to deal with these failures in serverless applications is essential to building resilient, highly available systems. In traditional monolithic applications, catching errors and handling retries is relatively straightforward. But as our systems become more distributed, we now have multiple (often asynchronous) components processing events from several sources, all with vastly different retry behaviors and failure mechanisms. Using old patterns can cause errors to get swallowed, creating brittle, unreliable systems that are difficult to debug and hard to maintain.


Slowdown is the New Outage - Marco Coulter - Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

While outage-driven news headlines can cause stock prices to plummet in the short term, reputation loss driven by poor performance is a slow burn that leads to longer-term customer loss. This session compares slowdowns with outages and explains why the result is a need for insight more than observability. By understanding these differences, you'll be ready to drive agile applications, gain funding for lowering technical debt, and focus on customer retention.


The Halo of Resilience Engineering - J. Paul Reed - Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

Recent world-impacting events have caused us all to have to rethink the way we go about our daily work; in this talk, we'll look at how some of the pillars of Resilience Engineering might help you and your team deal with the changes we're all being forced to confront.


Improving a Distributed System Post-Incident - Julius Zerwick - Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

In this session, we will dive into a case study of how a team can recover and improve a distributed system after a major incident. Because of their complexity and scale, distributed systems are more prone to failure than other systems, and incidents are a fact of life with them.
