
Lead Times and Psychological Safety within the Five Ideals - Gene Kim

The biggest challenges engineering organizations face are not technical. They’re fundamental problems with how we think and go about doing work, and the environments that we work in. In this talk, Gene Kim will share the Five Ideals and how they relate to Chaos Engineering. He’ll also show how the Five Ideals help build stronger, better performing, and ultimately more reliable companies.

How many 9's are enough? Kolton Andrus  CTO Connection: Reducing engineering cycle time

How many nines of availability are enough? In this talk, Gremlin CEO Kolton Andrus shares insights from years at Amazon, Netflix, and now working with a wide array of customers across various disciplines and industries. He’ll describe what each level of availability looks like, the challenges faced at each stage, and the trade-offs required to achieve the next nine of uptime.
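As a rough illustration of what each additional nine means, the sketch below converts a number of nines into an allowed downtime budget per year. It is not taken from the talk; the function name `allowed_downtime_minutes` and the 365-day year are assumptions made for this example.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines.

    Two nines means 99% availability, three nines 99.9%, and so on,
    so the unavailable fraction is 1 / 10**nines.
    """
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    return MINUTES_PER_YEAR / 10 ** nines


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, two nines allows about 3.65 days of downtime per year, three nines about 8.76 hours, four nines about 52.6 minutes, and five nines about 5.26 minutes. Each extra nine cuts the budget by a factor of ten.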

Improving a Distributed System Post-Incident Julius Zerwick Failover Conf 2020

In this session, we will dive into a case study of how a team can recover & improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems due to their incredible complexity and scale, and incidents are a fact of life with these systems.

Built-in Application Resiliency Allan Shone  Failover Conf 2020

When starting a new application build, an early focus on resiliency prevents headaches down the line. There are many ways to tackle this, especially across different language environments and system ecosystems, but many techniques are shared across them all. This talk dives into those shared techniques and gives viewers a high-level takeaway list to use as a reference later, so they can develop software that is more fault-tolerant and able to withstand the impact of failures.

Pitfalls in Measuring SLOs  Danyel Fisher & Liz Fong-Jones  Failover Conf 2020

We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process of doing so, we learned that there were a number of important aspects that we didn’t expect from carefully reading the SRE workbook. This talk is the story of the missing pieces, unexpected pitfalls, and how we solved those problems. We’d like to share what we learned and how we iterated on our SLO adventure.
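For readers new to SLOs, a minimal sketch of an event-based error budget calculation may help. It is a generic illustration, not the speakers' implementation; the function name `error_budget_remaining` and its parameters are hypothetical.

```python
def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the required fraction of good events, e.g. 0.999.
    1.0 means no budget spent, 0.0 means the budget is exactly used
    up, and a negative value means the SLO has been violated.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic in the window, so nothing was spent
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 events, 500 of them bad, 99.9% target.
    print(error_budget_remaining(999_500, 1_000_000, 0.999))
```

With a 99.9% target over one million events, 1,000 bad events are allowed, so 500 bad events leave roughly half of the budget.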

Human-in-the-Loop DevOps  Taylor Barnett  Failover Conf 2020

Within DevOps, automation has become a North Star. We want to automate the toil away, but the goal of "no toil" is unattainable. Many runbooks can only be partially automated because they still require human intervention and insights. Human-in-the-Loop DevOps is the idea that we can benefit from automating toil while still embracing the human interaction in specific tasks.
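One way to picture a partially automated runbook is sketched below: automated steps run on their own, while steps flagged as needing judgment wait for a human decision. This is an illustrative assumption, not the design presented in the talk; `Step`, `run_runbook`, and the `approve` callback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], str]   # performs the step, returns a short result
    needs_human: bool = False   # True when a person must approve first


def run_runbook(steps: List[Step], approve: Callable[[str], bool]) -> List[str]:
    """Run each step in order, asking `approve` before human-gated steps."""
    log = []
    for step in steps:
        if step.needs_human and not approve(step.name):
            log.append(f"{step.name}: skipped (not approved)")
            continue
        log.append(f"{step.name}: {step.action()}")
    return log


if __name__ == "__main__":
    steps = [
        Step("collect diagnostics", lambda: "done"),
        Step("restart database", lambda: "done", needs_human=True),
    ]
    # In practice `approve` might prompt an on-call engineer; here it
    # simply reads a yes/no answer from the terminal.
    for line in run_runbook(steps, lambda name: input(f"Run '{name}'? [y/N] ") == "y"):
        print(line)
```

Keeping the approval decision behind a callback means the same runbook can be driven by a terminal prompt, a chat bot, or a test double.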