May 2020

Human-in-the-Loop DevOps Taylor Barnett Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

Within DevOps, automation has become a North Star. We want to automate the toil away, but the goal of "no toil" is unattainable. Many runbooks can only be partially automated because they still require human intervention and insights. Human-in-the-Loop DevOps is the idea that we can benefit from automating toil while still embracing the human interaction in specific tasks.

The Future of DevOps is Resilience Engineering Amy Tobey Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

For more than a decade, many of us have been working to bring DevOps to organizations around the world. We’ve made amazing progress, but there’s so much more to do. Now that continuous integration & deployment are widespread and developers are taking more ownership of production, what’s next? Amy will talk about what Resilience Engineering is, how it relates to DevOps, and how she thinks it gives us the science and research we need to take our organizations to the next level of robustness while remaining agile and growing our ability to care for the people around us.

Performing chaos in a serverless world Gunnar Grosch Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

Chaos engineering is the practice of hypothesis testing through planned experiments to gain a better understanding of a system’s behavior. The principles of chaos engineering have been around for years, and we have now reached the point where chaos engineering has gone from being just a buzzword and a practice used by a few large organizations in very specific fields to being put into use by companies of all sizes and industries.

Swim Don't Sink: Why Training Matters to a Site Reliability Engineering Practice Jennifer Petoff

May 5, 2020 By Gremlin In Gremlin

Do you offer training to the engineers in your organization or do you throw them off the deep end to “sink or swim”? Providing training and education is universally important to set team members up for success in your organization and is critical for establishing a thriving Site Reliability Engineering (SRE) or DevOps practice and culture in the first place.

Fight, Flight, or Freeze - Releasing Organizational Trauma Matt Stratton Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

When humans are faced with a traumatic experience, our brains kick in with survival mechanisms. These mechanisms are the familiar fight or flight response, but can also include the freeze response - which occurs when we are terrified or feel that there is no chance of escape.

Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation Heidi Waterhouse

May 5, 2020 By Gremlin In Gremlin

Every disaster is a concatenation of smaller failures. How can we design software and processes to accept that we live in an imperfect world? Explore the concepts of resiliency, harm reduction, over-engineering, and planning for failure with real examples.

How to fail with Serverless Jeremy Daly Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

Everything fails all the time. Knowing how to deal with these failures in serverless applications becomes essential to building resilient, highly-available systems. In traditional monolithic applications, catching errors and handling retries is relatively straightforward. But as our systems become more distributed, we now have multiple (often asynchronous) components processing events from several sources, all with vastly different retry behaviors and failure mechanisms. Utilizing old patterns can cause errors to get swallowed, creating brittle, unreliable systems that are difficult to debug and hard to maintain.

Slowdown is the New Outage Marco Coulter Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

While outage-driven news headlines can cause stock prices to plummet in the short term, performance-driven reputation loss is a slow burn that leads to longer-term customer loss. This session compares slowdowns vs outages and the resulting need for insight more than observability. By understanding these differences, you'll be ready to drive agile applications, gain funding for lowering technical debt, and focus on customer retention.

The Halo of Resilience Engineering J. Paul Reed Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

Recent world-impacting events have caused us all to have to rethink the way we go about our daily work; in this talk, we'll look at how some of the pillars of Resilience Engineering might help you and your team deal with the changes we're all being forced to confront.

Improving a Distributed System Post-Incident Julius Zerwick Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

In this session, we will dive into a case study of how a team can recover & improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems due to their incredible complexity and scale, and incidents are a fact of life with these systems.

Reliability Matters More Than Ever Tammy Butow Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

Chaos and uncertainty are all around us. Tammy Butow kicks off Failover Conf by sharing why reliability and resilience matter now more than ever — and how you can achieve it.

Built-in Application Resiliency Allan Shone Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

When starting a new application build, keeping an eye on resiliency from the outset prevents headaches down the line. There are many ways to tackle this, especially across different language environments and system ecosystems, but many approaches are shared across them all. Through a dive into these approaches during this talk, viewers will get a high-level list of takeaways to use as a reference later and learn how to develop software that is more fault-tolerant and able to withstand the impact of failures.

Pitfalls in Measuring SLOs Danyel Fisher & Liz Fong-Jones Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

We built support for SLOs (Service Level Objectives) against our event store so we could monitor our own complex distributed system. In the process of doing so, we learned that there were a number of important aspects that we didn’t expect from carefully reading the SRE workbook. This talk is the story of the missing pieces, unexpected pitfalls, and how we solved those problems. We’d like to share what we learned and how we iterated on our SLO adventure.
