Gremlin

gremlin

Failover Conf follow-up: Your team and culture questions answered!

Thank you all for joining us last week for Failover Conf 2! We had a great turnout this year, with over 1,800 participants, 20 sponsors, and 9 amazing sessions. After more than a year of virtual events and video calls, we know that Zoom fatigue is real. We tried to make this event different by finding new ways to bring the community together and thinking of fun new ways to shake up the conference formula.

Fireside Chat with Ines Sombra and Ana Medina Failover Conf 2021

Reliability is a requirement for the modern internet. Ana Medina joins Inés Sombra, Sr. Director of Engineering at Fastly, to discuss their approach to resilience, how the past year has influenced the way they work, and what practices your engineering organization can adopt to become more reliable.

Fireside Chat with Jesse Robbins and Kolton Andrus Failover Conf 2021

Long before Chaos Engineering was even a phrase, Jesse Robbins was Amazon.com's "Master of Disaster" using intentional failure to help the company become more reliable. Kolton Andrus (CEO at Gremlin), sits down with Jesse to learn more about his early work with GameDays, the evolution of reliability, and where the future of SRE lies.

Implementing DevSecOps in the DoD by Nicolas Chaillan Failover Conf 2021

Delivering software quickly and securely is important for every organization, but it's even more important at the US Department of Defence (DoD) where reliability directly impacts national security. Nicolas Chaillan (Chief Software Officer, US Air Force) will discuss the DoD Enterprise DevSecOps Initiative—an initiative he leads along with the DOD’s Chief Information Officer that brings automated software tools, services and standards to DoD programs. He'll also share about Platform One, the Air Force's DoD-wide DevSecOps Enterprise Level Service that provides managed IT services capabilities, on-boarding, support, and baked-in zero trust security. This insight from operating at the most rigorous level will help you level up your own organization.

Pragmatic Incident Response: Lessons learned from failures by Robert Ross Failover Conf 2021

Incident response is overwhelming. So where do you start? There's a lot of advice out there, but it's mostly theories that aren't taking reality into account. So how do you get a process in place that actually works and scales? In this session, FireHydrant CEO and Co-Founder, Robert Ross, will share quick stories from his experience as an SRE and what tips he’s learned along the way.

Whats Next for DevOps by Emily Freeman  Failover Conf 2021

For over a decade, the DevOps movement has been using cultural change to power technological transformation and help companies deliver better products faster and more reliably. While many organizations have embraced this change and reaped the benefits, it hasn't come without challenges and many more remain. In this session, Emily Freeman (author of DevOps for Dummies) shares what's next for DevOps and how it will impact your organization.

The Evolution of Observability and Monitoring panel discussion Failover Conf 2021

Observability and monitoring are critical to detecting and troubleshooting problems to build more reliable applications. As our systems become increasingly complex, our tools for getting this crucial visibility and the way we respond need to evolve too. We'll sit down with SRE leaders to discuss the processes they use to get the most insight into their applications, how they've increase the speed of detection and response, and what organizations need to do to stay on top of growing complexity.

Leaving the Nest: Guidelines, guardrails, and human error by Laura Santamaria Failover Conf 2021

When we talk about reliable systems, we talk a lot about human error. Human error in an incident or a bug report is often treated with a bit of a facepalm reaction. The term masks a lot of scenarios from accidents to exhaustion to everything in between. However, human error helps us understand where our processes failed and how we can prevent the same error from happening again. In short, we need to think in terms of a framework of guidelines and guardrails. In this short talk, let’s discuss how guidelines like runbooks and guardrails like automation can help us address the fact that everyone will, at some point, make mistakes.