San Jose, CA, USA
May 4, 2021   |  By James Thigpen
Thank you all for joining us last week for Failover Conf 2! We had a great turnout this year, with over 1,800 participants, 20 sponsors, and 9 amazing sessions. After more than a year of virtual events and video calls, we know that Zoom fatigue is real. We tried to make this event different by finding new ways to bring the community together and shake up the conference formula.
Apr 27, 2021   |  By Matt Schillerstrom
Gremlin helps teams proactively improve the reliability of their systems by running chaos experiments on infrastructure including hosts, containers, and Kubernetes clusters. But as microservice-based architectures and automated cloud platforms become the norm, engineers are shifting their focus from managing infrastructure to managing services. In order to keep these services as resilient as possible, they need tools that can help them find failure modes, reduce incidents, and improve availability.
Apr 22, 2021   |  By Matt Schillerstrom
Today, Gremlin is excited to announce the ability to create an API key that can perform actions with the same set of permissions as your user account. This allows you to automate Gremlin tasks safely and securely.
Apr 1, 2021   |  By Gremlin
With everyone working remotely, video conference tools like Zoom have been a critical part of maintaining business continuity. It’s truly amazing that we can continue to work and connect with one another, even during a time when getting together in an office hasn’t been possible…
Mar 4, 2021   |  By Andre Newman
API gateways are a critical component of distributed systems and cloud-native deployments. They perform many important functions including request routing, caching, user authentication, rate limiting, and metrics collection. However, this means that any failures in your API gateway can put your entire deployment at risk.
Feb 16, 2021   |  By Andre Newman
When reading about Chaos Engineering, you’ll likely hear the terms “fault injection” or “failure injection.” As the name suggests, fault injection is a technique for deliberately introducing stress or failure into a system in order to see how the system responds. But what exactly does this mean, and how does this relate to Chaos Engineering?
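As a rough illustration of the idea, here is a minimal Python sketch of application-level fault injection. It is not Gremlin's implementation or API: `inject_faults`, `InjectedFault`, `fetch_price`, and `price_with_fallback` are hypothetical names invented for this example. The wrapper makes a chosen fraction of calls fail on purpose so you can observe how the calling code responds.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure when a fault is injected."""


def inject_faults(func, failure_rate=0.0, delay_seconds=0.0, rng=None):
    """Wrap func so that a fraction of calls is delayed and then fails on purpose."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            time.sleep(delay_seconds)  # simulate a slow dependency before failing
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(item):
    """Hypothetical dependency call; stands in for a network request."""
    return {"apple": 3, "pear": 5}[item]


def price_with_fallback(fetch, item, default=0):
    """The behavior under test: fall back to a default when the call fails."""
    try:
        return fetch(item)
    except InjectedFault:
        return default
```

Wrapping `fetch_price` with `failure_rate=1.0` forces every call to fail, which makes it easy to check that the fallback path in `price_with_fallback` actually works; lowering the rate limits the impact to a share of calls.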
Feb 9, 2021   |  By Andre Newman
Transport Layer Security (TLS), and its preceding protocol, Secure Sockets Layer (SSL), are essential components of the modern Internet. By encrypting network communications, TLS protects both users and organizations from publicly exposing their in-transit data to third parties. This is especially true for the web, where TLS is used to secure HTTP traffic (HTTPS) between backend servers and customers’ browsers.
Jan 28, 2021   |  By Taylor Smith
Modern applications are changing, and traditional testing practices are no longer up to the task. Learn more about the changing landscape of QA and how Chaos Engineering provides the necessary framework for testing modern applications.
Jan 26, 2021   |  By Aileen Horgan
Five years ago today, our co-founders launched Gremlin with a simple but bold mission: Build a more reliable internet. Over the past five years, the practice of Chaos Engineering has been increasingly employed as a means of proactively testing systems to make them more resilient and reliable.
Jan 25, 2021   |  By Andre Newman
What does reliability look like at a company that has thousands of employees and provides critical communication services to over 150,000 customers? We talked with Tyler Wells, Senior Director of Engineering at Twilio, to learn how he and his team created a culture of reliability at Twilio. He talked in depth about his experiences developing reliability goals, building reliability practices, and aligning engineering teams on these objectives.
May 7, 2021   |  By Gremlin
In this presentation, Tammy shares important failure modes to consider when responsible for the reliability of Kubernetes in your organization.
Apr 29, 2021   |  By Gremlin
Matt Stratton, host of the Arrested DevOps podcast, will host Jeff Smith, Director of Production Operations at Centro and author of the book "Operations Anti-patterns, DevOps Solutions," for an engaging conversation about building reliable teams using DevOps principles.
Apr 29, 2021   |  By Gremlin
Long before Chaos Engineering was even a phrase, Jesse Robbins was Amazon's "Master of Disaster," using intentional failure to help the company become more reliable. Kolton Andrus (CEO at Gremlin) sits down with Jesse to learn more about his early work with GameDays, the evolution of reliability, and where the future of SRE lies.
Apr 29, 2021   |  By Gremlin
Reliability is a requirement for the modern internet. Ana Medina joins Inés Sombra, Sr. Director of Engineering at Fastly, to discuss their approach to resilience, how the past year has influenced the way they work, and what practices your engineering organization can adopt to become more reliable.
Apr 28, 2021   |  By Gremlin
When we talk about reliable systems, we talk a lot about human error. Human error in an incident or a bug report is often treated with a bit of a facepalm reaction. The term masks a lot of scenarios from accidents to exhaustion to everything in between. However, human error helps us understand where our processes failed and how we can prevent the same error from happening again. In short, we need to think in terms of a framework of guidelines and guardrails. In this short talk, let’s discuss how guidelines like runbooks and guardrails like automation can help us address the fact that everyone will, at some point, make mistakes.
Apr 28, 2021   |  By Gremlin
Delivering software quickly and securely is important for every organization, but it's even more important at the US Department of Defense (DoD), where reliability directly impacts national security. Nicolas Chaillan (Chief Software Officer, US Air Force) will discuss the DoD Enterprise DevSecOps Initiative, which he leads along with the DoD’s Chief Information Officer and which brings automated software tools, services, and standards to DoD programs. He'll also talk about Platform One, the Air Force's DoD-wide DevSecOps Enterprise Level Service that provides managed IT services capabilities, on-boarding, support, and baked-in zero trust security. This insight from operating at the most rigorous level will help you level up your own organization.
Apr 28, 2021   |  By Gremlin
Incident response is overwhelming. So where do you start? There's a lot of advice out there, but most of it is theory that doesn't take reality into account. So how do you get a process in place that actually works and scales? In this session, FireHydrant CEO and Co-Founder, Robert Ross, will share quick stories from his experience as an SRE and the tips he’s learned along the way.
Apr 28, 2021   |  By Gremlin
For over a decade, the DevOps movement has been using cultural change to power technological transformation and help companies deliver better products faster and more reliably. While many organizations have embraced this change and reaped the benefits, it hasn't come without challenges and many more remain. In this session, Emily Freeman (author of DevOps for Dummies) shares what's next for DevOps and how it will impact your organization.
Apr 28, 2021   |  By Gremlin
Observability and monitoring are critical to detecting and troubleshooting problems to build more reliable applications. As our systems become increasingly complex, our tools for getting this crucial visibility and the way we respond need to evolve too. We'll sit down with SRE leaders to discuss the processes they use to get the most insight into their applications, how they've increased the speed of detection and response, and what organizations need to do to stay on top of growing complexity.
Apr 28, 2021   |  By Gremlin
The most successful organizations are the ones that embrace change and use it to become stronger and more resilient. In this panel discussion, we'll talk with engineering leaders about how they adapted to the challenges of 2020, what successes (and failures) they've seen, and where the future of reliable engineering teams is headed.
Jul 25, 2020   |  By Gremlin
Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime. This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering including why it is more needed than ever, how to get started, and best practices to maximize learnings and reduce risk.
Jul 25, 2020   |  By Gremlin
By following this guide, you'll successfully increase your organization's reliability with minimal effort and risk. This document will serve as your guide to implementing Chaos Engineering and Gremlin within your organization. From educating your team on the principles of Chaos Engineering to running automated experiments, this guide will walk through each stage of the adoption process in order to ensure a smooth and successful rollout.
Jul 25, 2020   |  By Gremlin
Amazon DynamoDB is fast, powerful, and intended for high availability. These are all valuable attributes in a data storage solution, but to be useful as advertised, it must be configured thoughtfully. Learn how to use Chaos Engineering to ensure DynamoDB performs the way you expect. Amazon DynamoDB is one of the most popular NoSQL databases and is the data store of choice for many teams running production workloads in AWS.
Jul 1, 2020   |  By Gremlin
Win over and convince your coworkers and management to explore and adopt Chaos Engineering and Site Reliability Engineering (SRE). The playbook provides ideas and techniques that can be used to articulate the need and benefits to internal stakeholders in your organization. It also guides the initial implementation in a way that will lead to success and growth across the organization. Implementing something new like Chaos Engineering successfully is a good way to get promoted and help the organization succeed, and this guide is here to help you.
Jul 1, 2020   |  By Gremlin
MongoDB is designed for performance, scale, and high availability. But, as with any software, you need to test your configuration to verify that it will work as advertised. Ensure that MongoDB performs the way you expect by using Chaos Engineering to test four key features. This guide includes four experiment tutorials to verify that MongoDB will perform reliably. To get the most out of MongoDB's rich features, including built-in data sharding and replication, it's crucial to test your configuration.

Gremlin aims to make the internet more reliable and prevent costly and reputation-damaging outages. Its failure-as-a-service platform empowers engineers to build more resilient systems through safe experimentation.

Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Turn failure into resilience with chaos engineering.

Build resilient infrastructure:

  • Resource Gremlins: Throttle CPU, Memory, I/O, and Disk.
  • State Gremlins: Reboot hosts, kill processes, travel in time.
  • Network Gremlins: Introduce latency, blackhole traffic, lose packets, fail DNS.

Test for application failure:

  • Test for failure in your code.
  • Fail or delay serverless functions.
  • Narrow the impact to a single user, device, or percentage of traffic.

Avoid downtime. Use Gremlin to turn failure into resilience.