Gremlin

San Jose, CA, USA
2016
  |  By Andre Newman
Reliability testing and Chaos Engineering are essential for finding reliability risks and improving the resiliency of systems. Gremlin makes it easy to do so, but not every engineer needs access to the same experiments, systems, or services. That’s why we released customizable role-based access controls (RBAC), letting Gremlin customers control which actions your users can perform in Gremlin.
  |  By Andre Newman
Encryption is a fundamental part of nearly every modern application, whether you’re storing data, sending data to customers, or sharing data between backend services. Most organizations have a data encryption strategy, and nearly every web page is using HTTPS, thanks to initiatives like Let’s Encrypt. But setting up encryption isn’t a one-time initiative. Over time, the certificates backing modern encryption expire and need to be replaced.
  |  By Andre Newman
Load balancers are some of the most important load-bearing (pun intended) components in cloud environments. They perform multiple critical tasks: network switching, packet inspection, and of course, routing. Most cloud-based load balancers focus on load balancing within a single zone, but what if you have resources spread across multiple zones?
  |  By Andre Newman
Reliability testing and observability are similar in one important way: engineering teams know they should be doing it, but they’re not sure how to start, or they don’t have the right resources, or they need to focus on competing priorities like feature development and incident response. In an ideal world, reliability and observability would be automated processes that configure, monitor, and run themselves.
  |  By Gavin Cahill
When it comes to reliability, cloud providers use a Shared Responsibility Model. In essence, they’ll keep the infrastructure reliable, while you’re responsible for architecting reliability into your systems. To help make this easier, they’ve published a variety of best practice guides, such as the AWS Well-Architected Framework. These lengthy documents are filled with recommendations to help you architect a more secure, more reliable system.
  |  By Andre Newman
The worst thing you could do after successfully deploying to a new environment is to accidentally delete critical infrastructure. Unfortunately, that happened to one Google Cloud customer when their private cloud subscription was accidentally deleted, resulting in nearly two weeks of downtime. This isn’t an isolated problem either: Microsoft Azure had a similar problem when a typo inadvertently deleted an entire SQL Server instance rather than a specific database.
  |  By Gavin Cahill
There’s a reason why observability and incident response practices have become standard across modern software development. Anyone wanting to minimize downtime and deliver reliable, available applications needs to have fully instrumented systems and playbooks so they can respond quickly and effectively to outages or incidents. But there’s another piece to the reliability puzzle: resilience testing.
  |  By Ryan Detwiller
Today, Gremlin is introducing Gremlin for AWS, a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. The cloud opens up a range of reliability challenges that didn’t exist before, especially for customers running distributed, mission-critical workloads. Teams experience the pain of failed migrations, frequent incidents, and reliability toil, but often struggle to modernize their approach to reliability as they modernize their infrastructure.
  |  By Andre Newman
Migrating to a new platform can often feel like navigating a maze of technical challenges, especially when the platform is as complex as Kubernetes. Kubernetes has a vast number of features designed to help with deploying and managing large applications, but learning how to use it effectively can be just as challenging as‌ moving your workloads over. This doesn’t mean it’s impossible, of course, and there are several strategies for easing this process.
  |  By Andre Newman
Microservices have forever changed the way we build applications. Tools like Docker and Kubernetes made microservice-based architectures widely accessible to software developers, and cloud platforms like Amazon EKS made deploying containers fast and inexpensive. They've also enabled even small engineering teams to deploy code faster, leverage fault tolerance and redundancy, scale more efficiently, and take full ownership of their services from development all the way into production.
  |  By Gremlin
Improving reliability might seem daunting, but you'd be surprised how much impact you can have with a relatively light lift. "Reliability doesn't need to be everybody stopped the world for a month, kind of a tech debt thing. If we spent 20 minutes a week, we could actually save ourselves a ton of time over the course of the year. The business needs to be efficient and agile, but it's important that the reliability is there. And so we really need people to be able to react quickly, adapt, and do a little bit along the way.
  |  By Gremlin
Reliability efforts do take up some bandwidth, but in the end it's worth it—as our customers find out when their outage costs go down. "Everyone has their own priorities that they're dealing with. Given unlimited time and money, absolutely everyone would want to build the best possible system that is the most secure, performant, resilient, and everything.
  |  By Gremlin
Are your applications as reliable as you planned? How do you know? The only way to ensure systems are resilient to common failure conditions is to test them, yet many large enterprises struggle with the effort and expense to do so. In this webinar, Anantha Movva, a former head of SRE and Performance Engineering at one of the top 10 North American banks, will share how he drove Chaos Engineering and resilience testing adoption throughout his organization.
  |  By Gremlin
Reliability is everyone's problem—not just the SRE team's. "It's not just the SRE's problem. It's everybody's problem. So the SREs, they can run point and they can help report and help us understand, but we also have to hold the teams accountable. Are the teams investing time in reliability? Are they finding and fixing issues? Are we giving them space? And I think that comes back to, does the business see the benefit and do we have a good way of quantifying the benefit to the business?"—Kolton Andrus, Gremlin CTO.
  |  By Gremlin
You're going to spend time fixing reliability—but it's your choice whether it's during an outage or ahead of time on your schedule and for less costs. Which will you choose? "We all know when things go wrong, it cost us a million dollars and it was really bad. Let's have that never happen again. But when we say, I need every engineering team to spend one hour, one day a week on reliability, does everyone lose their mind, or is that a reasonable request? Can we amortize out the cost of that?
  |  By Gremlin
Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Fully-managed SaaS services offer incredible scalability and accessibility, but at a cost: they’re also single points of failure. If your application depends on a SaaS service and the service fails, guess who your customers will blame? We need to design applications to anticipate and work around managed service failures, but how do we do that without having to wait for the service to fail?
  |  By Gremlin
Thinking about integrating Gremlin into your existing pipeline? Look no farther than the Gremlin API. "The next step then was to build the right tooling such that the resiliency tests can be run from a pipeline. Gremlin's API first approach made it possible to do this in a very easy manner because everything that we could do from the UI and manually, we could replicate all of that through the API as well.
  |  By Gremlin
Are you recognizing the good work engineers do to prevent outages? "The people that are out there doing good work to prevent fires from ever occurring, we're not often recognizing them. We're not often rewarding them. And once things go wrong, someone comes in and fixes it. That's great. That's needed. But we're rewarding that behavior. And so it becomes a bit of people are motivated by what behavior you reward.
  |  By Gremlin
Did you know you can use the Gremlin API to integrate resiliency tests into your CI/CD pipeline? Our partner Nagarro has even made it part of their shift left package. "What we do is shift left and add a chaos stage to the pipeline. We have created the shift left accelerator package. It integrates with load tests and Gremlin APIs to set up the test scenario.
  |  By Gremlin
Gremlin for AWS is a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. The cloud opens up a range of reliability challenges that didn’t exist before, especially for customers running distributed, mission-critical workloads. Teams experience the pain of failed migrations, frequent incidents, and reliability toil, but often struggle to modernize their approach to reliability as they modernize their infrastructure. That’s where Gremlin for AWS can help.
  |  By Gremlin
Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.
  |  By Gremlin
Gremlin provides a variety of ways to test the resilience of your systems, which we call "attacks". Running different attacks lets you uncover unexpected behaviors, validate resilience mechanisms, and improve the overall reliability of your systems and services. This ebook explains each of Gremlin's attacks in complete detail, including what each attack does, how it impacts your systems, and the technical and business objectives the attack helps solve.
  |  By Gremlin
Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime. This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering including why it is more needed than ever, how to get started, and best practices to maximize learnings and reduce risk.
  |  By Gremlin
By following this guide, you'll successfully increase your organization's reliability with minimal effort and risk. This document will serve as your guide to implementing Chaos Engineering and Gremlin within your organization. From educating your team on the principles of Chaos Engineering to running automated experiments, this guide will walk through each stage of the adoption process in order to ensure a smooth and successful rollout.
  |  By Gremlin
Amazon DynamoDB is fast, powerful, and intended for high availability. These are all valuable attributes in a data storage solution, but to be useful as advertised, it must be configured thoughtfully. Learn how to use Chaos Engineering to ensure DynamoDB performs the way you expect. In this guide, we cover: Amazon DynamoDB is one of the most popular NoSQL databases and is the data store of choice for many teams running production workloads in AWS.
  |  By Gremlin
Win over and convince your coworkers and management to explore and adopt Chaos Engineering and Site Reliability Engineering (SRE). The playbook provides ideas and techniques that can be used to articulate the need and benefits to internal stakeholders in your organization. It also guides the initial implementation in a way that will lead to success and growth across the organization. Implementing something new like Chaos Engineering successfully is a good way to get promoted and help the organization succeed, and this guide is here to help you.
  |  By Gremlin
MongoDB is designed for performance, scale, and high-availability. But, as with any software, you need to test your configuration to verify that it will work as advertised. Ensure that MongoDB performs the way you expect by using Chaos Engineering to test four key features. This guide includes four experiment tutorials to verify that MongoDB will perform reliably: In order to ensure you get the most out of MongoDB's rich features, including built-in data sharding and replication, it's crucial to test your configuration.

Gremlin aims to make the internet more reliable and prevent costly and reputation-damaging outages. Its failure-as-a-service platform empowers engineers to build more resilient systems through safe experimentation.

Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Turn failure into resilience with chaos engineering.

Build resilient infrastructure:

  • Resource Gremlins: Throttle CPU, Memory, I/O, and Disk.
  • State Gremlins: Reboot hosts, kill processes, travel in time.
  • Network Gremlins: Introduce latency, blackhole traffic, lose packets, fail DNS.

Test for application failure:

  • Test for failure in your code.
  • Fail or delay serverless functions.
  • Narrow the impact to a single user, device, or percentage of traffic.

Avoid downtime. Use Gremlin to turn failure into resilience.