Gremlin

San Jose, CA, USA
2016
  |  By Sam Rossoff
One of the real successes of the Agile Software development movement has been the push to have regular, frequent deployments. This has manifested as build and deployment automation and the general adoption of CI/CD. As engineers automate more processes of their software release lifecycle, an important question is how to automate Quality Assurance, which includes resilience testing and, more specifically, Fault Injection.
  |  By Andre Newman
CPU usage is one of the most common metrics used in observability and cloud computing. It’s for a good reason: CPU usage represents the amount of work a system is performing, and if it’s near 100% capacity, adding more work could make the system unstable. The solution is to scale - add more hosts with more CPU capacity, migrate some of your workloads to the new host, and split the traffic between them using a load balancer.
  |  By Andre Newman
2024 is off to a fast start here at Gremlin. Since our last release roundup, we’ve released new experiment types, new features to improve integration with cloud platforms, and improvements to our auto-detection processes. Now you can push processes to their limits, find dependencies even easier, limit when tests can be run, and much more. We also introduced a slew of platform improvements to improve efficiency, performance, and user experience in the Gremlin web application.
  |  By Andre Newman
We rarely think about how many processes are running on our systems. Modern CPUs are powerful enough to run thousands of processes concurrently, but at what point do our systems become oversaturated? When you’re running large-scale distributed applications, you might reach this limit sooner than you'd expect. How can you determine what that limit is, and how does that affect the number and complexity of the workloads you deploy?
  |  By Andre Newman
Memory is a surprisingly difficult thing to get right in cloud environments. The amount of memory (also called RAM, or random-access memory) in a system indirectly determines how many processes can run on a system, and how large those processes can get. You might be able to run a dozen database instances on a single host, but that same host may struggle to run a single large language model.
  |  By Andre Newman
If your organization asked you to report on the reliability improvements you’ve made over the past 90 days, would you be able to pull up a report? If you’re like many engineers, this question might make you anxious. Reliability is a difficult metric to quantify in a meaningful way, let alone measure.
  |  By Sam Rossoff
Many organizations who are looking to introduce Fault Injection as a testing technique start with non-production environments, but don't always go back and reconsider that choice as they mature beyond initial assessment. However, there's a strong case for running these tests in your live systems. It's important to consider the trade-offs when choosing to test in production or non-production environments, as it can have far-reaching impacts on the efficacy and cost of improving the resilience of software.
  |  By Andre Newman
Cloud computing has made provisioning new servers easy, fast, and relatively cheap. Almost anyone can log into a cloud console, spin up a new server, and deploy an application. And if they need greater uptime, major cloud providers include all kinds of settings, services, and configurations to add fault tolerance and failover. So why is it that many services fail when a single server instance fails?
  |  By Sam Rossoff
Fault injection is a tool, and like all tools, there are a variety of ways operators can employ it, but most of them tend to fall into one of two categories.
  |  By Gavin Cahill
Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate reliability risks, then you can prevent incidents before they happen. In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more. And they’re more prevalent than you might think.
  |  By Gremlin
Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.
  |  By Gremlin
Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Pop quiz—what are all of the dependencies your services rely on? If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams rely on spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?
  |  By Gremlin
Learn how to automatically find and fix the most critical Kubernetes reliability risks in enterprise organizations. Recent research shows that nearly every organization has reliability risks in their Kubernetes clusters. Many of them are caused by simple misconfiguration, but they can have devastating consequences—including taking critical services offline. And while you could manually review every Kubernetes deployment, the speed and scale at which most organizations deploy to Kubernetes makes that impractical.
  |  By Gremlin
Join Gremlin CTO and Founder Kolton Andrus to hear practical strategies for building a collaborative culture of reliability. High-velocity DevOps orgs and complex cloud-native architectures have made reliability harder than ever. Organizations are turning to SREs to make sure systems are reliable, but with so many stakeholders and competing priorities, many companies are still struggling to get ahead of the outages and incidents—SREs simply can't do it all by themselves.
  |  By Gremlin
Today’s technology leaders are facing a reliability gap. Customers expect their apps to be fast and available. But with Devops and distributed systems driving more speed and complexity, it’s harder than ever to find and fix the reliability risks that can impact customer experience–before it’s too late. To close the Reliability gap, we need a reliability strategy. One that’s proactive, measurable, built-in and automated. We need a reliability platform.
  |  By Gremlin
Demonstrate your reliability expertise, increase your visibility, and advance your career with a Gremlin Enterprise Chaos Engineering certification. Chaos Engineering continues to grow in popularity and is rapidly becoming a job requirement for Engineering teams focused on reliability. In this webinar, Sr. Reliability Specialist Andre Newman goes over the mindset shifts, best practices, and key information you need to prep for your certification.
  |  By Gremlin
Does it feel like your team spends all its time putting out incident fires? Change the story with a proactive reliability program that actively improves reliability. Join reliability expert and engineering leader Jeff Nickoloff for a webinar that lays out the common traits for successful reliability programs so you can build more reliability and spend less time firefighting. You’ll also get a downloadable checklist worksheet to help you create and evaluate your reliability program.
  |  By Gremlin
This video showcases Gremlin's Detected Risks feature. Detected risks are high-priority reliability concerns that Gremlin automatically identifies in an environment. These include misconfigurations, bad default values, and reliability anti-patterns. Gremlin prioritizes these risks based on severity and impact, giving instantaneous feedback on risks and action items to improve the reliability and stability of each service.
  |  By Gremlin
Learn how to identify and track reliability risks, prioritize fixes, and prove results to your organization using Gremlin's free reliability tracker spreadsheet.
  |  By Gremlin
Today’s technology leaders are facing a reliability gap. Customers expect their apps to be fast and available. But with Devops and distributed systems driving more speed and complexity, it’s harder than ever to find and fix the reliability risks that can impact customer experience–before it’s too late. To close the Reliability gap, we need a reliability strategy. One that’s proactive, measurable, built-in and automated. We need a reliability management platform.
  |  By Gremlin
Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.
  |  By Gremlin
Gremlin provides a variety of ways to test the resilience of your systems, which we call "attacks". Running different attacks lets you uncover unexpected behaviors, validate resilience mechanisms, and improve the overall reliability of your systems and services. This ebook explains each of Gremlin's attacks in complete detail, including what each attack does, how it impacts your systems, and the technical and business objectives the attack helps solve.
  |  By Gremlin
Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime. This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering including why it is more needed than ever, how to get started, and best practices to maximize learnings and reduce risk.
  |  By Gremlin
By following this guide, you'll successfully increase your organization's reliability with minimal effort and risk. This document will serve as your guide to implementing Chaos Engineering and Gremlin within your organization. From educating your team on the principles of Chaos Engineering to running automated experiments, this guide will walk through each stage of the adoption process in order to ensure a smooth and successful rollout.
  |  By Gremlin
Amazon DynamoDB is fast, powerful, and intended for high availability. These are all valuable attributes in a data storage solution, but to be useful as advertised, it must be configured thoughtfully. Learn how to use Chaos Engineering to ensure DynamoDB performs the way you expect. In this guide, we cover: Amazon DynamoDB is one of the most popular NoSQL databases and is the data store of choice for many teams running production workloads in AWS.
  |  By Gremlin
Win over and convince your coworkers and management to explore and adopt Chaos Engineering and Site Reliability Engineering (SRE). The playbook provides ideas and techniques that can be used to articulate the need and benefits to internal stakeholders in your organization. It also guides the initial implementation in a way that will lead to success and growth across the organization. Implementing something new like Chaos Engineering successfully is a good way to get promoted and help the organization succeed, and this guide is here to help you.
  |  By Gremlin
MongoDB is designed for performance, scale, and high-availability. But, as with any software, you need to test your configuration to verify that it will work as advertised. Ensure that MongoDB performs the way you expect by using Chaos Engineering to test four key features. This guide includes four experiment tutorials to verify that MongoDB will perform reliably: In order to ensure you get the most out of MongoDB's rich features, including built-in data sharding and replication, it's crucial to test your configuration.

Gremlin aims to make the internet more reliable and prevent costly and reputation-damaging outages. Its failure-as-a-service platform empowers engineers to build more resilient systems through safe experimentation.

Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Turn failure into resilience with chaos engineering.

Build resilient infrastructure:

  • Resource Gremlins: Throttle CPU, Memory, I/O, and Disk.
  • State Gremlins: Reboot hosts, kill processes, travel in time.
  • Network Gremlins: Introduce latency, blackhole traffic, lose packets, fail DNS.

Test for application failure:

  • Test for failure in your code.
  • Fail or delay serverless functions.
  • Narrow the impact to a single user, device, or percentage of traffic.

Avoid downtime. Use Gremlin to turn failure into resilience.