San Jose, CA, USA
Feb 8, 2023   |  By Andre Newman
In January 2023, Google released its infrastructure reliability guide, which provides guidelines on how to build high-availability applications in Google Cloud. While it's written for Google Cloud, it provides some excellent general-purpose information on how to architect reliable applications on any cloud provider. In this blog, we'll explain each of the factors the guide covers and how you can use Gremlin to ensure you're meeting your reliability requirements.
Feb 6, 2023   |  By Andre Newman
Imagine a perfect world where software releases ship bug-free. Developers write perfect code the first time, all tests pass without issues, operations teams effortlessly deploy builds to production, and customers never experience defects. Everyone's happy, and the Engineering team can focus exclusively on building and delivering features. Of course, we don't live in a perfect world.
Jan 31, 2023   |  By Andre Newman
For many businesses, prioritizing reliability is an ongoing challenge. Building reliable systems and services is critical for growing revenue and customer trust, but other initiatives—like building new products and features—often take precedence since they provide a clearer and more immediate return. That's not to say reliability doesn't have clear value, but proving this value to business leaders can be tricky.
Jan 19, 2023   |  By Andre Newman
Transport Layer Security (TLS) and its predecessor, Secure Sockets Layer (SSL), are essential to the modern Internet. Encrypting network communications using TLS protects users and organizations from exposing in-transit data to third parties. This is especially important for the web, where TLS secures HTTP traffic (HTTPS) between backend servers and customers’ browsers.
Nov 10, 2022   |  By Andre Newman
How can you tell if your systems are reliable when under load? A common answer is to open your observability dashboards, wait for a high-traffic event (like Black Friday), and cross your fingers. While this approach is certainly effective, it's far from ideal. Without proactive reliability and load testing, we have no idea if a system will hold up to real-world usage patterns, which could mean a production outage at the worst possible time.
Nov 7, 2022   |  By Andre Newman
Modern applications are a web of interdependent services. As applications grow in size and complexity, and as more engineering teams adopt service-based architectures like microservices, this web becomes deeper and denser. Eventually, keeping track of the interdependencies between services becomes a complex and time-consuming task in and of itself. In addition, if any of these dependencies fails, it can have cascading impacts on the rest of your services and on the application as a whole.
Oct 25, 2022   |  By Andre Newman
Part of a successful reliability program is being able to monitor and review your progress toward improving reliability. Being able to run tests on services is a big part of it, but how can you tell you're making progress if you can only see your latest test results? There should be a way to track improvements or regressions in your reliability testing practice across your organization in a way that's easy to digest. That's where the Reliability Dashboard comes in.
Oct 20, 2022   |  By Andre Newman
Measuring and improving the reliability of technical systems has always been challenging. As an industry, we've developed several practices to try and address reliability concerns, such as incident response, observability, and Chaos Engineering. This led SREs and service owners to measure reliability in a handful of ways.
Oct 11, 2022   |  By Andre Newman
To many engineers, the idea that you can accurately and comprehensively track your application's user experience using just a few simple metrics might sound far-fetched. Believe it or not, there are four metrics that aim to do just that. They're called the four Golden Signals and should be a core part of your observability and reliability practices.
Sep 2, 2022   |  By Andre Newman
The past ten years marked a significant change in how software teams build and deploy applications. We moved away from bulky, slow, monolithic applications toward lightweight, scalable, distributed service-based applications. Meanwhile, tools like Docker, Kubernetes, and other container platforms helped accelerate this process. Despite this sudden growth, a fundamental question remains: what exactly is a service, and how does it fit into a microservice architecture?
Nov 16, 2022   |  By Gremlin
Learn how to integrate your load testing tool with Gremlin Reliability Management so you can automatically run load tests alongside your reliability tests.
Oct 24, 2022   |  By Gremlin
Gremlin Reliability Management helps teams standardize and automate reliability, one service at a time. In this video, we walk through the platform by showing you how to add your services to Gremlin, integrate your Golden Signals, run reliability tests, and generate reliability scores.
Sep 2, 2022   |  By Gremlin
In this video, we show you how to add a Golden Signal to a service. Gremlin uses your Golden Signals to ensure your services are still healthy and responsive during reliability tests. You can configure Golden Signals to use an existing monitor in your observability tools, such as Datadog, New Relic, or Prometheus. We recommend adding all four Golden Signals to each of your services to ensure comprehensive coverage.
Sep 2, 2022   |  By Gremlin
This short demo video shows you how to add a Kubernetes service to Gremlin Reliability Management (RM). We'll walk you through selecting the parts of your infrastructure that make up your service, identifying processes for dependency detection, and adding your Golden Signals.
Sep 2, 2022   |  By Gremlin
Gremlin Reliability Management helps teams standardize and automate reliability, one service at a time. In this video, we walk through the platform by showing you how to add your services to Gremlin, integrate your Golden Signals, run reliability tests, and generate reliability scores.
May 10, 2022   |  By Gremlin
Learn all about Gremlin's GameDay feature in this webinar presented by Sydney Lesser and Andre Newman. GameDays are organized team events to proactively improve reliability using Chaos Engineering principles. Gremlin makes it easier than ever to prepare, execute, and learn from them. Increase your system’s reliability with safe, secure, and simple GameDays.
May 10, 2022   |  By Gremlin
Learn how to run a GameDay in Gremlin. This video walks you through creating a GameDay, adding and running Scenarios, recording your observations, and linking to Jira in the Gremlin web app.
Sep 9, 2022   |  By Gremlin
Systems fail, sometimes publicly and at great cost. Airlines have experienced system-wide ticketing outages, causing hundreds of flight cancellations and significant inconvenience to customers. Retailers have experienced website crashes on the busiest shopping days of the year, costing millions in lost revenue and customer goodwill. It is vital to understand both DevOps and SRE and the roles they play in preventing such outages.
Mar 29, 2022   |  By Gremlin
Gremlin provides a variety of ways to test the resilience of your systems, which we call "attacks". Running different attacks lets you uncover unexpected behaviors, validate resilience mechanisms, and improve the overall reliability of your systems and services. This ebook explains each of Gremlin's attacks in complete detail, including what each attack does, how it impacts your systems, and the technical and business objectives the attack helps solve.
Jul 25, 2020   |  By Gremlin
Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime. This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering including why it is more needed than ever, how to get started, and best practices to maximize learnings and reduce risk.
Jul 25, 2020   |  By Gremlin
By following this guide, you'll successfully increase your organization's reliability with minimal effort and risk. This document will serve as your guide to implementing Chaos Engineering and Gremlin within your organization. From educating your team on the principles of Chaos Engineering to running automated experiments, this guide will walk through each stage of the adoption process in order to ensure a smooth and successful rollout.
Jul 25, 2020   |  By Gremlin
Amazon DynamoDB is fast, powerful, and intended for high availability. These are all valuable attributes in a data storage solution, but to be useful as advertised, it must be configured thoughtfully. Learn how to use Chaos Engineering to ensure DynamoDB performs the way you expect. Amazon DynamoDB is one of the most popular NoSQL databases and is the data store of choice for many teams running production workloads in AWS.
Jul 1, 2020   |  By Gremlin
Win over and convince your coworkers and management to explore and adopt Chaos Engineering and Site Reliability Engineering (SRE). The playbook provides ideas and techniques that can be used to articulate the need and benefits to internal stakeholders in your organization. It also guides the initial implementation in a way that will lead to success and growth across the organization. Implementing something new like Chaos Engineering successfully is a good way to get promoted and help the organization succeed, and this guide is here to help you.
Jul 1, 2020   |  By Gremlin
MongoDB is designed for performance, scale, and high availability. But, as with any software, you need to test your configuration to verify that it will work as advertised. Ensure that MongoDB performs the way you expect by using Chaos Engineering to test four key features. This guide includes four experiment tutorials to verify that MongoDB will perform reliably. To get the most out of MongoDB's rich features, including built-in data sharding and replication, it's crucial to test your configuration.

Gremlin aims to make the internet more reliable and prevent costly and reputation-damaging outages. Its failure-as-a-service platform empowers engineers to build more resilient systems through safe experimentation.

Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Turn failure into resilience with chaos engineering.

Build resilient infrastructure:

  • Resource Gremlins: Throttle CPU, memory, I/O, and disk.
  • State Gremlins: Reboot hosts, kill processes, travel in time.
  • Network Gremlins: Introduce latency, blackhole traffic, lose packets, fail DNS.

Test for application failure:

  • Test for failure in your code.
  • Fail or delay serverless functions.
  • Narrow the impact to a single user, device, or percentage of traffic.
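
To make the idea of narrowing impact more concrete, here is a minimal sketch of how an application might limit an injected failure to specific users or to a percentage of traffic. It is not Gremlin's API or implementation; the names `InjectedFailure`, `bucket_for`, `should_fail`, and `call_with_injection` are hypothetical and exist only for illustration. The sketch assumes each request carries a user ID and hashes that ID into one of 100 stable buckets, so the same user is consistently inside or outside the experiment.

```python
import hashlib


class InjectedFailure(Exception):
    """Raised instead of a real result when a call is selected for failure."""


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_fail(user_id: str, percentage: int,
                targeted_users: frozenset = frozenset()) -> bool:
    """Return True if this user falls inside the experiment's scope."""
    # Explicitly targeted users are always included.
    if user_id in targeted_users:
        return True
    # Otherwise include roughly `percentage` percent of users by bucket.
    return bucket_for(user_id) < percentage


def call_with_injection(handler, user_id: str, percentage: int,
                        targeted_users: frozenset = frozenset()):
    """Run `handler`, or raise InjectedFailure for users in scope."""
    if should_fail(user_id, percentage, targeted_users):
        raise InjectedFailure(f"injected failure for user {user_id}")
    return handler(user_id)
```

For example, `call_with_injection(handler, "alice", 0, frozenset({"alice"}))` would fail only for the user `alice`, while a `percentage` of 5 with no targeted users would fail roughly one in twenty users. Hashing the user ID rather than picking randomly per request keeps the affected group stable for the length of the experiment.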

Avoid downtime. Use Gremlin to turn failure into resilience.