Gremlin

Integrating Gremlin with your observability tools

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. To get the most value out of Chaos Engineering and reliability testing, you need a way to observe your service’s behavior. Observability tools offer insight into how your systems are performing, but observability on its own isn’t enough. You need a way to monitor your systems while testing their reliability so you can determine whether your service passed or failed a test.

Building Resilience from Architecture to Production with AWS & Gremlin

Unreliable software can have a painful impact on your customers and your business—something we’ve all seen and felt during high-profile outages. And while building on the cloud with AWS unlocks improved scaling and reliability capabilities, the complexity of modern distributed systems can potentially introduce outage-causing reliability risks. How can you be sure your systems are resilient to failure when they’re based on complex architecture, built by hundreds of teams, and are being updated almost constantly?

How reliability engineering can verify disaster recovery plans

Disaster recovery plans have always been a crucial part of businesses—especially essential services like banks. These plans help keep your business up and running during a disaster or extreme scenario so you can be there for your customers when they need you the most.

Three serverless reliability risks you can solve today using Failure Flags

Serverless platforms make it incredibly easy to deploy applications. You can take raw code, push it up to a service like AWS Lambda, and have a running application in just a few seconds. The serverless platform provider assumes responsibility for hosting and operating the platform, freeing you up to focus on your application. Naturally, this raises a question: if something goes wrong, who’s responsible?

Best Practices for Testing Zone Redundancy

As the story goes, in the old days Amazon used to cut power to data centers to see whether its services were actually redundant across them, and it only abandoned the practice when EC2 customers started to complain (no matter how many times they were warned their instances might disappear without notice). The story may be apocryphal, but a data center doesn’t need to lose power to go down.

Office Hours: How to test serverless applications using Failure Flags

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Serverless applications are ideal for deploying scalable applications without having to manage infrastructure. However, this also makes it difficult to test their reliability. It’s easy to simulate a network outage or latency when you have direct access to the host that your software’s running on. What do you do when you only have control over the code?

How Visa Cross Border Solutions Reduces Outages by Testing System Resilience in Their SDLC

For global financial services companies, reliability must be built in and validated before and after shipping to production. Resilience testing is crucial for verifying the reliability of your applications under real-world conditions. But ad-hoc testing and exploratory experiments aren’t sufficient: you need to run automated, standardized tests at global scale.

Interpreting your reliability test results

Gremlin’s default suite of reliability tests analyzes critical functions of modern services: scalability, redundancy, and resilience to dependency failures. Services that pass this suite of tests can be trusted to remain available during unexpected incidents. But what happens when a service fails a test? How do you take failed test results and turn them into actionable insights? This blog post aims to answer that question.

Office Hours: Get better reliability on AWS with our new release

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Cloud platforms make it easier than ever to deploy massively scalable, distributed workloads, but this is a double-edged sword. There are reliability challenges unique to the cloud that didn’t exist before. Failed migrations, recurring incidents, and reliability toil take their toll.

Release Roundup August 2024

Over the past year, the Gremlin team has focused on giving you more tools to adapt Gremlin to your organization’s reliability needs. We started with customizable reliability tests, and now, we’ve released customizable role-based access controls (RBAC). We’ve also made it easier to target specific availability zones when running Failure Flags experiments, and to run experiments behind a proxy. Keep reading to learn more!