
What is the Well-Architected Cloud Test Suite?

When it comes to reliability, cloud providers use a Shared Responsibility Model. In essence, they’ll keep the infrastructure reliable, while you’re responsible for architecting reliability into your systems. To help make this easier, they’ve published a variety of best practice guides, such as the AWS Well-Architected Framework. These lengthy documents are filled with recommendations to help you architect a more secure, more reliable system.

How to prevent accidental load balancer deletions

The worst thing you could do after successfully deploying to a new environment is to accidentally delete critical infrastructure. Unfortunately, that happened to one Google Cloud customer when their private cloud subscription was accidentally deleted, resulting in nearly two weeks of downtime. This isn’t an isolated problem either: Microsoft Azure had a similar problem when a typo inadvertently deleted an entire SQL Server instance rather than a specific database.
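The excerpt does not say which safeguard the full article recommends. As one illustration, assuming the load balancer is an AWS Application or Network Load Balancer managed through boto3, deletion protection can be switched on with the `deletion_protection.enabled` attribute. The helper name `enable_deletion_protection` and the example ARN are hypothetical; only `modify_load_balancer_attributes` is a real ELBv2 client call.

```python
from typing import Any


def enable_deletion_protection(client: Any, load_balancer_arn: str) -> dict:
    """Turn on deletion protection for one load balancer.

    `client` is expected to behave like boto3.client("elbv2"). It is passed
    in (an assumption of this sketch) so the helper can be exercised with a
    stub instead of real AWS credentials.
    """
    return client.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        # ELBv2 attribute values are strings, so "true" rather than True.
        Attributes=[{"Key": "deletion_protection.enabled", "Value": "true"}],
    )


if __name__ == "__main__":
    import boto3  # third-party; only needed when calling AWS for real

    # Hypothetical ARN, replace with the ARN of your own load balancer.
    arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/example/0123456789abcdef"
    print(enable_deletion_protection(boto3.client("elbv2"), arn))
```

With the attribute set, a delete request for that load balancer is rejected until the attribute is explicitly set back to `false`, which turns an accidental deletion into a deliberate two-step action.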

Observability and incident response need resilience testing

There’s a reason why observability and incident response practices have become standard across modern software development. Anyone wanting to minimize downtime and deliver reliable, available applications needs to have fully instrumented systems and playbooks so they can respond quickly and effectively to outages or incidents. But there’s another piece to the reliability puzzle: resilience testing.

Reward engineers who fix problems before they cause outages

Are you recognizing the good work engineers do to prevent outages? "The people that are out there doing good work to prevent fires from ever occurring, we're not often recognizing them. We're not often rewarding them. And once things go wrong, someone comes in and fixes it. That's great. That's needed. But we're rewarding that behavior. And so it becomes a bit of people are motivated by what behavior you reward."

Use the Gremlin API to add Chaos Engineering to your pipeline

Did you know you can use the Gremlin API to integrate resiliency tests into your CI/CD pipeline? Our partner Nagarro has even made it part of their shift left package. "What we do is shift left and add a chaos stage to the pipeline. We have created the shift left accelerator package. It integrates with load tests and Gremlin APIs to set up the test scenario."
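A "chaos stage" like the one described in the quote can be thought of as a pipeline gate: confirm the service is healthy, start an experiment, watch health while it runs, and always stop the experiment afterwards. The following is a minimal sketch of that control flow only. It is not Nagarro's accelerator and it does not call the Gremlin API; `start_experiment`, `stop_experiment`, and `is_healthy` are hypothetical callables standing in for wherever the real API requests and load-test checks would go.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class StageResult:
    passed: bool
    reason: str


def run_chaos_stage(
    start_experiment: Callable[[], str],
    stop_experiment: Callable[[str], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
    interval_seconds: float = 0.0,
) -> StageResult:
    """Run one experiment and report whether the service stayed healthy."""
    # Never inject failure into a system that is already unhealthy.
    if not is_healthy():
        return StageResult(False, "service unhealthy before experiment; not started")

    experiment_id = start_experiment()
    try:
        for attempt in range(1, checks + 1):
            time.sleep(interval_seconds)
            if not is_healthy():
                return StageResult(
                    False, f"health check {attempt} failed during {experiment_id}"
                )
        return StageResult(True, f"{checks} health checks passed during {experiment_id}")
    finally:
        # Halt the experiment whether the stage passed, failed, or raised.
        stop_experiment(experiment_id)


if __name__ == "__main__":
    # Stub callables so the sketch runs without any external service.
    result = run_chaos_stage(
        start_experiment=lambda: "demo-experiment",
        stop_experiment=lambda experiment_id: None,
        is_healthy=lambda: True,
    )
    print(result)
    raise SystemExit(0 if result.passed else 1)
```

Exiting non-zero on failure lets most CI systems mark the stage as failed, so a build that cannot tolerate the injected fault does not move on to the next stage.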

Gremlin for AWS: Demo from Install to Testing

Gremlin for AWS is a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. The cloud opens up a range of reliability challenges that didn’t exist before, especially for customers running distributed, mission-critical workloads. Teams experience the pain of failed migrations, frequent incidents, and reliability toil, but often struggle to modernize their approach to reliability as they modernize their infrastructure. That’s where Gremlin for AWS can help.

Want more software reliability? It starts with leadership

If you want to improve reliability, it has to be important from the top down. "As part of the CTO or leadership owning it, they need to tell folks that it's important in the product roadmap, in some of the development schedule, that we spend time on it, that the CEO is the person that holds people accountable, that they review the metrics, that they sit in the outages, that they understand the quality of the software."

Introducing Gremlin for AWS

Today, Gremlin is introducing Gremlin for AWS, a suite of tools to more easily find and fix the reliability risks that cause downtime on AWS. The cloud opens up a range of reliability challenges that didn’t exist before, especially for customers running distributed, mission-critical workloads. Teams experience the pain of failed migrations, frequent incidents, and reliability toil, but often struggle to modernize their approach to reliability as they modernize their infrastructure.

Don't measure reliability with a lagging indicator like downtime or MTTR

Your reliability measurement can't just be a lagging indicator. "How do you know your company is doing well at reliability? A lot of people will just look at how many outages have you had in the last year and how much customer pain have you caused? I think that's one side of the coin. That's the reactive lagging indicator of the health of our system. To really be good at this, we need a way to understand the risks and the sharp points so that we have an idea of what we're getting into."