Chaos Engineering

Achieving SLO Success with Golden Signals and Reliability Testing

The four Golden Signals are an easy and effective way to measure the most important aspects of a system, and when paired with a reliability management platform like Gremlin, they help you proactively meet your SLOs so you can meet your legal obligations and deliver the perfect customer experience.

5 essential resilience tests for a successful cloud migration

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Migrating to the cloud usually means faster deployments and easier scalability, but it also means latency. Cloud applications communicate over distributed networks, and while these networks are fast, little bits of latency can quickly add up.

How to test AWS managed services with Gremlin

Note: In this blog, we use “managed service providers” to refer to companies that provide hosted computing services, not managed IT service providers (MSPs). When was the last time you thought about the reliability of your cloud dependencies? The biggest challenge with using cloud platforms and SaaS services is also their biggest strength: the provider controls everything.

Are you testing for known reliability vulnerabilities?

"Risks have different priorities, but ultimately we want to be aware of those risks. Just like we want our security team to go scan for known vulnerabilities, our reliability team should be scanning for known vulnerabilities. And those are easy things we should go address. There's a second part of it, which is kind of just good engineering testing, which is: Hey, we have a set of test cases that we know need to pass.

How role-based access control (RBAC) works in Gremlin

Reliability testing and Chaos Engineering are essential for finding reliability risks and improving the resiliency of systems. Gremlin makes it easy to do so, but not every engineer needs access to the same experiments, systems, or services. That’s why we released customizable role-based access controls (RBAC), letting Gremlin customers control which actions their users can perform in Gremlin.

Build reliability efforts into your regular engineering schedule

Improving reliability might seem daunting, but you'd be surprised how much impact you can have with a relatively light lift. "Reliability doesn't need to be everybody stopped the world for a month, kind of a tech debt thing. If we spent 20 minutes a week, we could actually save ourselves a ton of time over the course of the year. The business needs to be efficient and agile, but it's important that the reliability is there. And so we really need people to be able to react quickly, adapt, and do a little bit along the way.

Destroy on Friday: The Big Day - A Chaos Engineering Experiment - Part 2

In my last blog post, I explained why we decided to destroy one third of our infrastructure in production just to see what would happen. This is part two, where I go over the big day. How did our chaos engineering experiment go? Find out below!

How to balance reliability with other DevOps priorities

Reliability efforts do take up some bandwidth, but in the end it's worth it—as our customers find out when their outage costs go down. "Everyone has their own priorities that they're dealing with. Given unlimited time and money, absolutely everyone would want to build the best possible system that is the most secure, performant, resilient, and everything.

How to Build Resilience Throughout Your SDLC: Lessons from a Top 10 Bank

Are your applications as reliable as you planned? How do you know? The only way to ensure systems are resilient to common failure conditions is to test them, yet many large enterprises struggle with the effort and expense to do so. In this webinar, Anantha Movva, a former head of SRE and Performance Engineering at one of the top 10 North American banks, will share how he drove Chaos Engineering and resilience testing adoption throughout his organization.

Chaos Testing Explained

Chaos testing is a part of site reliability engineering (SRE). In chaos testing, we intentionally break things in and around a given application. The purpose of chaos testing is to assess how software systems respond to scenarios like network outages, hardware failures, database failures, and server or cluster node failures in the infrastructure.