Chaos Engineering

What is Chaos Engineering? A Guide on Its History, Key Principles, and Benefits

Many organizations invest in high availability and disaster recovery for their key applications. Too many of these organizations, however, forgo the most important part of this process: testing failover regularly. Whether gripped by the fear of downtime or of dreaded DNS problems, development teams are frequently hesitant to test what they’ve built in the real world.

Why SREs Need to Embrace Chaos Engineering

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.

How to define and measure the reliability of a service

More and more teams are moving away from monolithic applications and towards microservice-based architectures. As part of this transition, development teams are taking more direct ownership over their applications, including their deployment and operation in production. A major challenge these teams face isn't in getting their code into production (we have containers to thank for that), but in making sure their services are reliable.

How Gremlin's reliability score works

In order to make reliability improvements tangible, there needs to be a way to quantify and track the reliability of systems and services in a meaningful way. This "reliability score" should indicate at a glance how likely a service is to withstand real-world causes of failure without having to wait for an incident to happen first. Gremlin's upcoming feature allows you to do just that.
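
As a rough illustration of what quantifying reliability could look like, the sketch below computes a 0 to 100 score as the weighted share of reliability checks a service passed. This is not Gremlin's actual scoring method, which the text above does not describe; the `CheckResult` type, the weights, and the check names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """Outcome of one reliability check (hypothetical structure)."""
    name: str
    passed: bool
    weight: float = 1.0  # assumed relative importance of the check


def reliability_score(results):
    """Return the weighted percentage of passed checks, from 0 to 100."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0  # no checks run yet, so nothing has been demonstrated
    earned = sum(r.weight for r in results if r.passed)
    return round(100 * earned / total, 1)


# Illustrative data only: three made-up checks for one service.
results = [
    CheckResult("survives host failure", passed=True, weight=2.0),
    CheckResult("tolerates dependency latency", passed=False, weight=1.0),
    CheckResult("scales under CPU pressure", passed=True, weight=1.0),
]
score = reliability_score(results)
print(score)
```

A score like this is only as meaningful as the checks behind it, so the choice of checks and weights matters more than the arithmetic.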

Chaos Engineering Tools: Build vs Buy

Chaos Engineering, where engineers intentionally inject failure to test the reliability of their systems, is becoming a regular practice for companies that value uptime and availability. As cloud-based systems have grown more complex, Chaos Engineering has become a critical part of the software testing and release process to uncover surprise dependencies, fix problems before they become 3 a.m. outages, and bake reliability into every feature.

How Does Chaos Engineering Work?

Chaos testing is a way to test the integrity of a system. Its purpose is to simulate, in a controlled environment, the kinds of failures that could crash a production system. This helps to identify weaknesses before they cause unplanned downtime that disrupts the user experience. Unlike standard testing, which checks a system's response against a predefined result, chaos testing does not have a predefined result. Rather, the entire purpose of the experiment is to find out new information about the system.
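
The shape of such an experiment can be sketched in a few lines: measure a baseline, inject a failure, observe what happens, and always roll back. The example below is a minimal sketch under stated assumptions, not a real tool. `ReplicatedService` is a hypothetical in-memory stand-in for a service, and the 99% success threshold is an arbitrary choice.

```python
class ReplicatedService:
    """Hypothetical stand-in for a service backed by several replicas."""

    def __init__(self, replicas):
        self.healthy = [True] * replicas

    def kill_replica(self, index):
        # Fault injection: mark one replica as failed.
        self.healthy[index] = False

    def restore_all(self):
        # Rollback: bring every replica back.
        self.healthy = [True] * len(self.healthy)

    def handle_request(self):
        # A request succeeds if at least one replica is still healthy.
        return any(self.healthy)


def success_rate(service, requests=100):
    """Fraction of requests that the service handled successfully."""
    ok = sum(1 for _ in range(requests) if service.handle_request())
    return ok / requests


def run_experiment(service, inject, rollback, threshold=0.99):
    """Measure a baseline, inject a fault, observe, and always roll back."""
    baseline = success_rate(service)
    inject(service)
    try:
        observed = success_rate(service)
    finally:
        rollback(service)  # limit the blast radius even if observation fails
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": observed >= threshold,
    }


redundant = ReplicatedService(replicas=3)
report = run_experiment(
    redundant,
    inject=lambda s: s.kill_replica(0),
    rollback=lambda s: s.restore_all(),
)
print(report)
```

Running the same experiment against `ReplicatedService(replicas=1)` reports that the hypothesis did not hold, which is exactly the kind of new information the experiment is meant to surface.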

Podcast: Break Things on Purpose | Developer Advocacy and Innersource with Aaron Clark

In this episode, Jason chats with Aaron Clark, Director of Developer Advocacy at the Royal Bank of Canada. Aaron shares what it was like starting out as a developer at RBC and working in early cloud development, and then transitioning to his role as a developer advocate. Jason and Aaron talk about the value of applying open source principles within organizations, or “innersource.” Their time ends with a discussion on continuing education and how to keep learning.