Chaos Engineering


How Netflix Uses Fault Injection To Truly Understand Their Resilience

Distributed systems such as microservices have defined software engineering over the last decade. Most advances have aimed at greater resilience, flexibility, and deployment speed at ever larger scales. For streaming giant Netflix, the migration to a complex cloud-based microservices architecture would not have been possible without a revolutionary testing method known as fault injection. With tools like Chaos Monkey, Netflix employs a cutting-edge testing toolkit.


Announcing our latest attacks to deal with meeting fatigue

With everyone working remotely, video conference tools like Zoom have been a critical part of maintaining business continuity. It’s truly amazing that we can continue to work and connect with one another, even during a time when getting together in an office hasn’t been possible…


How we approached IT incident predictions through chaos theory

In this blog post, we introduce the basics of chaos theory and complex systems, including how system incidents and failure prediction have been tackled in the past through deep learning. We also offer ideas on how chaotic time series analysis can be leveraged to approach this problem. IT teams in large technology companies spend years building networked systems and connected applications. Now, especially in the middle of a worldwide pandemic, software reliability is paramount.


It's all Chaos! And it Makes for Resilience at Scale

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.


Understanding Chaos Engineering And Why It Matters

If you’ve ever gone through the pain and anxiety of responding to an unexpected failure in your production system, then intentionally breaking things in production is probably not anywhere on your current “to do” list. However, chaos engineering does exactly that: it intentionally breaks parts of a production system to test its resilience. The intended outcomes of such experiments are not at all the same as those of the unplanned outage you may have experienced.


Validating the resilience of your API gateway with Chaos Engineering

API gateways are a critical component of distributed systems and cloud-native deployments. They perform many important functions including request routing, caching, user authentication, rate limiting, and metrics collection. However, this means that any failures in your API gateway can put your entire deployment at risk.

VMware Tanzu

Chaos Engineering, Explained

Chaos engineering has definitely become more popular in the decade or so since Netflix introduced it to the world via its Chaos Monkey service, but it’s far from ubiquitous. However, that will almost certainly change over time as more organizations become familiar with its core concepts, adopt application patterns and infrastructure that can tolerate failure, and understand that an investment in reliability today could save millions of dollars tomorrow.


What is Chaos Engineering and How to Implement It

Chaos Engineering is one of the hottest new approaches in DevOps. Netflix pioneered it back in 2008, and since then it’s been adopted by thousands of companies, from the biggest names in tech to small software companies. In our age of highly distributed cloud-based systems, Chaos Engineering promotes resilient system architectures by applying scientific principles. In this article, I’ll explain exactly what Chaos Engineering is and how you can make it work for your team.


How to test for expired TLS/SSL certificates using Gremlin

Transport Layer Security (TLS) and its predecessor, Secure Sockets Layer (SSL), are essential components of the modern Internet. By encrypting network communications, TLS protects both users and organizations from exposing their in-transit data to third parties. This is especially true for the web, where TLS is used to secure HTTP traffic (HTTPS) between backend servers and customers’ browsers.