
Chaos Engineering

Apica

Own Your Multiverse with Chaos Engineering

The multiverse theory holds that every possible outcome takes place somewhere. From sausage fingers to pig superheroes from another dimension, Hollywood loves showing us how differently things could play out if one small change is made. In reality, though, we can’t see everything that will happen, or everything that will go wrong in our tech stack, so it’s important to prepare for every kind of challenge you can imagine.

gremlin

The KPIs of improved reliability

This article was originally published on May 5, 2022. For many businesses, prioritizing reliability is an ongoing challenge. Building reliable systems and services is critical for growing revenue and customer trust, but other initiatives—like building new products and features—often take precedence since they provide a clearer and more immediate return. That's not to say reliability doesn't have clear value, but proving this value to business leaders can be tricky.

gremlin

How to test for expired TLS/SSL certificates using Gremlin

Transport Layer Security (TLS), and its preceding protocol, Secure Sockets Layer (SSL), are essential to the modern Internet. Encrypting network communications using TLS protects users and organizations from publicly exposing in-transit data to third parties. This is especially important for the web, where TLS secures HTTP traffic (HTTPS) between backend servers and customers’ browsers.
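As a rough illustration of what checking for certificate expiry involves, here is a minimal Python sketch using only the standard library. It is not Gremlin's method; the function names `days_until_expiry` and `fetch_not_after` are hypothetical, and the sketch assumes the server presents a certificate that still validates (an already-expired certificate would make the handshake fail before the date can be read).

```python
import socket
import ssl

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after: str, now: float) -> float:
    """Return days between `now` (epoch seconds) and a certificate's notAfter.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  1 00:00:00 2030 GMT". A negative result means expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()  # verifies the chain and hostname
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Hypothetical usage (requires network access):
#   import time
#   remaining = days_until_expiry(fetch_not_after("example.com"), time.time())
#   print(f"certificate expires in {remaining:.1f} days")
```

Passing `now` explicitly keeps the date arithmetic separate from the network call, so you can ask "what would this look like 30 days from now?" by adding to `now`, without touching the system clock.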

cloudify

Chaos Experiments as Day-2 Operations

Chaos engineering is a hot topic in the platform engineering field as organizations try to build more robust applications. Chaos engineering involves deliberately injecting faults into a system to observe its behavior and build confidence in the behavior of that system. The Chaos Monkey from Netflix pioneered the concept of deliberately inflicting chaos on a production system, but the discipline has grown extensively since this initial project.

gremlin

How reliability testing and load testing are complementary

How can you tell if your systems are reliable when under load? A common answer is to open your observability dashboards, wait for a high-traffic event (like Black Friday), and cross your fingers. While this approach will eventually give you an answer, it's far from ideal. Without proactive reliability and load testing, we have no idea if a system will hold up to real-world usage patterns, which could mean a production outage at the worst possible time.

gremlin

How to identify and map service dependencies

Modern applications are a web of interdependent services. As applications grow in size and complexity, and as more engineering teams adopt service-based architectures like microservices, this web becomes deeper and denser. Eventually, keeping track of the interdependencies between services becomes a complex and time-consuming task in and of itself. In addition, if any of these dependencies fails, it can have cascading impacts on the rest of your services and on the application as a whole.
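To make the idea of cascading impact concrete, the sketch below models dependencies as a plain dictionary and walks the graph in reverse to find every service that could be affected when one fails. The service names and the `impacted_by` helper are made up for illustration; real dependency maps are usually discovered from traffic or configuration rather than written by hand.

```python
from collections import deque


def impacted_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: for each dependency, record who relies on it.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


# Hypothetical service map: each key lists the services it calls.
services = {
    "web": ["api"],
    "api": ["db", "cache"],
    "batch": ["db"],
    "db": [],
    "cache": [],
}
```

With this map, `impacted_by("db", services)` returns `web`, `api`, and `batch`, even though `web` never talks to the database directly.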

gremlin

Managing and improving reliability using Gremlin's Reliability Dashboard

Part of a successful reliability program is being able to monitor and review your progress toward improving reliability. Being able to run tests on services is a big part of it, but how can you tell you're making progress if you can only see your latest test results? There should be a way to track improvements or regressions in your reliability testing practice across your organization in a way that's easy to digest. That's where the Reliability Dashboard comes in.