Chaos Engineering

What's the reliability of your checkout process?

Jul 7, 2020 By Jacob Plicque III In Gremlin

One of the reasons companies practice Chaos Engineering is to prevent expensive outages in retail (or anywhere, for that matter). This blog post walks through a common retail outage where the checkout process fails, then covers how to use Chaos Engineering to keep that outage from happening in the first place. Let’s dive in. Maybe you’ve been there.

Building more reliable financial systems with Chaos Engineering

Jul 2, 2020 By Taylor Smith In Gremlin

The financial services industry has built in more capital buffers to prevent market shocks from bringing about another economic collapse. In addition to these financial controls, many banks and personal trading platforms have begun building resiliency against information technology shocks. Despite these new precautions, we’re still seeing outages today, preventing customers from depositing and withdrawing their money, completing transactions, and executing trades during key events.

Gremlin CTO Matthew Fornaciari at the Virtual CTO Summit

Jul 1, 2020 By Gremlin In Gremlin

In this talk, Gremlin CTO Matthew Fornaciari discusses moving from reactive to proactive operations.

Performance tuning MongoDB with Chaos Engineering

Jun 26, 2020 By Andre Newman In Gremlin

You’ve pored over the MongoDB documentation, crafted highly polished and well-tuned queries, and confidently deployed your new code to production. Everything ran great at first, but once CPU or RAM usage hit a certain point, your queries suddenly slowed to a crawl. What happened, and how can you prepare for situations like this in the future? This is an unfortunate but common scenario with databases like MongoDB.

Announcing Status Checks to Ensure Safe Chaos Engineering Scenarios

Jun 23, 2020 By Matt Schillerstrom In Gremlin

One of the most important aspects of any Chaos Engineering program is knowing that every experiment is being run safely. And one of the simplest ways to ensure safe experiments is by having safeguards that prevent running chaos experiments on a system that is unhealthy or has an incident in progress. Today, Gremlin is excited to announce Status Checks, which run before you kick off a Chaos Engineering Scenario in order to verify your system is in a steady state.
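The gating idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `StatusCheck` and `run_scenario_if_healthy` are hypothetical names invented here, not part of Gremlin's API, and a real status check would typically query a monitoring or incident-management endpoint rather than return a hard-coded value.

```python
from typing import Callable, Iterable

# Hypothetical names throughout: none of these are Gremlin APIs.
StatusCheck = Callable[[], bool]  # returns True when the system looks healthy


def run_scenario_if_healthy(checks: Iterable[StatusCheck],
                            scenario: Callable[[], None]) -> bool:
    """Run `scenario` only if every status check passes; report whether it ran."""
    for check in checks:
        if not check():
            # The system is not in a steady state, so skip the experiment.
            return False
    scenario()
    return True
```

Stopping at the first failing check keeps the safeguard conservative: any single unhealthy signal is enough to hold back the experiment.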

Chaos Engineering and Windows: Mitigating common Windows failure scenarios

Jun 18, 2020 By Matthew Helmke In Gremlin

Microsoft Windows is a popular operating system for many enterprise applications, such as Microsoft SQL Server clusters and Microsoft Exchange Servers. About 30% of the world’s web application hosting systems are running Windows, making it an important part of every enterprise’s plans to prevent outages and enhance reliability.

Achieving AWS DevOps Competency Status (and What it Means for You)

Jun 16, 2020 By Eugene Wu In Gremlin

Chaos Engineering was conceived as a direct response to the complexity and nondeterministic nature of cloud-based applications. Thoughtful fault injection closes the gap between traditional testing methodologies and modern approaches to software engineering like microservices, continuous delivery, and DevOps.

Improving a Distributed System Post-Incident - Julius Zerwick - Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

In this session, we will dive into a case study of how a team can recover and improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems due to their incredible complexity and scale, and incidents are a fact of life with these systems.

Reliability Matters More Than Ever - Tammy Butow - Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

Chaos and uncertainty are all around us. Tammy Butow kicks off Failover Conf by sharing why reliability and resilience matter now more than ever, and how you can achieve them.

Built-in Application Resiliency - Allan Shone - Failover Conf 2020

May 5, 2020 By Gremlin In Gremlin

When starting a new application build, keeping an eye on resiliency from the outset prevents headaches down the line. There are many ways to tackle this, especially across different language environments and system ecosystems, but many techniques are shared across them all. In this talk, viewers will get a high-level list of takeaways to use as a reference later and will learn how to develop software that is more fault-tolerant and able to withstand the impact of failures.
