Chaos Engineering

How to fix Kubernetes init container errors

Dec 14, 2023 By Andre Newman In Gremlin

One of the most frustrating moments as a Kubernetes developer is when you go to launch your pod, but it fails to start because of a problem during initialization. Init containers are incredibly useful for setting up a pod before handing it off to the main container, but they introduce an additional point of failure. In this post, we'll take an in-depth look at init containers in Kubernetes: what they are, how they work, how they can fail, and what that means for your Kubernetes deployments.

Read Post

Gremlin

Read more about How to fix Kubernetes init container errors

Release Roundup Dec 2023: Driving reliability standards (and much more)

Dec 12, 2023 By Andre Newman In Gremlin

2023 is coming to a close and the holiday season is here, but that doesn’t mean things at Gremlin are slowing down. In fact, we’ve released a ton of new features and improvements to make testing and improving reliability even easier. Now you can run Chaos Engineering experiments in serverless environments, create custom reliability test suites, create more flexible Scenarios, and more easily identify critical components in your environment.

Read Post

Gremlin

Read more about Release Roundup Dec 2023: Driving reliability standards (and much more)

Failure Flags helps build testable, reliable software-without touching infrastructure

Dec 11, 2023 By Ryan Detwiller In Gremlin

Building provably reliable systems means building testable systems. Testing for failure conditions is the only way to reliably root out issues before they impact customers. However, most current Chaos Engineering and resilience testing is focused on the underlying infrastructure. This helps identify potentially catastrophic failures, but misses the more frequent failures that still significantly impact customer experience.

Read Post

Gremlin

Read more about Failure Flags helps build testable, reliable software-without touching infrastructure

Monitor your chaos engineering experiments with Steadybit's offering in the Datadog Marketplace

Dec 8, 2023 By Candace Shamieh In Datadog

Steadybit is a software reliability platform that uses chaos engineering and fault injection to help organizations improve the stability and performance of their applications. By allowing customers to simulate turbulent scenarios in a controlled environment, Steadybit enables you to identify and mitigate potential system issues to reduce downtime and improve resilience.

Read Post

Datadog

Read more about Monitor your chaos engineering experiments with Steadybit's offering in the Datadog Marketplace

Introducing Custom Reliability Test Suites, Scoring and Dashboards

Nov 16, 2023 By Ryan Detwiller In Gremlin

Last year, we released Reliability Management, a combination of pre-built reliability tests and scoring to give you a consistent way to define, test, and measure progress toward reliability standards across your organization. Today, we fulfill the next stage of that promise with the release of Custom Reliability Test Suites, Custom Scoring, and Dashboards.

Read Post

Gremlin

Read more about Introducing Custom Reliability Test Suites, Scoring and Dashboards

Treat reliability risks like security vulnerabilities by scanning and testing for them

Nov 13, 2023 By Gavin Cahill In Gremlin

Finding, prioritizing, and mitigating security vulnerabilities is an essential part of running software. We’ve all recognized that vulnerabilities exist and that new ones are introduced on a regular basis, so we make sure that we check for and remediate them on a regular basis. Even if the code passed all the security checks before being deployed, you still perform regular security tests to make sure everything’s secure.

Read Post

Gremlin

Read more about Treat reliability risks like security vulnerabilities by scanning and testing for them

Building a Culture of Reliability: Why SREs Can't Do It Alone

Nov 3, 2023 By Gremlin In Gremlin

Join Gremlin CTO and Founder Kolton Andrus to hear practical strategies for building a collaborative culture of reliability. High-velocity DevOps orgs and complex cloud-native architectures have made reliability harder than ever. Organizations are turning to SREs to make sure systems are reliable, but with so many stakeholders and competing priorities, many companies are still struggling to get ahead of the outages and incidents—SREs simply can't do it all by themselves.

View Video

Gremlin

Read more about Building a Culture of Reliability: Why SREs Can't Do It Alone

How to fix and prevent ImagePullBackOff events in Kubernetes

Oct 24, 2023 By Andre Newman In Gremlin

You'll often hear the term "containers" used to refer to the entire landscape of self-contained software packages: this includes tools like Docker and Kubernetes, platforms like Amazon Elastic Container Service (ECS), and even the process of building these packages. But there's an even more important layer that often gets overlooked, and that's container images.

Read Post

Gremlin

Read more about How to fix and prevent ImagePullBackOff events in Kubernetes

How to fix and prevent CrashLoopBackOff events in Kubernetes

Oct 18, 2023 By Andre Newman In Gremlin

It's one of the most dreaded words among Kubernetes users. Regardless of your software engineering skill or seniority level, chances are you've seen it at least once. There are a quarter of a million articles on the subject, and countless developer hours have been spent troubleshooting and fixing it. We're talking, of course, about CrashLoopBackOff.

Read Post

Gremlin

Read more about How to fix and prevent CrashLoopBackOff events in Kubernetes

What is Gremlin?

Oct 17, 2023 By Gremlin In Gremlin

Today’s technology leaders are facing a reliability gap. Customers expect their apps to be fast and available. But with Devops and distributed systems driving more speed and complexity, it’s harder than ever to find and fix the reliability risks that can impact customer experience–before it’s too late. To close the Reliability gap, we need a reliability strategy. One that’s proactive, measurable, built-in and automated. We need a reliability platform.

View Video