Latest Posts

The two kinds of failure testing

Feb 21, 2024 By Sam Rossoff In Gremlin

Fault injection is a tool, and like all tools, there are a variety of ways operators can employ it, but most of them tend to fall into one of two categories.

10 Most Common Kubernetes Reliability Risks

Feb 14, 2024 By Gavin Cahill In Gremlin

Reliability risks are potential points of failure in your system where an outage could occur. If you can find and remediate reliability risks, then you can prevent incidents before they happen. In complex Kubernetes systems, these reliability risks can take a wide variety of forms, including node failures, pod or container crashes, missing autoscaling rules, misconfigured load balancing or application gateway rules, pod crash loops, and more. And they’re more prevalent than you might think.

How dependency discovery works in Gremlin

Feb 13, 2024 By Andre Newman In Gremlin

Modern applications are rarely created entirely from scratch. Instead, they rely on a framework of pre-existing applications and services, each adding specific features and functionality. These dependencies empower teams to build and deploy applications more efficiently, but they bring their own set of challenges. Tracking, managing, and updating these dependencies is difficult, especially in large, complex applications where dependencies are likely managed by different teams.

How to make your services zone redundant

Feb 8, 2024 By Andre Newman In Gremlin

In January of 2020, an entire availability zone (AZ) in AWS’ Sydney region suddenly went dark. Multiple facilities lost power, preventing customers from accessing EC2 instances and Elastic Block Store (EBS) volumes. Customers who didn’t have backup infrastructure in another zone had to wait nearly 8 hours before service was restored, and even then, some EBS volumes couldn’t be recovered. Major cloud provider outages are rare, but they happen nonetheless.

Measuring the impact of your reliability work with reports

Feb 6, 2024 By Andre Newman In Gremlin

Improving reliability is important, but how do you prove that your efforts are having an impact? A critical part of reliability work is having the tools to measure and track your progress. Gremlin supports this by providing several built-in reports, which update automatically and are available on demand. This blog post is a quick introduction to Gremlin’s reporting capabilities.

Reducing cloud reliability risks with the AWS Well-Architected Framework

Feb 1, 2024 By Andre Newman In Gremlin

Designing and deploying applications in the cloud can be a labyrinthian exercise. There are dozens of cloud providers, each offering dozens of services, and each of those services has any number of configurations. How are you supposed to architect your systems in a way that gives your customers the best possible experience? AWS recognized this, and in response, they created the AWS Well-Architected Framework (WAF) to guide customers.

How Gremlin's dependency discovery feature works

Jan 22, 2024 By Andre Newman In Gremlin

Modern applications are rarely created entirely from scratch. Instead, they rely on a framework of pre-existing applications and services, each adding specific features and functionality. These dependencies empower teams to build and deploy applications more efficiently, but they bring their own set of challenges. Tracking, managing, and updating these dependencies is difficult, especially in large, complex applications where dependencies are likely managed by different teams.

How to troubleshoot unschedulable Pods in Kubernetes

Dec 19, 2023 By Andre Newman In Gremlin

Kubernetes is built to scale, and with managed Kubernetes services, you can deploy a Pod without having to worry about capacity planning at all. So why do Pods sometimes become stuck in an "Unschedulable" state? How do you end up with Pods that have been "Pending" for several minutes? In this post, we'll dig into the reasons Pods fail to schedule. We'll look at why it happens, how to troubleshoot it, and ways you can prevent it.

How to fix Kubernetes init container errors

Dec 14, 2023 By Andre Newman In Gremlin

One of the most frustrating moments as a Kubernetes developer is when you go to launch your pod, but it fails to start because of a problem during initialization. Init containers are incredibly useful for setting up a pod before handing it off to the main container, but they introduce an additional point of failure. In this post, we'll take an in-depth look at init containers in Kubernetes: what they are, how they work, how they can fail, and what that means for your Kubernetes deployments.

Release Roundup Dec 2023: Driving reliability standards (and much more)

Dec 12, 2023 By Andre Newman In Gremlin

2023 is coming to a close and the holiday season is here, but that doesn’t mean things at Gremlin are slowing down. In fact, we’ve released a ton of new features and improvements to make testing and improving reliability even easier. Now you can run Chaos Engineering experiments in serverless environments, create custom reliability test suites, create more flexible Scenarios, and more easily identify critical components in your environment.
