Latest Posts

Where to automate resilience testing in your SDLC

Apr 9, 2024 By Ryan Detwiller In Gremlin

When organizations begin to deploy resilience testing or Chaos Engineering, there’s a natural question: can we integrate this with our CI/CD pipeline or release automation tools? After all, you’re likely running unit, performance, and integration tests already—is resiliency different? The short answer is yes—to both. Integration is possible, but resiliency is different, so automation is a nuanced conversation.

Resiliency is different on AWS: Here's how to manage it

Apr 2, 2024 By Andre Newman In Gremlin

There’s a common misconception about running workloads in the cloud: the cloud provider is responsible for reliability. After all, they’re hosting the infrastructure, services, and APIs. That leaves little else for their customers to manage, other than the workloads themselves…right?

Fault Injection in your release automation

Mar 18, 2024 By Sam Rossoff In Gremlin

One of the real successes of the Agile software development movement has been the push for regular, frequent deployments. This has manifested as build and deployment automation and the general adoption of CI/CD. As engineers automate more of their software release lifecycle, an important question is how to automate Quality Assurance, which includes resilience testing and, more specifically, Fault Injection.

How to scale your systems based on CPU utilization

Mar 14, 2024 By Andre Newman In Gremlin

CPU usage is one of the most common metrics used in observability and cloud computing, and for good reason: CPU usage represents the amount of work a system is performing, and if it’s near 100% capacity, adding more work could make the system unstable. The solution is to scale: add more hosts with more CPU capacity, migrate some of your workloads to the new hosts, and split the traffic between them using a load balancer.
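As a rough illustration of the scaling arithmetic described above, the sketch below estimates how many identical hosts would bring average CPU utilization under a chosen target. The function name `hosts_needed`, the 70% default target, and the assumption that a load balancer spreads work evenly across identical hosts are all illustrative choices for this example, not details taken from the post.

```python
import math


def hosts_needed(current_hosts: int, avg_cpu_percent: float,
                 target_percent: float = 70.0) -> int:
    """Estimate the host count that brings average CPU usage to the target.

    Assumes identical hosts and evenly balanced traffic (simplifications
    made for this sketch; real workloads rarely split this cleanly).
    """
    if current_hosts < 1:
        raise ValueError("current_hosts must be at least 1")
    if avg_cpu_percent < 0:
        raise ValueError("avg_cpu_percent must not be negative")
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")

    # Total work, expressed in "host-percent" units.
    total_load = current_hosts * avg_cpu_percent

    # This sketch only scales out; scaling in is a separate decision.
    return max(current_hosts, math.ceil(total_load / target_percent))


if __name__ == "__main__":
    # Two hosts averaging 95% CPU, aiming for 70% per host.
    print(hosts_needed(2, 95.0))
```

Under these assumptions, two hosts running at 95% would need a third host to get below the 70% target. A production autoscaler would also have to account for uneven load, startup time, and the cost of idle capacity.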

Release Roundup March 2024: More ways to discover and test your services

Mar 12, 2024 By Andre Newman In Gremlin

2024 is off to a fast start here at Gremlin. Since our last release roundup, we’ve released new experiment types, new features to improve integration with cloud platforms, and improvements to our auto-detection processes. Now you can push processes to their limits, find dependencies even more easily, limit when tests can be run, and much more. We also introduced a slew of platform updates that improve efficiency, performance, and user experience in the Gremlin web application.

Introducing Process Exhaustion: How to scale your services without overwhelming your systems

Mar 11, 2024 By Andre Newman In Gremlin

We rarely think about how many processes are running on our systems. Modern CPUs are powerful enough to run thousands of processes concurrently, but at what point do our systems become oversaturated? When you’re running large-scale distributed applications, you might reach this limit sooner than you'd expect. How can you determine what that limit is, and how does that affect the number and complexity of the workloads you deploy?

How to validate memory-intensive workloads scale in the cloud

Mar 6, 2024 By Andre Newman In Gremlin

Memory is a surprisingly difficult thing to get right in cloud environments. The amount of memory (also called RAM, or random-access memory) in a system indirectly determines how many processes can run on it and how large those processes can get. You might be able to run a dozen database instances on a single host, but that same host may struggle to run a single large language model.

Your reliability scorecard: How to measure and track service reliability

Mar 5, 2024 By Andre Newman In Gremlin

If your organization asked you to report on the reliability improvements you’ve made over the past 90 days, would you be able to pull up a report? If you’re like many engineers, this question might make you anxious. Reliability is a difficult metric to quantify in a meaningful way, let alone measure.

The case for Fault Injection testing in Production

Feb 27, 2024 By Sam Rossoff In Gremlin

Many organizations that are looking to introduce Fault Injection as a testing technique start with non-production environments, but don't always go back and reconsider that choice as they mature beyond initial assessment. However, there's a strong case for running these tests in your live systems. It's important to consider the trade-offs when choosing to test in production or non-production environments, as the choice can have far-reaching impacts on the efficacy and cost of improving the resilience of software.

How to use host redundancy to improve service reliability and availability

Feb 22, 2024 By Andre Newman In Gremlin

Cloud computing has made provisioning new servers easy, fast, and relatively cheap. Almost anyone can log into a cloud console, spin up a new server, and deploy an application. And if they need greater uptime, major cloud providers include all kinds of settings, services, and configurations to add fault tolerance and failover. So why is it that many services fail when a single server instance fails?
