Latest News

Simulating artificial intelligence service outages with Gremlin

The AI (artificial intelligence) landscape wouldn’t be where it is today without AI-as-a-service (AIaaS) providers like OpenAI, AWS, and Google Cloud. These companies have made running AI models as easy as clicking a button. As a result, more applications have been able to use AI services for data analysis, content generation, media production, and much more.

Three reliability best practices when using AI agents for coding

One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes, like forgetting to change defaults, making small typos, or leaving conflicting timeouts in the code. It’s why 27.8% of unplanned outages are caused by someone making a change to the environment. Fortunately, reliability testing can help you catch these errors before they cause outages.
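To make one of those error types concrete, here is a hypothetical sketch of a "conflicting timeouts" mistake and the kind of simple check that reliability testing is meant to surface. The component names and timeout values are illustrative assumptions, not taken from the article.

```python
# Hypothetical "conflicting timeouts" mistake: the proxy in front of a service
# gives up after 30 seconds, but the service itself may spend far longer on a
# single request once retries are counted, so clients see proxy errors while
# the backend keeps working. All names and values are illustrative.

PROXY_TIMEOUT_S = 30          # e.g., a gateway or load balancer timeout
APP_REQUEST_TIMEOUT_S = 45    # how long the app waits on its downstream call
APP_RETRIES = 2               # the app retries the call twice on failure

worst_case_app_time = APP_REQUEST_TIMEOUT_S * (1 + APP_RETRIES)  # 135 seconds

# A basic reliability check can catch the conflict before it causes an outage:
if worst_case_app_time > PROXY_TIMEOUT_S:
    print(
        f"Conflicting timeouts: the app may spend {worst_case_app_time}s on a "
        f"request, but the proxy cuts the connection after {PROXY_TIMEOUT_S}s."
    )
```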

How to Build Observability into Chaos Engineering

If you've ever deployed a distributed system at scale, you know things break—often in ways you never expected. That’s where Chaos Engineering comes in. But running chaos experiments without robust observability is like debugging blindfolded. This guide will walk you through how observability empowers Chaos Engineering, ensuring that your experiments yield meaningful insights instead of just causing chaos for chaos’ sake.

How to make your AI-as-a-Service more resilient

When you think about “AI reliability,” what comes to mind? If you’re like most people, you’re probably thinking of generative AI model accuracy, like the quality of outputs from ChatGPT, Stable Diffusion, and Sora. While this is certainly important, there’s an even more fundamental type of reliability: the reliability of the infrastructure that your AI models and applications are running on. AI infrastructure is complex, distributed, and automated, making it highly susceptible to failure.

How the Gremlin agent fails safely

Testing shouldn’t feel risky. While it might sound counterintuitive, certain types of testing can actually increase risks to your systems. Load testing, for example, is a great way to see how your systems behave under pressure, but it can also cause those same systems to fail if they aren’t equipped to handle the load. For some types of testing, such as reliability testing and Chaos Engineering, this kind of failure is necessary.

How to fix the root cause of a failed reliability test

You’re well on your way to becoming more reliable. You’ve added your services, found and fixed some Detected Risks, and run your first set of reliability tests. However, some of your tests returned as “Failed.” Not to worry: this isn’t a reflection of you or your engineering skills but rather an opportunity to learn more about how your systems work and, more importantly, how to make them more resilient.

Maximizing your reliability on AWS

Cloud providers like AWS excel at creating reliable platforms for developers to build on. But while the platforms may be rock-solid, this doesn’t guarantee your applications will be too. It’s the provider’s job to offer stable infrastructure, but you’re still on the hook for making your workloads resilient, recoverable, and fault-tolerant. There’s only one problem: cloud platforms are essentially black boxes.

What's the ROI of reliability?

Reliability doesn’t happen by itself. Making a system reliable and resilient enough that your customers can count on it takes a combination of time, effort, and resources that could be used elsewhere, such as shipping new features. It’s also not optional. In an era where downtime costs an average of $14,056/min (or $843,360/hr), outages have a material impact on businesses. Unfortunately, most systems are sprawling and complex enough that even small amounts of downtime can add up quickly.
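As a rough back-of-the-envelope check on how quickly small amounts of downtime add up, here is a minimal sketch using the per-minute figure above; the example outage durations are hypothetical.

```python
# Back-of-the-envelope downtime cost, using the $14,056/minute figure cited
# above. The example outage durations are hypothetical.

COST_PER_MINUTE = 14_056  # USD

def downtime_cost(minutes: float) -> float:
    return minutes * COST_PER_MINUTE

print(f"One 60-minute outage:        ${downtime_cost(60):,.0f}")      # $843,360
print(f"5 minutes/month for a year:  ${downtime_cost(5 * 12):,.0f}")  # also $843,360
```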

Manage your reliability work more easily with Gremlin's newest features

Reliability testing is ongoing work, and tracking that work can be difficult in large organizations. Engineers run one-off experiments, scheduled Scenarios run in the background, and, for more mature teams, CI/CD workflows fire off automated tests on demand. According to our own product metrics, teams run an average of 200 to 500 tests each day! With so much happening, it’s hard to keep track of everything going on in Gremlin—until now.