Latest Posts

Creating an agentic feedback loop with reliability guardrails

Jun 25, 2026 By Gavin Cahill In Gremlin

Reliability guardrails help make sure that your applications stay reliable without slowing development down. In an earlier post, we explained why agentic AI development needs reliability guardrails: the increased speed of AI development demands automated guardrails to verify resilience. We also covered what kinds of tests these guardrails should include. But that’s only the beginning. By themselves, guardrails act as a gate to ensure resilience mechanisms hold under rapid changes.

Why agentic AI development needs reliability guardrails

May 15, 2026 By Gavin Cahill In Gremlin

AI has massively accelerated code deployment. In fact, since the introduction of agentic coding, GitHub has seen exponential growth in PRs, commits, and new repos. GitHub originally predicted this growth would require 10X capacity; it now estimates 30X capacity, and the biggest driver is agentic development. Companies across industries are building agentic pipelines to ship features faster than ever before. That acceleration isn’t without risk.

The hidden reliability risks in your agentic AI workflows

Mar 17, 2026 By Andre Newman In Gremlin

Artificial intelligence recently took a major leap from “saying” to “doing.” Instead of simple back-and-forth chats, we’re now allowing automated AI processes to take action on our behalf—from responding to emails to building and deploying complete applications. This shift from “assistant” to “actor” can make applications more capable, but it also creates additional failure modes.

How Gremlin makes disaster recovery testing easier and faster

Mar 4, 2026 By Gavin Cahill In Gremlin

There’s a common saying: “A backup isn’t a backup until you’ve tested it.” The same is true whether it’s a simple database failover or an entire data center or cloud provider failover. You simply won’t know whether it works unless you test it. Disaster recovery testing can be an expensive, painful, and arduous process, but companies require it for a reason. And not just for disasters like hurricanes, flooding, or earthquakes.

Announcing Disaster Recovery Testing

Feb 3, 2026 By Andre Newman In Gremlin

Today, we’re launching a new approach to running disaster recovery tests, validating failover processes, and ensuring compliance with regulations such as DORA. With Disaster Recovery Testing, you can run zone, region, and datacenter-scale experiments across your entire Gremlin organization simultaneously.

Reliability Resolutions: How to build effective reliability programs that won't fade away

Jan 21, 2026 By Gavin Cahill In Gremlin

Did you know the third week of January is the most common time for people to give up on New Year’s resolutions? Whether it’s exercising more, learning a new language, or just trying to drink less coffee, that initial surge of fresh New Year’s energy is fading. If you want to make a resolution stick, this is the key time to make a lasting change. The same is true of any reliability resolutions you might have made.

How to test application resiliency by simulating the Cloudflare December 2025 outage

Dec 19, 2025 By Gavin Cahill In Gremlin

This fall and winter have had their share of major outages (including AWS, Azure, and Cloudflare), and December was no exception. On December 5, 2025, Cloudflare suffered a 25-minute outage during which about 28% of the HTTP traffic it serves received HTTP 500 errors. Since Cloudflare handles an average of 81 million HTTP requests per second, this represents a substantial chunk of internet traffic, including traffic to LinkedIn, Zoom, and Downdetector.

Release Roundup 2025: Reliability across AI, on-prem, and applications

Dec 15, 2025 By Andre Newman In Gremlin

2025 was a stark reminder of why reliability is so critical in the tech sector. The year wrapped up with multiple high-profile outages across several major cloud providers, costing companies around the world billions of dollars. Building resilient systems has never been more of a priority, especially as we move into the era of agentic AI.

How to use Gremlin's Reliability Report

Dec 12, 2025 By Gavin Cahill In Gremlin

Modern applications can easily include hundreds of discrete services, all of which need to be reliable in order for the application to function correctly. While running tests on a handful of critical services can lead to small reliability improvements, real impact requires testing and increased reliability visibility across your entire organization. That’s the logic behind the new, improved Reliability Reports within Gremlin.

Reliability lessons from the 2025 Cloudflare outage

Nov 20, 2025 By Andre Newman In Gremlin

On November 18, 2025, X, ChatGPT, Shopify, and many other major sites went offline simultaneously. Even Downdetector, Ookla’s popular outage tracking website, briefly went offline. What caused this issue? Why were so many major websites affected by it? And what steps can you take to reduce the impact on your own applications?
