How to use Gremlin's Reliability Report

Dec 12, 2025 By Gavin Cahill In Gremlin

Modern applications can easily include hundreds of discrete services, all of which need to be reliable for the application to function correctly. While running tests on a handful of critical services can lead to small reliability improvements, real impact requires testing and greater visibility into reliability across your entire organization. That’s the logic behind the new, improved Reliability Reports within Gremlin.

Reliability testing creates important conversations

Dec 9, 2025 By Gremlin In Gremlin

Without reliability tests, you’re left in the dark about how your system will react to failures. It leaves you open to outages and prevents vital conversations that will help improve your reliability.

LLMs need a different approach to reliability

Dec 4, 2025 By Gremlin In Gremlin

In this webinar clip, Alex Nauda, CTO of Nobl9, explains how LLMs are non-deterministic, which means you need to shift how you monitor the reliability of your AI systems.

Reliability lessons from the 2025 Cloudflare outage

Nov 20, 2025 By Andre Newman In Gremlin

On November 18, 2025, X, ChatGPT, Shopify, and many other major sites went offline simultaneously. Even Downdetector, Ookla’s popular outage tracking website, briefly went offline. What caused this issue? Why were so many major websites affected by it? And what steps can you take to reduce the impact on your own applications?

Reliability is trust in your systems and your team

Nov 20, 2025 By Gremlin In Gremlin

Reliable systems mean you can trust not only your systems, but also your team.

Chaos Engineering: Be thoughtful about failure

Nov 18, 2025 By Gremlin In Gremlin

Chaos Engineering isn’t about breaking things. It’s about being thoughtful about failure to learn something. Gremlin founder and CEO Kolton Andrus lays it out in this clip from when he sat down with Stephen Townshend on the Slight Reliability podcast!

Reliability lessons from the 2025 Microsoft Azure Front Door outage

Nov 17, 2025 By Gavin Cahill In Gremlin

On October 29th, 2025, Azure Front Door suffered an outage that impacted Microsoft services on a global level, including Microsoft 365, Outlook, Xbox Live, Copilot, and more. It also affected Microsoft Azure, meaning companies like Costco, Starbucks, and Alaska Airlines ran into issues for both customer-facing and internal systems. The root of the issue was a misconfiguration in the data plane for Azure Front Door and the Azure Content Delivery Network.

Improve Kubernetes reliability faster with Gremlin and Dynatrace

Nov 10, 2025 By Gavin Cahill In Gremlin

It’s now easier than ever to start testing Kubernetes with Dynatrace and Gremlin. With a new strategic integration, Kubernetes services set up in Dynatrace are automatically discovered in Gremlin, making test setup simple and fast. At a time when AI is driving massive expansions in infrastructure and dramatically increasing deployment speed, being able to set up and test new services quickly is more important than ever.

Reliability lessons from the 2025 AWS DynamoDB outage

Nov 7, 2025 By Gavin Cahill In Gremlin

On October 19th and 20th, 2025, the AWS region US-EAST-1 suffered a massive outage. What started with a 3-hour Amazon DynamoDB outage from a DNS issue led to an Amazon EC2 outage that lasted an additional 12 hours before normal service was restored. Over the course of the outage, there were over 17 million outage reports as companies like Snapchat, Roblox, Amazon, Reddit, Venmo, and more were impacted.
