
4 Chaos Engineering recommendations from Gartner

Gartner recently published their annual Hype Cycle reports, including the Hype Cycle for Infrastructure Platforms. Designed to help heads of infrastructure and IT operations make informed decisions about infrastructure platforms, it includes over thirty different topics covering everything from platform engineering to distributed cloud to policy as code—including Chaos Engineering and Site Reliability Engineering.

Insights to keep AI applications reliable

AI has become a massive investment for companies. Engineering teams across industries are integrating AI into their products, whether through homegrown, self-managed models or third-party model integrations. But no matter how much AI shifts the user experience, it’s still an application, which means your engineering team still needs to operate it and keep it reliable. At the same time, AI applications add complexity and new failure modes that require a shift in your approach.

How to be prepared for cloud provider outages

GCP’s recent outage on June 12th was a reminder of just how interconnected modern architectures are. The 2-hour-28-minute outage affected dozens of companies and spanned more than 80 Google services and products. But what was really illuminating was how far the outage spread due to hidden dependency risks. Many companies that don’t run on GCP were startled to find their services suddenly affected because their own dependencies, or the vendors they relied on, used GCP.

Three key facts about serverless reliability

Serverless computing requires a significant shift in how organizations think about deploying and managing applications. No longer do Ops teams need to think about provisioning servers, installing operating system patches, and writing shell scripts to manage deployments. While serverless takes away much of this responsibility, one aspect still needs to be handled thoughtfully: reliability. In this blog, we’ll look at three important facts about serverless reliability that teams often overlook.

Ensuring your AI systems can scale to meet demand

The scale of traffic handled by AI systems is hard to overstate. Over half of all organizations in India, the UAE, Singapore, and China use AI, and traffic from generative AI sources has jumped by 1,200% since July 2024. While demand for AI-powered workloads is steadily increasing overall, traffic to individual AI providers is far less predictable. User demand spikes and wanes unexpectedly, yet, as with any service, users expect you to be available and responsive at all times.

How a major retailer tested critical serverless systems with Failure Flags

Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service of their payment applications could failover correctly between regions in case of an outage. But there was one snag: the service was built using AWS Lambda. This meant infrastructure-focused tests would have trouble replicating the failure conditions necessary to test the failover due to Lambda’s serverless model.
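To see why infrastructure-focused tests fall short here: with Lambda there is no host or container to attack, so the fault has to be injected inside the function’s own code path. The sketch below illustrates the general pattern of application-level fault injection in a serverless handler. It is a minimal, hypothetical example — the `FAULT_RATE` environment variable, `charge_payment`, and `handler` names are illustrative and are not the actual Failure Flags SDK API.

```python
import os
import random

# Illustrative fault toggle: 0.0 disables injection, 1.0 always injects.
# (Hypothetical name; real fault-injection SDKs manage this centrally.)
FAULT_RATE = float(os.environ.get("FAULT_RATE", "0"))


def charge_payment(event):
    """Stand-in for the critical payment logic under test."""
    return {"status": "charged", "order": event.get("order_id")}


def handler(event, context=None):
    """Lambda-style entry point with an in-process fault check.

    Because there is no server to degrade, the experiment runs inside
    the function: when active, it raises the failure the region
    failover is supposed to absorb.
    """
    if random.random() < FAULT_RATE:
        raise TimeoutError("injected fault: simulated regional outage")
    return charge_payment(event)
```

With the toggle at zero the handler behaves normally; flipping it on during an experiment lets the team observe whether traffic actually fails over to the healthy region.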

Simulating artificial intelligence service outages with Gremlin

The AI (artificial intelligence) landscape wouldn’t be where it is today without AI-as-a-service (AIaaS) providers like OpenAI, AWS, and Google Cloud. These companies have made running AI models as easy as clicking a button. As a result, more applications have been able to use AI services for data analysis, content generation, media production, and much more.

Three reliability best practices when using AI agents for coding

One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes, like forgetting to change defaults, making small typos, or leaving conflicting timeouts in the code. It’s why 27.8% of unplanned outages are caused by someone making a change to the environment. Fortunately, reliability testing can help you catch these errors before they cause outages.

How to make your AI-as-a-Service more resilient

When you think about “AI reliability,” what comes to mind? If you’re like most people, you’re probably thinking of generative AI model accuracy, like responses from ChatGPT, Stable Diffusion, and Sora. While this is certainly important, there’s an even more fundamental type of reliability: the reliability of the infrastructure that your AI models and applications are running on. AI infrastructure is complex, distributed, and automated, making it highly susceptible to failure.