Monthly Archive

You don't have to live with outages and late nights

Aug 28, 2025 By Gremlin In Gremlin

Outages don’t have to be part of your life, and engineers don’t have to burn out playing the hero. Spread out your effort and build reliability without the drama. Transcript: You should be great at dealing with outages, but your customers don't care. There's no medals here. No one should have incentive to be paged. There's nothing good about being in a war room for 10 days or in the holiday season in 12-hour shifts around the clock just in case something happens.

View Video

How to get fast, easy insights with the Gremlin MCP Server

Aug 28, 2025 By Gavin Cahill In Gremlin

Chaos Engineering and reliability testing give you visibility into the actual reliability of your services by simulating real-world failure conditions. But what if you could dig into the testing and results data using AI to quickly uncover new insights? That’s the logic behind the Gremlin MCP Server. Released as part of Reliability Intelligence, the Gremlin MCP Server allows you to bring your LLM of choice to explore your Gremlin data and find opportunities to get more out of Gremlin.

Read Post

Failover and cloud aren't enough for reliability

Aug 26, 2025 By Gremlin In Gremlin

Amin Momin of @CapgeminiGlobal talks about how reliability takes dedicated effort beyond just using the cloud and setting up failover. Full transcript: There are two misconceptions about reliability. One is people only think failover is reliability. Just doing the failover, that will be enough from the reliability point of view. That's the first one. And the second one: we are deployed into the cloud, so it is the service provider's responsibility to provide the reliability.

View Video

Fix issues faster with Recommended Remediations

Aug 22, 2025 By Gavin Cahill In Gremlin

You’ve successfully run a Fault Injection test and uncovered a new failure mode before it impacted customers. And the failure could have taken down your whole system if it had happened in production. Now what? Since this is a potential P1 outage, you absolutely need to address the issue, but that’s going to take some time as you dig through the service to track down the problem. Unfortunately, this is a common conflict.

Read Post

True reliability takes the whole team

Aug 22, 2025 By Gremlin In Gremlin

Reliability takes the whole team working together. Full transcript: If you really want to get good at measuring your reliability, then you have to work together as a team. Once your software engineer organization has decided, "We're gonna test these applications to make sure that they have redundancy, availability, resilience." Just stick to that framework that you come up with as a team.

View Video

Encourage the boring reliability work

Aug 20, 2025 By Gremlin In Gremlin

Proactive, regular reliability work is boring, repetitive, and EFFECTIVE. And if leadership wants the incredible results it brings, they have to encourage the right behavior.

View Video

Reliability upholds your promise to users

Aug 19, 2025 By Gremlin In Gremlin

Consistent systems are reliable systems, according to Ganesh Seetharaman, Managing Director at @Deloitte. Full transcript: Strong reliability is demonstrated when systems consistently work as expected even during peak demand or unexpected events. When issues do happen, they are resolved quickly and transparently so users experience minimal disruption. Reliability also means data integrity. No matter how much stress the system is under, information needs to be accurate and secure.

View Video

How Experiment Analysis uncovers the cause behind failures

Aug 15, 2025 By Gavin Cahill In Gremlin

Chaos Engineering has proven itself to be incredibly effective at tracking down failure modes, remediating reliability issues, and preventing risks before they happen. Unfortunately, it can also come with a steep adoption curve. In order to get the most out of Fault Injection testing, a practitioner needs to have a deep knowledge of the service, its expected behavior, and the code behind it. Ultimately, the rewards are worth the time.

Read Post

Reliability is when customers aren't impacted

Aug 14, 2025 By Gremlin In Gremlin

Ultimately, a system is reliable when customers and engineers can count on it. Full transcript: When I get to hear stories like, "Hey, we just had our holiday sales event kick off and everything went smoothly and I didn't have to wake up in the middle of the night." That is really the true definition of reliability: these people that are constantly hands-on keyboard in charge of making sure that people like myself and like you aren't impacted when we're going to, for example, buy a new pair of sneakers, or we're going to get some sort of limited edition release that's coming out, right?

View Video

Reliability isn't an afterthought

Aug 12, 2025 By Gremlin In Gremlin

“Reliability must be a crucial outcome for all of the architectures.” —Anish Behanan from @CapgeminiGlobal.

View Video

Introducing Reliability Intelligence

Aug 11, 2025 By Gremlin In Gremlin

Reliability Intelligence draws on Gremlin expertise with every test to show you how the test failed and recommend remediation.

View Video

Reliability Intelligence: your reliability expert

Aug 11, 2025 By Gavin Cahill In Gremlin

For the last decade, Gremlin has helped Fortune 500 organizations with critical uptime requirements proactively uncover reliability risks and prevent costly outages. We started with Chaos Engineering, then built Reliability Management to help teams standardize and scale their testing efforts. Today, we take another leap forward with the release of Reliability Intelligence. Reliability Intelligence draws on Gremlin expertise with each test to show you what happened and recommend remediation.

Read Post

The riskiest thing you can do is not measure your risk

Aug 8, 2025 By Gremlin In Gremlin

Hiring good engineers is important, but it’s not enough to prevent outages. You need to measure and track your risk to get real results. Full transcript: My name's Jeff Nickoloff. I'm a principal engineer here at Gremlin. What I hear non-technical functions talk about is really they are much happier to sort of lean on their great engineers. Oh, we've got a great engineering culture. "We don't have reliability issues because we hire the best people."

View Video

Avoid the Chaos Engineering bottleneck

Aug 6, 2025 By Gremlin In Gremlin

Chaos Engineering is great, but by itself it can create bottlenecks that limit your reliability journey. Full transcript: One of the things we've learned while building Gremlin and being the first Chaos Engineering tool to market is with all the greatness that comes with this approach, we've learned some of the downfalls, some of the drawbacks. And one of those is how you scale this practice.

View Video

Reliability is the absence of uncertainty

Aug 5, 2025 By Gremlin In Gremlin

Are your teams truly ready when they ship code? Amin Momin of @CapgeminiGlobal talks about how true reliability is the absence of uncertainty. Full transcript: Reliability to me signifies the absence of uncertainty. Whenever we go to production, we don't want anything to be unknown.

View Video
