San Jose, CA, USA
Jul 14, 2022   |  By Andre Newman
More and more teams are moving away from monolithic applications and towards microservice-based architectures. As part of this transition, development teams are taking more direct ownership over their applications, including their deployment and operation in production. A major challenge these teams face isn't in getting their code into production (we have containers to thank for that), but in making sure their services are reliable.
Jul 8, 2022   |  By Gremlin
Chaos Engineering, where engineers intentionally inject failure to test the reliability of their systems, is becoming a regular practice for companies that value uptime and availability. As cloud-based systems have grown more complex, Chaos Engineering has become a critical part of the software testing and release process to uncover surprise dependencies, fix problems before they become 3 a.m. outages, and bake reliability into every feature.
Jun 14, 2022   |  By Jason Yee
In this episode, Jason chats with Aaron Clark, Director of Developer Advocacy at the Royal Bank of Canada. Aaron shares what it was like starting out as a developer at RBC and working in early cloud development, and then transitioning to his role as a developer advocate. Jason and Aaron talk about the value of applying open source principles within organizations, or “innersource.” Their time ends with a discussion on continuing education and how to keep learning.
Jun 6, 2022   |  By Andre Newman
In their report titled “IT Resilience — 7 Tips for Improving Reliability, Tolerability and Disaster Recovery”, Gartner presents seven strategies for improving the resilience posture of your critical systems. These recommendations range from how to get started, to identifying IT hazards and risks to reliability, to capturing metrics and translating them into business value. In this blog, we’ll take a high-level look at the report and summarize some of its key findings.
May 31, 2022   |  By Jason Yee
In this episode, we chat with Cisco’s head of developer content, community, and events, Michael Chenetz. We discuss everything from KubeCon to kindness and Legos! Michael delves into some of the main themes he heard from creators at KubeCon, and we discuss methods for increasing adoption of new concepts in your organization. We have a conversation about attending live conferences, COVID protocol, and COVID shaming, and then we talk about how Legos can be used in talks to demonstrate concepts.
May 17, 2022   |  By Jason Yee
It’s time to shoot for the stars with Dan Isla, VP of Product at itopia, to talk about everything from the astronomical importance of reliability to time zones on Mars. Dan’s trajectory has been a propulsion of jobs bordering on science fiction, with a history at NASA, modernizing cloud computing for them, and loads more. Dan discusses the finite room for risk and failure in space travel with an anecdote from his work on Curiosity.
May 10, 2022   |  By Sydney Lesser
You might be familiar with GameDays at this point. From watching our Introduction to GameDay webinar, viewing our Demo video, and reading our tutorial, you’ve probably learned that GameDays were created with the goal of increasing reliability by purposely creating major failures on a regular basis. Better yet, perhaps your own team has run a GameDay and learned something new about their services’ behavior during failure scenarios.
May 5, 2022   |  By Andre Newman
Prioritizing reliability can be challenging for businesses. Although reliable systems and services are necessary for building customer trust and growing revenue, businesses also need to focus on initiatives such as developing new products and features. When determining which initiatives to prioritize and which ones to defer, it's understandable for business leaders to choose those that provide an obvious return.
May 3, 2022   |  By Julie Gunderson
Natalie Conklin, tamer of chaos and Head of Engineering here at Gremlin, joins us to talk about embracing change, working alongside each other, and building more reliable systems. Natalie has a talk coming up at DevOpsDays Boise which she has titled “Embracing Change Fearlessly.” Her talk is oriented around enabling teams to take calculated risks and having the guts to take those risks. Natalie spent time working in India, which helped solidify her “fearlessly” philosophy.
Apr 19, 2022   |  By Jason Yee
For this episode we’re continuing to “Build Things on Purpose” with JJ Tang, co-founder of Rootly, who joins us to talk about incident response, the tool he’s built, and his many lessons learned from incidents. Rootly aims to automate some of the more tedious work around incidents and to keep the response consistent. JJ chats about why he and his co-founder built Rootly, and the problems they’re trying to fix and eliminate when it comes to reliability.
May 10, 2022   |  By Gremlin
Learn all about Gremlin's GameDay feature in this webinar presented by Sydney Lesser and Andre Newman. GameDays are organized team events to proactively improve reliability using Chaos Engineering principles. Gremlin makes it easier than ever to prepare, execute, and learn from them. Increase your system’s reliability with safe, secure, and simple GameDays.
May 10, 2022   |  By Gremlin
Learn how to run a GameDay in Gremlin. This video walks you through creating a GameDay, adding and running Scenarios, recording your observations, and linking to Jira in the Gremlin web app.
Apr 20, 2022   |  By Gremlin
In this episode Julie and Jason share updates on the Atlassian outage, a new incident at Cerner, and problems at the IRS. They also cover post-incident investigations from Cloudflare and Datadog.
Apr 13, 2022   |  By Gremlin
In this episode, Julie and Jason cover recent outages of the Dutch NS trains, American Express, and the ongoing, long-running incident at Atlassian. In positive news, they cover the acquisitions of Puppet by Perforce and Chaos Native by Harness, and Grafana Labs' Series D funding.
Mar 30, 2022   |  By Gremlin
In this episode, Jason is joined by special guest Mandi Walls, DevOps Advocate at PagerDuty. They discuss certificate-related reliability issues, updates to GitHub's ongoing MySQL incidents, Log4j problems, and Pokemon.
Mar 23, 2022   |  By Gremlin
In this episode, we chat about GitHub's recent outage and dive into the incident report from their previous outage in February. We also discuss the latest npm controversy regarding open source, politics, and protests. Our final segment covers an update to a news piece that we featured in our very first episode.
Mar 29, 2022   |  By Gremlin
Gremlin provides a variety of ways to test the resilience of your systems, which we call "attacks". Running different attacks lets you uncover unexpected behaviors, validate resilience mechanisms, and improve the overall reliability of your systems and services. This ebook explains each of Gremlin's attacks in complete detail, including what each attack does, how it impacts your systems, and the technical and business objectives the attack helps solve.
Jul 25, 2020   |  By Gremlin
Learn the basics of Chaos Engineering: discover the tools, tests, and culture needed to create better software and prevent outages and downtime. This whitepaper provides a comprehensive introduction to the discipline of Chaos Engineering including why it is more needed than ever, how to get started, and best practices to maximize learnings and reduce risk.
Jul 25, 2020   |  By Gremlin
By following this guide, you'll successfully increase your organization's reliability with minimal effort and risk. This document will serve as your guide to implementing Chaos Engineering and Gremlin within your organization. From educating your team on the principles of Chaos Engineering to running automated experiments, this guide will walk through each stage of the adoption process in order to ensure a smooth and successful rollout.
Jul 25, 2020   |  By Gremlin
Amazon DynamoDB is fast, powerful, and intended for high availability. These are all valuable attributes in a data storage solution, but to be useful as advertised, it must be configured thoughtfully. Learn how to use Chaos Engineering to ensure DynamoDB performs the way you expect. Amazon DynamoDB is one of the most popular NoSQL databases and is the data store of choice for many teams running production workloads in AWS.
Jul 1, 2020   |  By Gremlin
Win over and convince your coworkers and management to explore and adopt Chaos Engineering and Site Reliability Engineering (SRE). The playbook provides ideas and techniques that can be used to articulate the need and benefits to internal stakeholders in your organization. It also guides the initial implementation in a way that will lead to success and growth across the organization. Implementing something new like Chaos Engineering successfully is a good way to get promoted and help the organization succeed, and this guide is here to help you.
Jul 1, 2020   |  By Gremlin
MongoDB is designed for performance, scale, and high availability. But, as with any software, you need to test your configuration to verify that it will work as advertised. Ensure that MongoDB performs the way you expect by using Chaos Engineering to test four key features. This guide includes four experiment tutorials to verify that MongoDB will perform reliably. To get the most out of MongoDB's rich features, including built-in data sharding and replication, it's crucial to test your configuration.

Gremlin aims to make the internet more reliable and prevent costly and reputation-damaging outages. Its failure-as-a-service platform empowers engineers to build more resilient systems through safe experimentation.

Downtime is expensive and can hurt your brand. Gremlin provides engineers with the framework to safely, securely, and easily simulate real outages with an ever-growing library of attacks. Turn failure into resilience with chaos engineering.

Build resilient infrastructure:

  • Resource Gremlins: Throttle CPU, Memory, I/O, and Disk.
  • State Gremlins: Reboot hosts, kill processes, travel in time.
  • Network Gremlins: Introduce latency, blackhole traffic, lose packets, fail DNS.

Test for application failure:

  • Test for failure in your code.
  • Fail or delay serverless functions.
  • Narrow the impact to a single user, device, or percentage of traffic.
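
As a rough illustration of the application-level ideas above (failing or delaying a function, and narrowing the impact to a percentage of users), here is a minimal Python sketch. It is not Gremlin's API: `InjectedFailure`, `in_blast_radius`, `with_fault`, and `get_profile` are hypothetical names invented for this example, and hashing the user ID is just one assumed way to pick a stable subset of users.

```python
import hashlib
import time


class InjectedFailure(Exception):
    """Hypothetical error raised in place of a real failure."""


def in_blast_radius(user_id: str, percent: float) -> bool:
    """Deterministically decide whether a user is in the affected group.

    Hashing the ID keeps the same users affected for the whole experiment
    instead of choosing new ones on every request.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5.0 covers buckets 0..499


def with_fault(func, *, percent=0.0, delay_s=0.0, fail=False):
    """Wrap func(user_id, ...) so targeted users see a delay and/or failure."""
    def wrapper(user_id, *args, **kwargs):
        if in_blast_radius(user_id, percent):
            if delay_s:
                time.sleep(delay_s)  # simulate a slow dependency
            if fail:
                raise InjectedFailure(f"injected failure for {user_id}")
        return func(user_id, *args, **kwargs)
    return wrapper


def get_profile(user_id):
    """Stand-in for a real request handler (assumption for the example)."""
    return {"user": user_id}


# Fail roughly 5% of users; everyone else gets the normal response.
flaky_get_profile = with_fault(get_profile, percent=5.0, fail=True)
```

Keeping the targeted percentage small limits how many users an experiment can affect, and setting `percent=0.0` turns the injection off without removing the wrapper.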

Avoid downtime. Use Gremlin to turn failure into resilience.