Operations | Monitoring | ITSM | DevOps | Cloud

Latest Videos

How to test your systems for scalability and redundancy with Fault Injection

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Do you know if your services can tolerate losing a node? What about an entire availability zone? Or a region?‍ Large-scale outages aren’t unheard of. When you’re running critical services, it’s vital that those services can keep running even if an AZ or region fails. In addition to failing over, these services also need to scale quickly so traffic shifts don’t overwhelm your systems. How do you prove that a service is both scalable and redundant? The answer is with Fault Injection.

How to find Kubernetes reliability risks with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.

How to find and test critical dependencies with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Pop quiz—what are all of the dependencies your services rely on? If you’re like most engineers, you probably struggled to come up with the answer. Modern applications are complex and rely on dozens (if not hundreds) of dependencies. Many teams rely on spreadsheets, but manual processes like these break down over time. What if you had a tool that found and tracked dependencies for you?

Kubernetes Reliability Risks: How to monitor for critical issues at scale

Learn how to automatically find and fix the most critical Kubernetes reliability risks in enterprise organizations. Recent research shows that nearly every organization has reliability risks in their Kubernetes clusters. Many of them are caused by simple misconfiguration, but they can have devastating consequences—including taking critical services offline. And while you could manually review every Kubernetes deployment, the speed and scale at which most organizations deploy to Kubernetes makes that impractical.

Building a Culture of Reliability: Why SREs Can't Do It Alone

Join Gremlin CTO and Founder Kolton Andrus to hear practical strategies for building a collaborative culture of reliability. High-velocity DevOps orgs and complex cloud-native architectures have made reliability harder than ever. Organizations are turning to SREs to make sure systems are reliable, but with so many stakeholders and competing priorities, many companies are still struggling to get ahead of the outages and incidents—SREs simply can't do it all by themselves.

What is Gremlin?

Today’s technology leaders are facing a reliability gap. Customers expect their apps to be fast and available. But with Devops and distributed systems driving more speed and complexity, it’s harder than ever to find and fix the reliability risks that can impact customer experience–before it’s too late. To close the Reliability gap, we need a reliability strategy. One that’s proactive, measurable, built-in and automated. We need a reliability platform.

Enterprise Chaos Engineering Certification Prep Session

Demonstrate your reliability expertise, increase your visibility, and advance your career with a Gremlin Enterprise Chaos Engineering certification. Chaos Engineering continues to grow in popularity and is rapidly becoming a job requirement for Engineering teams focused on reliability. In this webinar, Sr. Reliability Specialist Andre Newman goes over the mindset shifts, best practices, and key information you need to prep for your certification.

More Reliability, Less Firefighting: How to Build a Proactive Reliability Program

Does it feel like your team spends all its time putting out incident fires? Change the story with a proactive reliability program that actively improves reliability. Join reliability expert and engineering leader Jeff Nickoloff for a webinar that lays out the common traits for successful reliability programs so you can build more reliability and spend less time firefighting. You’ll also get a downloadable checklist worksheet to help you create and evaluate your reliability program.

How Detected Risks helps you find reliability risks in minutes-without running any tests

This video showcases Gremlin's Detected Risks feature. Detected risks are high-priority reliability concerns that Gremlin automatically identifies in an environment. These include misconfigurations, bad default values, and reliability anti-patterns. Gremlin prioritizes these risks based on severity and impact, giving instantaneous feedback on risks and action items to improve the reliability and stability of each service.