
What to expect in a Gremlin workshop

Gremlin workshops give your team hands-on training with Gremlin so they can get real results and dramatically improve your reliability. Full transcript: The goal of our workshops is really to accelerate you and the team in your reliability journey. Whether you're starting out for the first time, or you're a more advanced user, this workshop is really designed for you to take you to the next level.

Lessons from Alaska's outage: Redundant ≠ resilient

Last Sunday, Alaska Airlines suffered a three-hour outage that led to more than 200 flight cancellations and disrupted 15,600 passengers. The culprit? “A critical piece of multi-redundant hardware at our data centers, manufactured by a third-party, experienced an unexpected failure. When that happened, it impacted several of our key systems that enable us to run various operations, necessitating the implementation of a ground stop to keep aircraft in position.”

Measure your reliability risk, not your engineers

Do you know the current reliability risk of your systems? Do you know right now how your services will react to common failures like a dependency going down? Sadly, most organizations don’t have answers to these questions, relying on QA tests and the skill of their engineers to deploy code they assume won’t break. But this is a process problem, which means you can’t hire your way out of it.

Reliability is about more than uptime

Reliability results are about more than whether your application is up: they're about proactive measurement and keeping it up. Full transcript: Reliability results in my earlier career was, "Is there any downtime? Are there any errors that are getting thrown?" It's not a proactive way to measure your reliability. If you're measuring it in time of production, it's not gonna be an accurate reflection of what your reliability is. The way that my mindset has changed over time has been a proactive measurement. Before we ship something out, is this gonna be reliable from the start?

How to ensure your AWS workloads are resilient

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Cloud providers like AWS give you plenty of tools to make your workloads more resilient, but it’s up to you to apply them. However, considering how complex some of these tools are, where do you start? And how can you be sure your systems are more reliable as a result?

Reliability isn't a metric, it's a mindset

As someone with Type 1 diabetes, reliability is a way of life for Nick Mason, Sr. Solutions Architect at Gremlin. Full transcript: Reliability isn't just a metric, to me, it's a mindset. As someone that works in site reliability engineering and also someone who lives with type one diabetes, the concept of reliability is deeply personal to me. In tech, reliability means building systems that are going to recover gracefully and in life with a chronic condition like diabetes, it's the same thing.

Reliability means being there right when your customer needs you

When your systems are reliable, it means your customers can count on your applications to be there for them. Full transcript: To me reliability means a good night's sleep, and being able to confidently go to bed and wake up the next day feeling ready to get out there and do my best work and not worry about the experience that our customers might have had through the night.

4 Chaos Engineering recommendations from Gartner

Gartner recently published their annual Hype Cycle reports, including the Hype Cycle for Infrastructure Platforms. Designed to help heads of infrastructure and IT operations make informed decisions about infrastructure platforms, it includes over thirty different topics covering everything from platform engineering to distributed cloud to policy as code—including Chaos Engineering and Site Reliability Engineering.

Why we're talking to people about reliability

Reliability means a lot of things to a lot of people, but it’s also essential for every digital business. That’s why we’re talking to reliability experts from all over to find out what reliability means to them and how you can improve it. Transcript: You know, we're all out here building and operating digital businesses and like nobody's talking about reliability enough. We gotta talk about it. I can't stop talking about it and I've been on call for like 20 years.

Insights to keep AI applications reliable

AI has become a massive investment for companies. Engineering teams across industries are integrating AI into their products, whether it’s through homegrown, self-managed models or third-party model integrations. But no matter how much AI shifts the user experience, it’s still an application, which means your engineering team still needs to operate it and keep it reliable. At the same time, AI applications add complexity and complications that require a shift in your approach.