
Office Hours: Get better reliability on AWS with our new release

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Cloud platforms make it easier than ever to deploy massively scalable, distributed workloads, but this is a double-edged sword. There are reliability challenges unique to the cloud that didn’t exist before. Failed migrations, recurring incidents, and reliability toil take their toll.

Achieving SLO Success with Golden Signals and Reliability Testing

The four Golden Signals are an easy and effective way to measure the most important aspects of a system, and when paired with a reliability management platform like Gremlin, they help you proactively meet your SLOs so you can meet your legal obligations and deliver the perfect customer experience.

5 essential resilience tests for a successful cloud migration

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Migrating to the cloud usually means faster deployments and easier scalability, but it also means latency. Cloud applications communicate over distributed networks, and while these networks are fast, little bits of latency can quickly add up.

Are you testing for known reliability vulnerabilities?

"Risks have different priorities, but ultimately we want to be aware of those risks. Just like we want our security team to go scan for known vulnerabilities, our reliability team should be scanning for known vulnerabilities. And those are easy things we should go address. There's a second part of it, which is kind of just good engineering testing, which is: Hey, we have a set of test cases that we know need to pass.

Build reliability efforts into your regular engineering schedule

Improving reliability might seem daunting, but you'd be surprised how much impact you can have with a relatively light lift. "Reliability doesn't need to be everybody stopped the world for a month, kind of a tech debt thing. If we spent 20 minutes a week, we could actually save ourselves a ton of time over the course of the year. The business needs to be efficient and agile, but it's important that the reliability is there. And so we really need people to be able to react quickly, adapt, and do a little bit along the way.

How to balance reliability with other DevOps priorities

Reliability efforts do take up some bandwidth, but in the end they're worth it, as our customers find out when their outage costs go down. "Everyone has their own priorities that they're dealing with. Given unlimited time and money, absolutely everyone would want to build the best possible system that is the most secure, performant, resilient, and everything.

How to Build Resilience Throughout Your SDLC: Lessons from a Top 10 Bank

Are your applications as reliable as you planned? How do you know? The only way to ensure systems are resilient to common failure conditions is to test them, yet many large enterprises struggle with the effort and expense to do so. In this webinar, Anantha Movva, a former head of SRE and Performance Engineering at one of the top 10 North American banks, will share how he drove Chaos Engineering and resilience testing adoption throughout his organization.

Software reliability and availability are the whole team's problem, not just a few engineers'

Reliability is everyone's problem—not just the SRE team's. "It's not just the SRE's problem. It's everybody's problem. So the SREs, they can run point and they can help report and help us understand, but we also have to hold the teams accountable. Are the teams investing time in reliability? Are they finding and fixing issues? Are we giving them space? And I think that comes back to, does the business see the benefit and do we have a good way of quantifying the benefit to the business?"—Kolton Andrus, Gremlin CTO.

Spend a little time on software reliability now instead of a lot of time later

You're going to spend time fixing reliability, but it's your choice whether that happens during an outage or ahead of time, on your schedule and at a lower cost. Which will you choose? "We all know when things go wrong, it cost us a million dollars and it was really bad. Let's have that never happen again. But when we say, I need every engineering team to spend one hour, one day a week on reliability, does everyone lose their mind, or is that a reasonable request? Can we amortize out the cost of that?

How to run fault injection tests on AWS managed services

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Fully-managed SaaS services offer incredible scalability and accessibility, but at a cost: they’re also single points of failure. If your application depends on a SaaS service and the service fails, guess who your customers will blame? We need to design applications to anticipate and work around managed service failures, but how do we do that without having to wait for the service to fail?