Reliability is more important than ever-are you ready?

Jun 6, 2024 By Gremlin In Gremlin

Reliability and resiliency are getting more and more important. Is your organization ready? "Our digital infrastructure is going to be almost as important as our physical infrastructure. And when it fails, it's going to be a big deal. Like when a huge bank has a multi-day outage, when it impacts travel, safety, military, finance, government, those things are going to be much more important than they have been in the past.

The CTO is responsible for reliability and availability

May 30, 2024 By Gremlin In Gremlin

Who's ultimately responsible for reliability? "You need an executive champion that cares about this. And to me, it's the CTO. The CTO is responsible for the quality of the code that you're writing, the quality of the customer experience, the quality of the product. And so, you know, your software doesn't work. The quality is zero. Not half points here. If you can't use it, it doesn't work.

How Nagarro used Gremlin to prevent a cascading failure outage

May 28, 2024 By Gremlin In Gremlin

Check out how Nagarro used Gremlin to help a client prevent a cascading failure before it caused an outage. "Once we had tested a critical software that was doing millions of online transactions on a daily basis. The design was fail safe, providing redundancy on critical services by having multiple instances deployed on different VMs. What we did was we ran a virtual machine terminate test to bring down an instance of that service with the hypothesis that it will recover automatically. Well, the service did recover automatically, but the system saw a cascading failure.

Strategies for migrating to Kubernetes

May 24, 2024 By Andre Newman In Gremlin

Migrating to a new platform can often feel like navigating a maze of technical challenges, especially when the platform is as complex as Kubernetes. Kubernetes has a vast number of features designed to help with deploying and managing large applications, but learning how to use it effectively can be just as challenging as moving your workloads over. This doesn’t mean it’s impossible, of course, and there are several strategies for easing this process.

Amazon makes reliability a priority-do you?

May 23, 2024 By Gremlin In Gremlin

Are you really making reliability a priority? Or are you just paying it lip service? "At Amazon, I was part of the retail website. Outages were lost money, lost money was bad. So Amazon cared deeply about this. That was part of it. The other part was it was part of the engineering culture. When I arrived, one of the things I was told was, we expect you to write high quality, performant, efficient, available code. It's just everybody.

Battletesting Coroot with OpenTelemetry Demo and Chaos Mesh

May 22, 2024 By Nikolay Sivko In Coroot

The most effective method for evaluating an observability tool is to introduce a failure intentionally into a fairly complex system, and then observe how quickly the tool detects the root cause. We’ve built Coroot based on the belief that having high-quality telemetry data enables us to automatically pinpoint the root causes for over 80% of outages with precision. But you don’t have to take our word for it—put it to the test yourself!

How reliability differs between monolithic and microservice-based architectures

May 14, 2024 By Andre Newman In Gremlin

Microservices have forever changed the way we build applications. Tools like Docker and Kubernetes made microservice-based architectures widely accessible to software developers, and cloud platforms like Amazon EKS made deploying containers fast and inexpensive. They've also enabled even small engineering teams to deploy code faster, leverage fault tolerance and redundancy, scale more efficiently, and take full ownership of their services from development all the way into production.

How to run Chaos Engineering experiments in your CI/CD pipeline

May 10, 2024 By Gremlin In Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Ad-hoc Chaos Engineering experiments are great for learning more about how your systems work, but they don’t tell you how your systems behave over time. As new features get deployed, environments change, and regressions get introduced, even the most resilient systems can gain reliability risks. QA and performance testing are already built into CI/CD - why not reliability?

How to build zone-redundant cloud instances and clusters

May 9, 2024 By Andre Newman In Gremlin

Redundancy is a core tenet of cloud computing. While major cloud platforms have high targets for reliability, they can still fail, and it’s important for teams to have a plan for when they do. But how can you build services that can withstand something as disruptive as a datacenter outage? In this blog, we’ll show you how to prepare for availability zone outages by proactively detecting services operating in a single zone.

Five ways Gremlin helps organizations meet DORA requirements

May 7, 2024 By Ryan Detwiller In Gremlin

Enacted by the European Union, the Digital Operational Resilience Act (DORA) establishes new standards for digital operational resilience in the financial sector. DORA changes the financial sector's approach to digital security and resilience by imposing stringent requirements for Information and Communication Technology (ICT) risk management, incident reporting, third-party risk management, and regular testing.
