Latest Posts

Strategies for migrating to Kubernetes

Migrating to a new platform can often feel like navigating a maze of technical challenges, especially when the platform is as complex as Kubernetes. Kubernetes has a vast number of features designed to help with deploying and managing large applications, but learning how to use it effectively can be just as challenging as moving your workloads over. This doesn’t mean it’s impossible, of course, and there are several strategies for easing this process.

How reliability differs between monolithic and microservice-based architectures

Microservices have forever changed the way we build applications. Tools like Docker and Kubernetes made microservice-based architectures widely accessible to software developers, and cloud platforms like Amazon EKS made deploying containers fast and inexpensive. They've also enabled even small engineering teams to deploy code faster, leverage fault tolerance and redundancy, scale more efficiently, and take full ownership of their services from development all the way into production.

How to build zone-redundant cloud instances and clusters

Redundancy is a core tenet of cloud computing. While major cloud platforms have high targets for reliability, they can still fail, and it’s important for teams to have a plan for when they do. But how can you build services that can withstand something as disruptive as a datacenter outage? In this blog, we’ll show you how to prepare for availability zone outages by proactively detecting services operating in a single zone.

Five ways Gremlin helps organizations meet DORA requirements

Enacted by the European Union, the Digital Operational Resilience Act (DORA) establishes new standards for digital operational resilience in the financial sector. DORA changes the financial sector's approach to digital security and resilience by imposing stringent requirements for Information and Communication Technology (ICT) risk management, incident reporting, third-party risk management, and regular testing.

Three roles you need for reliability success

It’s one thing to say that reliability is a priority for your organization, and a whole other thing to make actual, demonstrable improvements in the availability of your applications. Sadly, it’s common for organizations to invest time, money, and effort into improving reliability only to barely nudge the needle on incidents and downtime. But there are hundreds of companies successfully improving their reliability posture—and doing it at enterprise scale.

How to build reliable services with unreliable dependencies

In an earlier blog, we looked at slow dependencies and how they can impact the reliability of other services. That post explored what happens when dependencies are degraded, but what happens when they fail outright? What can you do when your application or service sends a request to another service, and nothing comes back? We’ll answer this question by using Gremlin to proactively test a service with multiple dependencies.

How to make your services resilient to slow dependencies

When discussing reliability, we tend to focus on the things that we have control over: applications, virtual machine instances, deployment patterns, etc. But this ignores a significant and ever-growing part of nearly all modern software: dependencies. Dependencies are services that provide extra functionality for other services and applications. For instance, many websites depend on databases, caches, payment processors, and similar services in order to function.

Hitting reliability goals in the face of layoffs

It’s never easy when layoffs hit your organization. In addition to the personal impact of losing friends and coworkers from your team, those who remain are left trying to achieve the same business goals with fewer people and resources. Unfortunately, layoffs and restructuring have become a common part of business. But you’re not alone. Your partners (including Gremlin) are here to help you navigate your new reality.

How to ensure your Kubernetes Pods and containers can restart automatically

As complex as Kubernetes is, much of it can be distilled to one simple question: how do we keep containers available for as long as possible? All of the various utilities, features, platform integrations, and observability tools surrounding Kubernetes tend to serve this one goal. Unfortunately, this also means there’s a lot of complexity and confusion surrounding this topic. After all, most people would agree that availability is important, but how exactly do you go about achieving it?

How to ensure your Kubernetes cluster can tolerate lost nodes

Redundancy is a core strength of Kubernetes. Whenever a component fails, such as a Pod or deployment, Kubernetes can usually automatically detect and replace it without any human intervention. This saves DevOps teams a ton of time and lets them focus on developing and deploying applications, rather than managing infrastructure.