
Test for the common failures that cause 80% of outages with Gremlin

80% of failures at the infrastructure layer come from the same core gaps in reliability. Jeff Nickoloff, Gremlin Principal Engineer, goes over how Reliability Management test suites help improve reliability across your organization. Are you waiting for the other reliability shoe to drop and hoping that you actually fixed core resilience issues? Or do you know for sure that you’re resilient to common reliability issues?

Now in private beta: Gremlin Service Mesh Extension

Service meshes like Istio have become an essential way to securely and reliably distribute network traffic, especially with ephemeral, service-based architectures such as Kubernetes. However, their constantly shifting nature can interfere with targeting specific services for resilience tests. Infrastructure-based testing is designed to target specific IP addresses, allowing precision testing of applications, VMs, and nodes.

Release Roundup November 2024: Reliability in the serverless and AI era

2024 is coming to a close, and while many teams are slowing down in preparation for the holidays, we’ve been cooking up tons of new features. We’ve extended our platform support to the Istio service mesh, added a brand new experiment type for testing artificial intelligence (AI) and large language model (LLM) workloads, and made it easier to onboard Kubernetes clusters. We’ve also made our Linux and Windows agents more robust and performant.

Reliable AI models, simulations, and more with Gremlin's GPU experiment

Note: This blog uses “GPU” to refer to the entire processing circuit, including the GPU processor, video memory, and other supporting hardware. Artificial Intelligence (AI) has become one of the biggest tech trends in years. From generating full movies to updating its own code, AI is performing tasks that were once science fiction.

Integrating Gremlin with your observability tools

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. To get the most value out of Chaos Engineering and reliability testing, you need a way to observe your service’s behavior. Observability tools offer insight into how your systems are performing, but observability on its own isn’t enough. You need a way to monitor your systems while testing their reliability so you can determine whether your service passed or failed a test.

Building Resilience from Architecture to Production with AWS & Gremlin

Unreliable software can have a painful impact on your customers and your business—something we’ve all seen and felt during high-profile outages. And while building on the cloud with AWS unlocks improved scaling and reliability capabilities, the complexity of modern distributed systems can potentially introduce outage-causing reliability risks. How can you be sure your systems are resilient to failure when they’re based on complex architecture, built by hundreds of teams, and are being updated almost constantly?

How reliability engineering can verify disaster recovery plans

Disaster recovery plans have always been a crucial part of businesses—especially essential services like banks. These plans help keep your business up and running during a disaster or extreme scenario so you can be there for your customers when they need you the most.

Three serverless reliability risks you can solve today using Failure Flags

Serverless platforms make it incredibly easy to deploy applications. You can take raw code, push it up to a service like AWS Lambda, and have a running application in just a few seconds. The serverless platform provider assumes responsibility for hosting and operating the platform, freeing you up to focus on your application. Naturally, this raises a question: if something goes wrong, who’s responsible?

Best Practices for Testing Zone Redundancy

As the story goes, Amazon used to cut power to data centers to see whether its services were actually redundant across them, and abandoned the practice only when EC2 customers started to complain (no matter how many times they were warned their instances might disappear without notice). This story may be apocryphal, but power loss is not the only thing that can take a data center down.
Sponsored Post

Top 7 Kubernetes Chaos Engineering Tools

Building highly resilient Kubernetes deployments is crucial for ensuring that the applications you host in Kubernetes can manage and recover from disruptions, which is vital for maintaining continuous availability for your customers. How much resilience matters in your distributed system also depends on your customer base and how critical your application is. Even brief periods of downtime can have a significant negative impact on your business.