Operations | Monitoring | ITSM | DevOps | Cloud

How to find Kubernetes reliability risks with Gremlin

Part of the Gremlin Office Hours series: A monthly deep dive with Gremlin experts. Most Kubernetes clusters have reliability risks lurking just below the surface. You could spend hours or even days manually finding these risks, but what if someone could find them for you? With Detected Risks, Gremlin automates the work involved in finding and tracking reliability risks across your Kubernetes clusters. Surface failed Pods, mismatched image versions, missing resource definitions, and single points of failure, all without having to run a single test.

Three key facts about serverless reliability

Serverless computing requires a significant shift in how organizations think about deploying and managing applications. No longer do Ops teams need to think about provisioning servers, installing operating system patches, and writing shell scripts to manage deployments. While serverless takes away much of this responsibility, one aspect still needs to be handled thoughtfully: reliability. In this blog, we’ll look at three important facts about serverless reliability that teams often overlook.

Ensuring your AI systems can scale to meet demand

The amount of traffic handled by AI systems can’t be overstated. Over half of all organizations in India, the UAE, Singapore, and China use AI, and traffic from generative AI sources jumped by 1,200% since July 2024. While demand for AI-powered workloads is steadily increasing overall, traffic to individual AI providers is much more unpredictable. User demand spikes and wanes unexpectedly, but like any service, users expect you to always be available and responsive.