
4 Chaos Engineering recommendations from Gartner

Gartner recently published their annual Hype Cycle reports, including the Hype Cycle for Infrastructure Platforms. Designed to help heads of infrastructure and IT operations make informed decisions about infrastructure platforms, it includes over thirty different topics covering everything from platform engineering to distributed cloud to policy as code—including Chaos Engineering and Site Reliability Engineering.

Insights to keep AI applications reliable

AI has become a massive investment for companies. Engineering teams across industries are integrating AI into their products, whether through homegrown, self-managed models or third-party model integrations. But no matter how much AI shifts the user experience, it’s still an application, which means your engineering team still needs to operate it and keep it reliable. At the same time, AI applications add complexity and new failure modes that require a shift in your approach.

How to be prepared for cloud provider outages

GCP’s recent outage on June 12th was a reminder of just how interconnected modern architectures are. The 2-hour-28-minute outage affected dozens of companies and spanned more than 80 Google services and products. But what was really illuminating was how far the outage spread due to hidden dependency risks. Many companies that don’t run on GCP were startled to find their services suddenly affected because their own dependencies, or the vendors they relied on, used GCP.

Three key facts about serverless reliability

Serverless computing requires a significant shift in how organizations think about deploying and managing applications. No longer do Ops teams need to think about provisioning servers, installing operating system patches, and writing shell scripts to manage deployments. While serverless takes away much of this responsibility, one aspect still needs to be handled thoughtfully: reliability. In this blog, we’ll look at three important facts about serverless reliability that teams often overlook.

Ensuring your AI systems can scale to meet demand

The scale of traffic handled by AI systems is hard to overstate. Over half of all organizations in India, the UAE, Singapore, and China use AI, and traffic from generative AI sources has jumped by 1,200% since July 2024. While demand for AI-powered workloads is steadily increasing overall, traffic to individual AI providers is far less predictable. User demand spikes and wanes unexpectedly, yet, as with any service, users expect you to be available and responsive at all times.

How a major retailer tested critical serverless systems with Failure Flags

Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service of their payment applications could failover correctly between regions in case of an outage. But there was one snag: the service was built using AWS Lambda. This meant infrastructure-focused tests would have trouble replicating the failure conditions necessary to test the failover due to Lambda’s serverless model.
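To see why infrastructure-focused tests fall short here: with Lambda there is no host or container to attack, so the fault has to be injected inside the function’s own code path. The sketch below illustrates the general pattern of application-level fault injection in a serverless handler. It is a minimal, hypothetical example — the `FAULT_RATE` environment variable, `charge_payment`, and `handler` names are illustrative and are not the actual Failure Flags SDK API.

```python
import os
import random

# Illustrative fault toggle: 0.0 disables injection, 1.0 always injects.
# (Hypothetical name; real fault-injection SDKs manage this centrally.)
FAULT_RATE = float(os.environ.get("FAULT_RATE", "0"))


def charge_payment(event):
    """Stand-in for the critical payment logic under test."""
    return {"status": "charged", "order": event.get("order_id")}


def handler(event, context=None):
    """Lambda-style entry point with an in-process fault check.

    Because there is no server to degrade, the experiment runs inside
    the function: when active, it raises the failure the region
    failover is supposed to absorb.
    """
    if random.random() < FAULT_RATE:
        raise TimeoutError("injected fault: simulated regional outage")
    return charge_payment(event)
```

With the toggle at zero the handler behaves normally; flipping it on during an experiment lets the team observe whether traffic actually fails over to the healthy region.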

Simulating artificial intelligence service outages with Gremlin

The AI (artificial intelligence) landscape wouldn’t be where it is today without AI-as-a-service (AIaaS) providers like OpenAI, AWS, and Google Cloud. These companies have made running AI models as easy as clicking a button. As a result, more applications have been able to use AI services for data analysis, content generation, media production, and much more.

Three reliability best practices when using AI agents for coding

One of the biggest causes of outages and incidents is good old-fashioned human error. Despite all of our best intentions, we can still make mistakes, like forgetting to change defaults, making small typos, or leaving conflicting timeouts in the code. It’s why 27.8% of unplanned outages are caused by someone making a change to the environment. Fortunately, reliability testing can help you catch these errors before they cause outages.

How to make your AI-as-a-Service more resilient

When you think about “AI reliability,” what comes to mind? If you’re like most people, you’re probably thinking of generative AI model accuracy, like responses from ChatGPT, Stable Diffusion, and Sora. While this is certainly important, there’s an even more fundamental type of reliability: the reliability of the infrastructure that your AI models and applications are running on. AI infrastructure is complex, distributed, and automated, making it highly susceptible to failure.