AI reliability needs system reliability
AI operates on the same systems and infrastructure as every other application, which means that if you want to keep AI reliable, you have to keep the systems underneath it reliable. Gremlin CEO Kolton Andrus explains more in this clip from an AI reliability roundtable with @nobl9inc and @Pagerduty.
FULL TRANSCRIPT:
There are two flavors. There's: are we using AI to help us operate, and are we just operating the bits that are running on the box? And I think the good news, from a we're-operating-the-bits-on-the-box point of view: computers are mostly the same. CPU, memory, disk I/O, network.
Those are the things that can go wrong. Calling out to a publicly hosted model? That's just another dependency.
So what happens when it gets slow? What happens when it fails? What happens when it doesn't respond? I think we know how to handle those: circuit breakers and fallbacks and backoffs that allow us to be intelligent about it. For our own internally hosted models: what is the traffic pattern, what does the load look like, what happens when customers do odd things? I think there's probably more we need to learn on that front. Once you've deployed it into production, you're gonna learn a whole new set of lessons about how it actually behaves when things are happening.
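The patterns named above (circuit breakers, fallbacks, backoffs) can be sketched together. This is a minimal illustration, not any particular library's API: `call_model` and `fallback` are hypothetical stand-ins for a call to an external model and a degraded local answer.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; stays open for `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Half-open: let one trial request through after the cooldown.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


def call_with_resilience(call_model, fallback, breaker, retries=2, base_delay=0.1):
    """Call the model with exponential backoff; on repeated failure, trip the breaker and fall back."""
    if not breaker.allow():
        return fallback()  # breaker is open: don't even try the dependency
    for attempt in range(retries + 1):
        try:
            result = call_model()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
            if attempt < retries:
                time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return fallback()
```

The point of combining the three: backoff handles transient slowness, the breaker stops hammering a dependency that is clearly down, and the fallback keeps the product answering (perhaps with a cached or degraded response) while the dependency recovers.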
So I guess my tl;dr is: yeah, the bits are roughly the same. We're talking over the same channels, and that gives us the same rough shape. So from that side, I don't think we should be too daunted.