Operations | Monitoring | ITSM | DevOps | Cloud

What is Runtime Context? A Practical Definition for the AI Era

TLDR: Runtime Context is live, execution-level access to a running production system. It lets engineers and AI agents ask precise questions of running code and get answers immediately, without redeploying or interrupting users. This is the new baseline for reliability.

Fleet Management and Terraform: Use cases and best practices for managing collectors in Grafana Cloud

Earlier this year we launched Grafana Cloud Fleet Management to address the pain that comes with managing scores of telemetry collectors across departments and environments. We've been excited to see how organizations are using it to manage collectors at scale, but we've also heard from users who aren't sure how Fleet Management fits with their existing infrastructure-as-code tooling. The good news is Fleet Management is designed specifically to complement—not replace—tools like Terraform.

Paginating large datasets in production: Why OFFSET fails and cursors win

The things that separate an MVP from a production-ready app are polish, final touches, and the Pareto ‘last 20%’ of work. Many of the bugs, edge cases, and performance issues will come to the surface after you launch, when the user stampede puts a serious strain on your application. If you’re reading this, you’re probably sitting on the 80% mark, ready to tackle the rest.

Time Series Meets Graph: Understanding Relationships in Streaming Data

Data systems rarely operate as isolated components. Machines depend on sensors, services rely on other services, and devices exchange data through shared gateways. When something changes, the impact often spreads beyond a single metric. To trace how changes move through complex systems, many teams turn to graph-style analysis to map dependencies and follow cause and effect.

The fragile web: 2025's lessons on uptime, reality, and engineering rigor

If you are into IT operations or leadership, you likely spent at least one weekend in 2025 huddled over a laptop while the rest of the world slept. For the last decade, our industry has pursued five nines (99.999% uptime) as the holy grail. We architected redundant systems, deployed across multiple availability zones, and optimized our code until it hummed. We convinced ourselves that if we just engineered hard enough, we could tame the chaos of the internet. We thought we could. We really did.

When AI Speeds Up Change, Knowing First Becomes the Constraint

In a recent post, I argued that AI doesn’t fix weak engineering processes; rather it amplifies them. Strong review practices, clear ownership, and solid fundamentals still matter just as much when code is AI-assisted as when it’s not. That post sparked a follow-up question in the comments that’s worth sitting with: With AI speeding things up, how do teams realise something’s gone wrong before users do? It’s the right question to ask next.

How to Monitor SaaS Status in 2026 : A Complete Guide

This is an updated and expanded version of the older guide. According to the 2025 State of SaaS report, organizations use an average of 106 SaaS apps. Staying on top of your SaaS vendors' status is as important as monitoring your own services. The Cloudflare, AWS, Azure, and Google Cloud outages in 2025 were strong reminders of this fact.