Getting started with on-call

Setting up on-call is simpler than it seems. It comes down to a few clear decisions about your team and what your service actually needs. This guide walks you through those decisions. You’ll learn who to add to your rotation, how long shifts should last, when to hand off, and what coverage makes sense for your service. By the end, you’ll know exactly how to set up your first schedule and move from ad-hoc firefighting to organized incident response.

What is Runtime Context? A Practical Definition for the AI Era

TLDR: Runtime Context is live, execution-level access to a running production system. It lets engineers and AI agents ask precise questions of running code and get answers immediately, without redeploying or interrupting users. This is the new baseline for reliability.

Fleet Management and Terraform: Use cases and best practices for managing collectors in Grafana Cloud

Earlier this year we launched Grafana Cloud Fleet Management to address the pain that comes with managing scores of telemetry collectors across departments and environments. We've been excited to see how organizations are using it to manage collectors at scale, but we've also heard from users who aren't sure how Fleet Management fits with their existing infrastructure-as-code tooling. The good news is Fleet Management is designed specifically to complement—not replace—tools like Terraform.

Paginating large datasets in production: Why OFFSET fails and cursors win

The things that separate an MVP from a production-ready app are polish, final touches, and the Pareto ‘last 20%’ of work. Many of the bugs, edge cases, and performance issues will come to the surface after you launch, when the user stampede puts a serious strain on your application. If you’re reading this, you’re probably sitting on the 80% mark, ready to tackle the rest.

Service Desk Automation Playbook To Improve KPIs and Agent Morale

Service desk leaders are being asked to do more with less. Ticket volumes keep climbing. SLAs keep tightening. Headcount rarely follows. Dashboards fill up fast, and before long, every conversation seems to start with a metric that’s in the red. Automation is pitched as the answer. But when it’s introduced only as a way to move faster or cut costs, it can backfire.

InvGate Renews SOC 2 Type II Certification

We’re pleased to announce that InvGate has recently renewed the SOC 2 Type II certification for the 2024-2025 period. This achievement shows our commitment to the industry’s best data protection and compliance practices. The SOC 2 standard was developed by the American Institute of Certified Public Accountants (AICPA). It consists of a third-party audit that evaluates how companies worldwide handle data privacy. Keep reading to learn how the renewal impacts you.

Why container security only works when the platform owns it

Container security has finally gone mainstream. When Docker announced hardened container images in late 2025, complete with minimal attack surfaces, non-root defaults, continuous CVE scanning, and automated updates, the response was enthusiastic. For teams managing their own infrastructure, this was a real step forward. Secure-by-default containers are no longer niche or expensive. They are expected.

Time Series Meets Graph: Understanding Relationships in Streaming Data

Data systems rarely operate as isolated components. Machines depend on sensors, services rely on other services, and devices exchange data through shared gateways. When something changes, the impact often spreads beyond a single metric. To trace how changes move through complex systems, many teams turn to graph-style analysis to map dependencies and follow cause and effect.