The latest News and Information on DevOps, CI/CD, Automation and related technologies.

Millions of Metrics. Zero Clarity.

Millions of metrics. Zero clarity. That’s the reality many IT teams are facing today. As environments grow more complex, telemetry explodes. Millions of records generated every hour. Dozens of specialized tools for network, storage, Kubernetes, cloud, AI workloads. Each tool is good at its domain. But none of them answers the real question: Where should I focus right now? Fragmented visibility creates predictable failure modes.

Keeping it boring: the incident.io technology stack

At incident.io we run a deliberately simple technology stack. Keeping things boring has allowed us to scale from a few hundred customers to several thousand with only two platform engineers. In this post I'll walk through the stack, explain some of the choices we've made, and touch on the challenges we're facing as we grow.

Ghosts of Servers Past: The Bare-Metal Comeback Story

Bare-metal. Just reading that word might trigger a physical reaction for some of us. Dusty closets, old server rooms, and loud rigs that never seemed to work quite right. Remember waiting days for IT to provision a server, only to realize your ticket got lost in the shuffle? Or the classic "well, it worked on my machine" excuse right before a production push? Ah, the good old days.

What is an escalation policy? (And why every team needs one)

An escalation policy is the route an incident takes after it triggers. It lays out who gets alerted first and sets a wait time. If nobody responds, it moves the incident forward to the next person. The word “escalation” is worth pausing on. When an incident triggers and the first person doesn’t respond, the incident doesn’t sit and wait. It moves to the next person and keeps moving until someone picks it up. That forward movement is the escalation.
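
To make that forward movement concrete, here is a minimal sketch of an escalation policy as an ordered list of steps, each with a responder and a wait time. The names `EscalationStep` and `escalate`, the responder labels, and the `ack_delay` mapping are all hypothetical and chosen for illustration; this is not any vendor's API. It also simplifies by assuming a responder either acknowledges within their own wait window or not at all.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class EscalationStep:
    responder: str      # who gets alerted at this step (hypothetical label)
    wait_minutes: int   # how long to wait before moving to the next step


def escalate(
    steps: List[EscalationStep],
    ack_delay: Dict[str, int],
) -> Tuple[Optional[str], List[Tuple[int, str]]]:
    """Walk the policy in order until someone acknowledges.

    ack_delay maps a responder to the minutes they take to acknowledge
    after being alerted; a responder missing from the map never responds.
    Returns (who acknowledged or None, timeline of (minute, responder) alerts).
    """
    clock = 0
    timeline: List[Tuple[int, str]] = []
    for step in steps:
        timeline.append((clock, step.responder))
        delay = ack_delay.get(step.responder)
        if delay is not None and delay <= step.wait_minutes:
            return step.responder, timeline
        # Nobody responded in time, so the incident moves forward.
        clock += step.wait_minutes
    return None, timeline


policy = [
    EscalationStep("primary on-call", 5),
    EscalationStep("secondary on-call", 10),
    EscalationStep("engineering manager", 15),
]

# The primary never responds; the secondary acknowledges 3 minutes after being alerted.
acknowledged_by, alerts = escalate(policy, {"secondary on-call": 3})
print(acknowledged_by)  # secondary on-call
print(alerts)           # [(0, 'primary on-call'), (5, 'secondary on-call')]
```

In this sketch, a policy that runs out of steps returns `None`, which is the case a real policy would typically guard against with a final catch-all step or by repeating the sequence.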

[Webinar] Conquering the Complexity of Self-Hosted Apps with Agentic AI SRE

Most enterprise SaaS products, like Komodor’s Autonomous AI SRE Platform, require installing a remote agent on the customer’s infrastructure, which varies significantly from one organization to another in architecture, configuration, permissions, processes, and more. This “unmanaged” model creates major blind spots, making daily operations, observability, debugging, and incident response challenging. When failures occur, limited visibility and bespoke systems make root-cause analysis slow, incomplete, or impossible.

The Ultimate Kubernetes Cost Monitoring And Management Guide

While Kubernetes enables teams to deliver more value faster, understanding and controlling Kubernetes costs remains challenging. You have disposable, replaceable compute resources constantly coming and going across a range of infrastructure types. Yet at the end of the month, you only get a billing line item for EKS costs and several EC2 instances.

Scaling Argo CD Past 50 Clusters: GitOps, Pipelines, & Governance

Is your engineering team hitting the "Argo Ceiling"? Argo CD is incredible at syncing state, but as you scale past 20, 50, or 100 clusters, the maintenance tax skyrockets. In this webinar, we break down why the "hub and spoke" model of GitOps creates isolated silos, leading to "tab fatigue," massive security blast radiuses, and the need for thousands of lines of brittle CI "glue code" just to handle basic release orchestration.

Resolve Webinar: Automating Joiner, Mover, and Leaver Workflows with Agentic Orchestration

Still managing Joiner, Mover, Leaver workflows with tickets and manual handoffs? It’s time to automate them. In this fast-paced session, we show how enterprises use agentic AI and orchestration to eliminate repetitive JML tickets, enforce policy automatically, and deliver secure access from day one. Powered by Resolve’s Agentic Resolution Fabric, AI agents coordinate knowledge, automation, and technician assist to provision, modify, and revoke access without manual triage. Faster onboarding. Zero-delay changes. Audit-ready offboarding.

Resolve's Agents of IT podcast - Ep. 13 - Ari's Secret Fortune (500)

In this episode of Agents of IT, we reflect on scale, strategy, and what it really takes to transform IT inside the world’s largest enterprises. Ari shares lessons from his work with Fortune 500 organizations, breaking down what separates automation pilots from true operational change. From navigating enterprise complexity to driving executive alignment, this conversation explores how large organizations move from reactive ticket management to intelligent, agent-driven operations.