Operations | Monitoring | ITSM | DevOps | Cloud

How to run an operational excellence review for software engineering

Most engineering organizations already run something they call an operational review. It usually looks like a cousin of the quarterly business review: a deck assembled every few months, walked through team by team, anchored on whatever incidents happened to land in the previous quarter. By the time leadership sees the data, the systems it describes have moved on and the next set of risks is already accumulating in the gap.

Tapirs, Trainings, and Team Dinners: My First Kentik Meetup

Gavin joined Kentik’s People Ops team less than a year ago, so when April brought his first team offsite and his first HR conference in San Diego, it was a lot of firsts at once. He writes about meeting his colleagues face to face for the first time, what he took away from HRA 26, and his new appreciation for tapirs.

The Illusion of Control: Why Dashboards Do Not Equal SLA Protection

Modern operations teams work within a constant stream of dashboards, status summaries, and health indicators that turn complex environments into organized visual displays. Large screens show color-coded service conditions. Executive reports quantify uptime. Observability platforms map system dependencies across cloud, hybrid, and distributed architectures. This visual structure creates a sense of order. In environments defined by constant change, that sense of order can feel like control.

The Invisible IT Department: How to Deliver Friction-Free Experiences with Agentic AI

Every enterprise has bought AI, but many are still waiting for their investment to pay off. Ivanti’s 2026 AI Maturity Report found that only 2% of organizations say they currently have no AI use at all. As the majority of organizations move beyond the AI experimentation stage, the real competitive differentiator is whether that AI delivers continuous business value at scale.

So you need to add microcontrollers to your fleet: now what?

Your Ubuntu Core fleet is running beautifully. OTA updates roll out in minutes. Every device is strictly confined, cryptographically attested, and carrying a 10- to 15-year long-term support (LTS) commitment. The operations team sleeps soundly. Then the product roadmap meeting happens. The industrial floor needs vibration sensors on every motor. The smart building needs temperature nodes in every room. The cold chain system requires dozens of low-power Bluetooth tags. And someone just said the words.

The Data Plane Reality: OTel Scales, While Topology UX Lags

OpenTelemetry won the architectural standards battle. At scale, though, telemetry breaks more like plumbing than code. It breaks quietly, across a graph, with a blast radius you don’t understand until it’s expensive. With over 65% of organizations now running more than 10 collectors in production, hybrid deployments across Kubernetes and VMs are accelerating fast. Telemetry standardization is no longer a project milestone. It is a baseline expectation.

Service Level Agreement (SLA) Templates: Examples, Metrics, and Best Practices

How quickly should your team resolve a critical ticket, and what are the consequences when it misses the target? That is exactly where Service Level Agreements (SLAs) come into play. An SLA turns service expectations into measurable commitments by defining clear response and resolution targets. Rather than starting from scratch, an SLA template provides a structured foundation for establishing those commitments and tracking performance against agreed standards. Why does that matter?
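
As a rough illustration of what "response and resolution targets" can look like once they are written down as measurable commitments, here is a minimal Python sketch. The names SlaTarget, SLA_TEMPLATE, and sla_breaches, the priority levels, and every time value are hypothetical and chosen only for this example; real targets come from whatever the parties have agreed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SlaTarget:
    """Hypothetical response and resolution targets for one priority level."""
    response: timedelta
    resolution: timedelta


# Illustrative values only; real targets come from the agreed SLA.
SLA_TEMPLATE = {
    "critical": SlaTarget(response=timedelta(minutes=15), resolution=timedelta(hours=4)),
    "high": SlaTarget(response=timedelta(hours=1), resolution=timedelta(hours=8)),
}


def sla_breaches(priority, opened, first_response, resolved):
    """Return the names of the targets this ticket missed, in a fixed order."""
    target = SLA_TEMPLATE[priority]
    missed = []
    if first_response - opened > target.response:
        missed.append("response")
    if resolved - opened > target.resolution:
        missed.append("resolution")
    return missed


# A critical ticket answered in 10 minutes but resolved after 5 hours.
opened = datetime(2024, 1, 1, 9, 0)
result = sla_breaches(
    "critical",
    opened,
    first_response=opened + timedelta(minutes=10),
    resolved=opened + timedelta(hours=5),
)
print(result)
```

Keeping the targets in one table per priority level means a report on missed commitments reads from the same definitions the team agreed to, rather than from thresholds scattered across tooling.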

Agent Timeline Is Now Generally Available

A few weeks ago I wrote about a customer’s refund request that stopped halfway through at 11:47 p.m. on a Tuesday night. That post walked through the 40 minutes it took to work out what happened when an agentic application had a problem: a tool retried against a rate-limited payments API, the error responses filled up the context window, and the agent gave up. The whole reason we built Agent Timeline was to turn that 40 minutes into five. To reduce MTTR. To solve the problem and get back to sleep.