What Is Mean Time to Resolve (MTTR)? (And How to Improve It)

Every minute a network incident goes unresolved costs your company money: lost productivity, missed SLAs, degraded user experience, and, in some cases, direct revenue loss. For IT teams and network admins, the pressure to resolve incidents fast isn't just operational; it's existential.

From Keyword Search to Ask AI: How We Upgraded AppSignal's Docs Experience

Documentation search is often the last thing devs think about, until someone posts publicly that they couldn't find a basic answer, or your support queue fills up with things that are genuinely in the docs. We decided to get ahead of that. This is the story of how we went from a minimal keyword-only search on our docs to a conversational Ask AI experience.

Fixing Broken Traces in GCP Cloud Run: A Custom OpenTelemetry Propagator

GCP's load balancer silently rewrites your traceparent header, orphaning spans in any OTLP backend. Here's the custom propagator that fixes it. Prathamesh works as an evangelist at Last9, runs SRE stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of terms related to observability.

The Hidden Cost of DIY DevOps: Why Growing Companies Bring in the Experts

Companies are scaling faster than ever, but infrastructure rarely keeps up with the product. When developers take on operational work on top of everything else, it feels like a smart way to cut costs. In practice, it's one of the most expensive mistakes a growing software team can make. This article breaks down what DIY DevOps actually costs and how a structured approach changes the equation.

Top tips: When leaders leave, here's how to keep your IT systems stable

Top Tips is a weekly column where we look at what’s shaping the tech world and share practical ways teams can stay prepared for what’s next. This week, we’re focusing on a situation many teams underestimate—what happens to your IT systems when a key leader steps away, and how you can build stability that doesn’t rely on any one person. Some problems don’t show up when things are running smoothly. They show up when someone leaves.

Beyond Uptime: Building a Self-Healing OpenClaw Observability Stack

The allure of OpenClaw is undeniable. You deploy a highly autonomous, self-hosted AI agent, give it access to your repositories and inboxes, and watch it reason through complex workflows while you sleep. It is the dream of the ultimate 10x developer tool realized. But as any veteran DevOps engineer will tell you: running an LLM-backed Node.js agent in production is vastly different from testing it on your local machine.

The product signal latency gap slowing your growth

Organizations often call product managers the CEOs of the product. But PMs know that’s a myth. When a CEO wants a status report, they get one immediately. They don’t need to negotiate for engineering time, reconcile conflicting project priorities, or wait for a data scientist to find a gap in their schedule. For most PMs, simply understanding the state of the product is where growth can stall.

Test network paths with TCP, UDP, and ICMP in Datadog

When developers and SREs design application tests, they often prioritize user workflows and API availability. Extending that suite with network tests that match your app’s traffic protocols can reveal whether issues originate in the network or application layer. In this post, we’ll explore how you can design effective network tests using the Transmission Control Protocol (TCP), User Datagram Protocol (UDP), or Internet Control Message Protocol (ICMP).

Announcing Icinga 2.16.0 and 2.15.3

We are happy to announce the release of two new versions of Icinga 2 today, 2.16.0 and 2.15.3. The first one includes some new features highlighted below, as well as a number of bug fixes and other improvements. The latter one is a small bug fix release that brings some of the other fixes included in 2.16.0 to the 2.15.x branch as well.

What Is Wrong With PaaS Today?

In the wake of the 2010s, PaaS felt like magic. You focused on the code, and the platform did the rest. You could ship a production app without knowing anything about networking or, heck, even what a load balancer is. Heroku in particular made deployment an afterthought, especially for early-stage companies. That era is largely over, not because platforms got worse overnight, but because the assumptions underneath them quietly stopped being true.