
Part 1: Building a Production-Grade Traffic Capture and Replay System

A few years ago, I was on call during the Super Bowl. At the time, I was working for an observability vendor, and one of our customers had an outage caused by a surge in user traffic. But our monitoring system didn't have enough data to show what went wrong, so I sat on a call for two hours, painfully listening as they spun up more servers and tried to catch up with the user load.

Mitmproxy vs Proxymock: Replaying Traffic for Realistic API Testing

Replaying traffic is a core tool in your toolbox when you need to reproduce a tricky bug or validate how your app behaves. Traffic replay is especially valuable for testing complex software applications that rely on APIs and microservices, where integration and functionality must be thoroughly validated.
Sponsored Post

Hidden Cost of Siloed Monitoring Tools

In today's complex IT environments, organizations often rely on a patchwork of specialized monitoring tools. One platform might monitor databases, another cloud workloads, a third enterprise applications, and yet another the infrastructure itself. While each tool addresses a specific need, this fragmented approach introduces hidden costs that can undermine operational efficiency, inflate budgets, and slow response times when critical incidents occur.
Sponsored Post

47-Day Certificates Make Premium SSL Worthless

Your enterprise just paid $500 for an SSL certificate. You know what it does that a free one doesn't? Nothing. Absolutely nothing. And when the 47-day certificate mandate hits, you'll pay that $500 and then touch that cert about eight times a year, per certificate. You get the same encryption, same trust, and same padlock that Let's Encrypt gives away for free.

Incident Communication in Higher Education: How StatusHub Helps Reduce Confusion, Tickets, and Downtime

Watch a demo recorded during EDUCAUSE Demo Day: E25 Emerging Tech Preview, and learn how higher-ed institutions can improve incident communication, reduce support tickets, and deliver faster, clearer updates with StatusHub.

AI Agent for Incident Resolution: Combining Intelligence with Autonomous Actions

Incident management is a high-stakes function. IT operations teams and SRE teams may play different roles, but when a priority incident surfaces, it is often all hands on deck to resolve it as quickly as possible. That's because incidents carry high impact: if they are not resolved in time, they can cascade to other IT systems, causing downtime, business disruption, and monetary losses, and damaging brand value and regulatory compliance.

Network Monitoring for Data Centers

Kentik NMS (Network Monitoring System), part of the Kentik Network Intelligence Platform, brings true visibility and context to network operations. See how device metrics, traffic data, and application insights come together to eliminate blind spots, so your critical workloads, such as AI training and inference, run smoothly and reliably.