The API tests passed. The database didn't.

We shipped v2 of a small products API on a Thursday. Green CI. Green replay. The new search endpoint worked. I went home feeling competent. Friday morning I ran the same traffic against both builds with proxymock and compared the SQL. v2 issued 80 more queries than v1 for the same HTTP script. A per-product audit COUNT was firing inside the list handler. A startup migration had run ALTER TABLE and CREATE TABLE audit_log. Total DB time was up 70 ms on a demo that should have been boring.

Trace without traces

A customer emailed on a Tuesday: checkout hung for ten seconds. I opened our tracing tool, punched in the time window, and got nothing. The trace was sampled out. We keep 1% of traces, like most shops with real traffic do. The one request that actually mattered was in the 99% we threw away. I spent twenty minutes admiring our observability stack before admitting it couldn’t answer a first-grader’s question: what happened to this person? Here’s what I know now.

The Three Pillars Were Built for Humans

It was 2am and I was paying for the privilege. Something was on fire in production, and I’d done the modern thing: I pointed an AI agent at it. It ingested the dashboards. It read the logs. It walked the traces. Then it handed me back a beautifully formatted paragraph that said, in effect, “latency is elevated on the checkout path.” I knew that. The page told me that.

Which Bugs AI Agents Fix Better With Traffic

In the first experiment, I wanted a baseline: if an AI coding agent gets the same production signal a human would get, can it fix bugs in a codebase it has never seen? Yes, but only when I gave it better context. With only an alert, the agent passed 51% of the runtime tests. When I added captured traffic, the actual request and response for the failing call, it climbed to 77%. This post is the second pass.

Five things your logs will never tell you

A customer escalation hit my queue when I was on the customer smoke jumpers team at an observability vendor. My team was the group that parachutes into Fortune 500 accounts that are one bad week from churning, usually after a big customer outage. The customer had filed a billing dispute three weeks earlier and their on-call engineers were stuck. They had our full stack: logs, metrics, traces, end-to-end instrumentation, every product we sold and some we didn't. They could see the request came in. They could see it returned a 500. They could not see the body. The trace was sampled out. The log line was truncated at 4 KB.

Beware of PII in Testing Data: The Security Iceberg and Where PII Actually Hides

If you run a platform tools or security team, you have likely heard this request from developers: “I just need a copy of the production database for staging so I can run realistic load and integration tests.” It is a completely reasonable request. Production traffic and data contain the actual request shapes, real-world value distributions, long-tail anomalies, and timing patterns that make tests useful.

Build WireMock mappings fast from real traffic

I’m a big fan of service mocking. I’ve been working in and around software for about 25 years, and one thing never changes: when you sit down to work on your code, you almost never have everything available. The database, the third-party API, the message queue, the service two teams over. Something’s missing. So you’ve got to stub it out or mock it out and keep moving.

Capture once, test forever

We’ve gotten used to understanding our applications through signals, summaries, and traces. Tiny little bits of information about how the app really works. Not because that’s the best way to do it, but because it’s been too hard to get the real thing. The real information exists. It’s on the network. How people called your app and what your code did. What other systems it called, the database queries it made, and the result sets that came back.

Fixing 403 auth errors when you replay traffic

Trigger warning: this one is about Java, authentication, and Docker Compose files. If that is not your thing, I am sorry, but they are part of life and they are honestly not that hard to work with. Everything here is open source on our GitHub repo, so you can follow along. We will record an authenticated Java flow, replay it, hit the dreaded 403, and fix it with a proxymock recommendation.

“We won’t train on your data” is not a security architecture

Every enterprise contract I’ve signed in the last two years has the same clause. “Vendor will not use Customer Data to train machine learning models.” Sometimes it’s a paragraph. Sometimes it’s a whole section. The language varies but the intent is identical: don’t feed our production data into your AI. I get it. I sign the same clause as a vendor. But here’s what’s been bothering me: that clause is a promise, not an architecture.