Why Your Agentic Workflow Succeeds and Still Gets It Wrong

Agentic workflows are reshaping how engineering teams operate: they fetch context, synthesize decisions, and ship results across systems without human intervention. But the same design that makes them powerful adds risk in production. Agents do not crash when they hit bad data; they synthesize around it, substituting a stale value, an empty page, or a missing field for the result they were supposed to capture.

The Next Evolution of Infrastructure Observability

Operational visibility is becoming increasingly important as infrastructure teams are asked to support AI initiatives, automation goals, cost accountability, modernization efforts, and growing operational complexity at the same time. Most are expected to do it without expanding headcount, introducing additional risk, or rebuilding the environment from scratch. Those expectations are changing the role of infrastructure operations.

Open Standards Observability - Prometheus & OpenTelemetry

Modern applications are distributed, ephemeral, and built from a dozen moving parts. To keep them reliable, you need real visibility: not just “is the server up?” but “how is this request behaving, right now, across every component it touches?” The good news is that the observability world has converged on a handful of open standards: Prometheus for metrics, OpenTelemetry for telemetry, plus battle-tested protocols like StatsD and NRPE.

What is SRE Observability and Key Pillars You Should Know?

What happens when a critical service slows down, but nothing is technically “broken”? Most teams have monitoring in place. They know when something goes down. But when performance drops or issues spread across services, finding the real cause becomes slow and unclear. Engineering teams end up switching between dashboards, logs, and alerts just to understand what changed. This delays response and increases pressure on on-call teams. This is where SRE observability becomes essential.

It Can Only Goodhart Happen

“When a measure becomes a target, it ceases to be a good measure.” Charles Goodhart, 1975. You’ve probably read this quote in relation to any number of things over the years: people complaining about arbitrary metrics like PRs merged, lines of code produced, and now, token usage. But is the era of tokenmaxxing over before it even began? The rise and death of token leaderboards at companies like Amazon seem to have taken place in less than three months!

Running the OpenTelemetry Collector as a Lambda

The OpenTelemetry Collector is usually deployed as a long-running process: a sidecar, a DaemonSet, an EC2 instance, a Docker container on my computer. It sits there listening for telemetry. That's fine when I want to send telemetry all day, but not when telemetry is rare. Like right now, when I have an agent defined on AgentCore, and it runs maybe a few times a week. Or my website that hardly sees any traffic. Can I run the OpenTelemetry Collector as a Lambda function?

MCP Servers Are Becoming a Core Interface Layer in Data Observability and Data Quality

Data observability has traditionally been built around human workflows. When data breaks, engineers are alerted, open dashboards, inspect lineage graphs, and manually trace the issue across pipelines. The system is designed for human investigation and interpretation. That model is now being challenged by the rise of AI agents in data operations. As organizations begin embedding AI into analytics, engineering, and decision-making workflows, observability is no longer just about explaining what happened; it must also enable systems to understand and act on it.
Sponsored Post

How APM fits into the modern observability stack

Most engineering teams don't have a data problem. They have an interpretation problem. Prometheus is running, logs are shipping to the aggregator, dashboards are green, and then a latency spike hits and the root cause takes 45 minutes to isolate. The data was there, but the answer wasn't. That gap is where application performance monitoring (APM) operates. This article explores what APM adds to a modern observability stack, why relying on standalone tools leaves critical blind spots, and how teams can unify infrastructure data with application context for a complete operational picture.

Claude Code Observability at Scale: How We Did It With Bindplane

At Bindplane, we iterate fast. One of the most important tools we've adopted across our organization is Claude Code. It helps every team here build solutions to complex problems with both speed and precision. But speed without visibility is a liability. We needed a reliable way to monitor and audit how Claude Code was being used across our team. Luckily, we build the best platform on the market for data in motion.