The latest News and Information on Service Reliability Engineering and related technologies.

How Replicas Work in Kubernetes

Replicas in Kubernetes control how many copies of your pods run simultaneously. They're the foundation of scaling, availability, and recovery in your cluster. Whether you're running a stateless API or a background worker, understanding how replicas work directly impacts your application's reliability and performance. This blog walks through replica management, from basic concepts to production monitoring patterns that help you maintain healthy, scalable applications.

Instrument LangChain and LangGraph Apps with OpenTelemetry

In our previous blog, we talked about how LangChain and LangGraph help structure your agent’s behavior. But structure isn’t the same as visibility. This one’s about fixing that. Not with more logs. Not with generic dashboards. You need to see what your agent did, step by step, tool by tool, so you can understand how a simple query turned into a long, expensive run.

Prometheus Group By Label: Advanced Aggregation Techniques for Monitoring

Your Prometheus dashboard shows 847 CPU metrics. The alert fired, but is the problem in us-east or us-west? You're trying to rule out whether that new feature caused a latency spike, but the sheer number of time series isn’t helping. Grouping can make this manageable. By organizing metrics by shared label values, you can quickly spot which service or region is behaving differently, without digging through every metric.
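
To make the grouping idea concrete, here is a minimal Go sketch of what an aggregation such as `sum by (region) (...)` does conceptually: it collapses many series into one value per distinct label value. The `Sample` type, the `SumBy` helper, and the `region` and `instance` label names are assumptions made for illustration. This is not Prometheus code, and the linked article may use different labels and queries.

```go
package main

import (
	"fmt"
	"sort"
)

// Sample is a hypothetical, simplified stand-in for one time series
// at a single instant: a label set plus a value.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// SumBy groups samples by the value of a single label and sums each group.
// Samples that lack the label fall into the "" group.
func SumBy(samples []Sample, label string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		out[s.Labels[label]] += s.Value
	}
	return out
}

func main() {
	// Illustrative data only; the label names and values are made up.
	samples := []Sample{
		{Labels: map[string]string{"region": "us-east", "instance": "a"}, Value: 1.5},
		{Labels: map[string]string{"region": "us-east", "instance": "b"}, Value: 0.5},
		{Labels: map[string]string{"region": "us-west", "instance": "c"}, Value: 2},
	}

	byRegion := SumBy(samples, "region")

	// Sort the group keys so the output order is stable.
	regions := make([]string, 0, len(byRegion))
	for r := range byRegion {
		regions = append(regions, r)
	}
	sort.Strings(regions)

	for _, r := range regions {
		fmt.Printf("region=%s total=%.2f\n", r, byRegion[r])
	}
}
```

Three per-instance series become two per-region totals, which is the reduction that makes a large set of series easier to compare.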

Docker Status Unhealthy: What It Means and How to Fix It

If your container shows Status: unhealthy, Docker's health check is failing. The container is still running, but something inside, usually your app, isn’t responding as expected. This doesn’t always mean a crash. It just means Docker can’t verify the app is working. Here’s how to debug the issue and restore the container to a healthy state.

LangChain Observability: From Zero to Production in 10 Minutes

LangChain apps are powerful, but they’re not easy to monitor. A single request might pass through an LLM, a vector store, external APIs, and a custom chain of tools. And when something slows down or silently fails, debugging is often guesswork. In one instance, a developer ended up with an unexpected $30,000 OpenAI bill, with no visibility into what triggered it. This blog shows how to avoid that using OpenTelemetry and LangSmith. With this setup, you’ll be able to.

Balancing Reliability at the Crypto-Finance Frontier with Brian Shaw (Uphold)

Sylvain Kalache sits down with Brian Shaw, Senior Engineering Leader at Uphold, to explore the reliability challenges that arise when operating at the intersection of traditional finance and crypto markets. Brian shares how unexpected market events can create massive traffic spikes, how their platform architecture and Kubernetes setup help them stay resilient, and why Uphold's transparency and regulatory approach make them both trustworthy and a high-profile target.

LangChain & LangGraph: The Frameworks Powering Production AI Agents

Your AI agent worked flawlessly in development, with fast responses, clean tool use, and nothing out of place. Then it hit production. A simple "What's our pricing?" query triggered six API calls, took 8 seconds, and returned the wrong answer. No errors. No stack traces. Unlike traditional systems, AI agents don't crash; they drift. They make poor decisions quietly, and your monitoring says everything's fine.

How to Run Elasticsearch on Kubernetes

Elasticsearch stands as one of the most robust open-source search engines available today. Built on Apache Lucene, it handles complex search operations, real-time analytics, and large-scale data processing with impressive speed and accuracy. Kubernetes has transformed how we deploy and manage containerized applications. This orchestration platform automates deployment, scaling, and operations of application containers across clusters of hosts.

Logging in Docker Swarm: Visibility Across Distributed Services

Docker Swarm's logging model shifts from individual container logs to service-level aggregation. The docker service logs command batch-retrieves logs present at the time of execution, pulling data from all containers that belong to a service across your cluster. This approach gives you a unified view of distributed applications, but it comes with its own patterns and considerations for effective observability.

How to Write Logs to a File in Go

When your Go application moves beyond development, you need structured logging that persists. Writing logs to files gives you the control and reliability that stdout can't match, especially when you're debugging production issues or need to meet compliance requirements. This blog walks through the practical approaches, from Go's standard library to structured logging with popular packages.
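
As a rough illustration of the two approaches mentioned above, the sketch below writes to a file with the standard `log` package and then with the structured `log/slog` package (available since Go 1.21). The file name `app.log`, the `OpenLogFile` helper, and the logged fields are assumptions chosen for the example; the linked article may use different packages or settings.

```go
package main

import (
	"log"
	"log/slog"
	"os"
)

// OpenLogFile opens path for appending, creating it if it does not exist.
// The 0o644 permission is an illustrative choice, not a recommendation.
func OpenLogFile(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
}

func main() {
	// "app.log" is a hypothetical file name.
	f, err := OpenLogFile("app.log")
	if err != nil {
		log.Fatalf("open log file: %v", err)
	}
	defer f.Close()

	// Plain-text logging with the standard library logger.
	plain := log.New(f, "app ", log.LstdFlags|log.Lshortfile)
	plain.Println("service started")

	// Structured JSON logging with log/slog, written to the same file.
	structured := slog.New(slog.NewJSONHandler(f, nil))
	structured.Info("request handled", "status", 200, "path", "/healthz")
}
```

Opening the file with `os.O_APPEND` means restarts add to the existing log instead of truncating it. Rotation and retention are left out of this sketch.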