The latest News and Information on Log Management, Log Analytics and related technologies.

Bindplane | Blueprints for ClickHouse: Optimize Telemetry Before It Hits ClickStack

Chelsea from the Customer Success team walks through the Bindplane Blueprints for ClickHouse guide, showing how to optimize logs, metrics, and traces before they land in ClickStack. ClickHouse is powerful, but raw telemetry at scale gets expensive fast. Bindplane acts as the control plane for your OpenTelemetry infrastructure. Blueprints let you apply production-ready processing logic instantly without YAML sprawl or config drift.

OpenTelemetry Production Monitoring: What Breaks, and How to Prevent It

OpenTelemetry almost always works beautifully in staging, demos, and videos. You enable auto-instrumentation, spans appear, metrics flow, the collector starts, and dashboards light up. Everything looks clean and predictable. However, production has a way of humbling even the most carefully prepared setups. When real traffic hits, and it always spikes sooner or later, you start seeing dropped spans.

How a Singleton Pattern Broke Our Django Logging

With modern tooling and agentic coding assistants, straightforward bugs are almost a relief. If a test can catch it, or a user can reproduce it, chances are you can squash it quickly. The harder category, and the one worth writing about, is the bugs where everything looks correct. Your code runs, no exceptions are thrown, your debug statements confirm the right functions fire at the right times, and yet nothing works.

Introducing "Explain Flame Graph": Stop Fighting Fires and Start Explaining Them

In a modern observability deployment, it’s simple to get data that helps you understand where your system is failing. However, when we try to understand why, the answer is often buried beneath a mound of stack traces. For many developers, attempting to interpret a flame graph by manually calculating self-time (the resources consumed by the function itself) versus child-frame latency (the time spent waiting on called sub-functions) is both confusing and time-consuming.

Troubleshooting Microservices with OpenTelemetry Distributed Tracing

Distributed tracing doesn’t just show you what happened. It shows you why things broke. While logs tell you a service returned a 500 error and metrics show latency spiked, only traces reveal the full chain of causation: the upstream timeout that triggered a retry storm, the N+1 query pattern that saturated your connection pool, or the missing cache hit that turned a 50ms call into a 3-second database roundtrip.

AI observability: The backbone of mission resilience in the public sector

Downtime cost the public sector $193 million last year — and the financial hit is only the beginning. Beyond the numbers, downtime in the public sector can also lead to severe consequences for citizens: interrupted access to critical online services, delayed benefits, and stalled emergency response. When citizens cannot rely on government services, downtime becomes more than an inconvenience; it becomes a matter of trust. More than uptime, resilience is the new success metric for modern government.

Troubleshooting & RCA with Olly

If troubleshooting still feels harder than it should, check these two numbers: how many dashboards you have, and how many alerts fire every day. For most teams, it’s hundreds of dashboards and thousands of alerts, a sign of maturity, coverage, and good intentions. Yet when something actually breaks, that coverage rarely turns into clarity fast enough.

Splunk Attack Range v5 Demo

The Splunk Attack Range is an open source project that lets security teams spin up instrumented cloud environments, simulate adversary behavior, and use the generated telemetry to build and test detections in Splunk. Whether you are a detection engineer tuning rules, a purple team validating coverage, or a developer automating tests, Attack Range gives you a repeatable, cloud-based lab. This post highlights what Attack Range does, how it works, and how to get started, whether you prefer a web UI, a REST API, or the command line.