Operations | Monitoring | ITSM | DevOps | Cloud

Understanding Kafka with Speedscale

In this video, we're breaking down the complex world of Apache Kafka and showing you how to gain deep visibility into your event streaming architecture using Speedscale. Kafka is the backbone of modern, cloud-native systems, but understanding what's happening in production (which topics are receiving traffic, where messages are going, and how services are interacting) can be a real challenge. We'll cover how Speedscale makes Kafka visualization and debugging simple by.

How Much Did OpenAI's 30,000 CPU Core Optimization Save Them?

I admit I was a little skeptical going into KubeCon 2025. The last time I went, in 2022, it felt tactical. I heard lots of conversations around small solutions to small problems. Practical knowledge-sharing is of course beneficial, but I’m most inspired by the big picture — ideally, a picture bigger than you can see anywhere outside of your mind. I’m heartened to say that KubeCon 2025 was exactly that.

5 Reasons to Switch to the Calico Ingress Gateway (and How to Migrate Smoothly)

The Ingress NGINX Controller is approaching retirement, which has pushed many teams to evaluate their long-term ingress strategy. The familiar Ingress resource has served well, but it comes with clear limits: annotations that differ by vendor, limited extensibility, and few options for separating operator and developer responsibilities. The Gateway API addresses these challenges with a more expressive, standardized, and portable model for service networking.

Introducing Bits AI SRE, your AI on-call teammate

Bits AI SRE is your AI on-call teammate, built to autonomously investigate alerts and coordinate incident response. Integrated with Datadog, Slack, GitHub, Confluence, and more, Bits analyzes telemetry, reads documentation, and reviews recent deployments to determine the root cause of alerts—often before you’ve even opened your laptop. In fact, if you're using Datadog On-Call, you can view Bits’s findings right from your phone—so you’re always one step ahead, no matter where you are.

Inside The Builders Era: Why Developer Craft Matters More Than Ever

The software world has spent the last two years obsessed with one question: “Will AI replace developers?” Wrong question. The right question is: “How do developers stay in control while AI becomes part of the toolchain?” Welcome to The Builders Era, where the craft of software development and AI’s computational power meet on developer terms. Not as a replacement narrative. Not as a threat to our profession.

Mezmo + Catchpoint deliver observability SREs can rely on

For SREs juggling multiple services, third-party dependencies, and constant alerts, a critical service slowdown can quickly turn into chaos. APM dashboards may show everything is fine, yet users are still experiencing problems. That gap between application telemetry and real-world performance can turn a five-minute fix into a two-hour war room.

Build custom apps in seconds with conversational AI in App Builder

Datadog App Builder is a low-code tool for creating internal apps, making use of a drag-and-drop interface that allows engineering teams to troubleshoot issues, optimize operations, and enable self-service while connecting directly to their Datadog data and permissions. Now, with conversational AI, teams can go from idea to working prototype even faster.

What's Special About MCP?

AI agents can interact with the world using tools. Those tools can be generic or specific. The most general ones, like “run a bash command” and “read and write files”, are built into the agent. More specific ones are provided through Model Context Protocol (MCP) servers. Every tool provided to the agent comes with instructions sent as part of the context.
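As a rough illustration of how a tool's instructions can travel as part of the context, here is a minimal sketch. The `Tool` class, the `render_tool_context` helper, and the `lookup_ticket` tool are all hypothetical names made up for this example; this is not the MCP SDK. The only thing borrowed from MCP is the general shape of a tool listing: a name, a description, and a JSON Schema for the inputs.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Tool:
    """Hypothetical tool descriptor: a name, instructions, and a JSON Schema for inputs."""
    name: str
    description: str
    input_schema: dict = field(default_factory=dict)


def render_tool_context(tools: list[Tool]) -> str:
    """Serialize tool descriptors into the text an agent would receive as context."""
    payload = [
        {"name": t.name, "description": t.description, "inputSchema": t.input_schema}
        for t in tools
    ]
    return json.dumps(payload, indent=2)


# A generic tool, of the kind that is built into the agent.
run_bash = Tool(
    name="run_bash",
    description="Run a bash command and return its output.",
    input_schema={
        "type": "object",
        "properties": {"command": {"type": "string"}},
        "required": ["command"],
    },
)

# A more specific tool, of the kind a server might provide (made up for illustration).
lookup_ticket = Tool(
    name="lookup_ticket",
    description="Fetch a support ticket by its ID.",
    input_schema={
        "type": "object",
        "properties": {"ticket_id": {"type": "string"}},
        "required": ["ticket_id"],
    },
)

# Every tool's description ends up in the text handed to the model.
context = render_tool_context([run_bash, lookup_ticket])
print(context)
```

The point of the sketch is that each tool adds its description and schema to the context, so the number and wording of tools directly affects what the agent sees.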

Installing TrackJS on CertKit

I recorded a video showing how to properly set up TrackJS for a new production website, specifically CertKit, our new certificate lifecycle management tool. The key to effective error monitoring isn’t just installing the tracking snippet; it’s configuring the system to surface real issues while filtering out the noise. I configure a forwarding domain (errors.certkit.io) to bypass ad blockers that might prevent error reporting.

Top Causes of Data Center Outages and How You Can Reduce Risk

Outages are less common than they once were, but when they happen, the impact is severe. According to the Uptime Institute Global Data Center Survey 2025, half of data center operators reported at least one impactful outage in the past three years, and one in ten of those caused a serious or severe disruption. The financial risk is just as significant. Twenty percent of operators said their most recent outage cost more than $1 million when accounting for downtime, recovery, and reputational damage.