
Kafka

Apache Kafka Example: How Rollbar Removed Technical Debt - Part 2

April 7th, 2020 • By Jon de Andrés Frías

In the first part of our series of blog posts on how we remove technical debt using Apache Kafka at Rollbar, we covered several important topics. In the second part of the series, we'll give an overview of how our Kafka consumer works, how we monitor it, and the deployment and release process we followed to replace an old system without any downtime.
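
The full post walks through Rollbar's consumer design in detail. Purely for orientation, the core of a Kafka consumer built on the standard Java client tends to look like the sketch below; the topic and group names are placeholders, not Rollbar's actual configuration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ItemConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "item-processor"); // hypothetical consumer group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit only after processing

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("items")); // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record.value()); // application-specific processing
                }
                consumer.commitSync(); // commit offsets once the batch is handled
            }
        }
    }

    private static void process(String value) {
        System.out.println("processing " + value);
    }
}
```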

Monitoring Kafka performance metrics

Kafka is a distributed, partitioned, replicated log service developed by LinkedIn and open sourced in 2011. Basically, it is a massively scalable pub/sub message queue architected as a distributed transaction log. It was created to provide “a unified platform for handling all the real-time data feeds a large company might have.” Kafka is used by many organizations, including LinkedIn, Pinterest, Twitter, and Datadog. The latest release is version 2.4.1.
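
That “partitioned, replicated log” model is visible directly in the client API: a producer appends records to a topic partition and gets back the offset at which each record landed. A minimal sketch with the standard Java client follows; the topic, key, and payload are illustrative only.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class FeedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for the in-sync replicas to acknowledge

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key determines the partition, so records with the same key keep their relative order.
            ProducerRecord<String, String> record =
                new ProducerRecord<>("page-views", "user-42", "{\"path\": \"/pricing\"}"); // hypothetical topic/payload
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("appended to %s-%d at offset %d%n",
                        metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
            producer.flush();
        }
    }
}
```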

Collecting Kafka performance metrics

If you’ve already read our guide to key Kafka performance metrics, you’ve seen that Kafka provides a vast array of metrics on performance and resource utilization, which are available in a number of different ways. You’ve also seen that no Kafka performance monitoring solution is complete without also monitoring ZooKeeper. This post covers some different options for collecting Kafka and ZooKeeper metrics, depending on your needs.
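
Most of those broker metrics are exposed through JMX. As one illustration of the different collection paths, the sketch below reads a single broker MBean with the JDK's JMX client; it assumes remote JMX has been enabled on port 9999 on the broker, and the MBean shown is the standard broker throughput metric.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerMetricsProbe {
    public static void main(String[] args) throws Exception {
        // Assumes the broker was started with remote JMX enabled on port 9999 (e.g. JMX_PORT=9999).
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();

            // Broker-wide throughput metric exposed by Kafka as an MBean.
            ObjectName messagesIn = new ObjectName(
                "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            Object oneMinuteRate = mbeans.getAttribute(messagesIn, "OneMinuteRate");

            System.out.println("MessagesInPerSec (1-min rate): " + oneMinuteRate);
        }
    }
}
```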

Monitoring Kafka with Datadog

Kafka deployments often rely on additional software packages not included in the Kafka codebase itself—in particular, Apache ZooKeeper. A comprehensive monitoring implementation includes all the layers of your deployment so you have visibility into your Kafka cluster and your ZooKeeper ensemble, as well as your producer and consumer applications and the hosts that run them all.
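
On the consumer-application side, much of that visibility comes down to lag: how far a group's committed offsets trail the end of each partition. A rough sketch of computing it with the Java Admin client and a throwaway consumer is below; the group name and broker address are placeholders.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        String groupId = "item-processor"; // hypothetical consumer group to inspect

        Properties adminProps = new Properties();
        adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        Properties probeProps = new Properties();
        probeProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        probeProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        probeProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (AdminClient admin = AdminClient.create(adminProps);
             KafkaConsumer<String, String> probe = new KafkaConsumer<>(probeProps)) {

            // Offsets the group has committed so far, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

            // Latest offsets actually written to those partitions.
            Map<TopicPartition, Long> endOffsets = probe.endOffsets(committed.keySet());

            committed.forEach((tp, offset) -> {
                if (offset == null) return; // partition with no committed offset yet
                long lag = endOffsets.get(tp) - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```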

5 Pitfalls to Kafka Architecture Implementation

Let’s face it—distributed streaming is an exciting technology that can be leveraged in many ways. Use cases include messaging, log aggregation, distributed tracing, and event sourcing, among others. Distributed streaming can result in significant benefits for companies that choose to use it, but, when not implemented correctly, it can initiate a frustrating technical debt cycle. How do you know if you’re properly implementing Kafka in your environment?
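
The linked article covers the pitfalls themselves. As one small example of the kind of deliberate implementation choice it is getting at, creating topics explicitly with a chosen partition count, replication factor, and min.insync.replicas avoids inheriting broker auto-creation defaults; the topic name and settings below are illustrative, not a recommendation.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Choose partition count and replication factor deliberately instead of
            // relying on broker-side auto-creation defaults.
            NewTopic orders = new NewTopic("orders", 12, (short) 3) // hypothetical topic and sizing
                .configs(Map.of(
                    TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2",
                    TopicConfig.RETENTION_MS_CONFIG, "604800000")); // 7 days

            admin.createTopics(Collections.singleton(orders)).all().get();
        }
    }
}
```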

Monitor Amazon Managed Streaming for Apache Kafka with Datadog

Amazon Managed Streaming for Apache Kafka (MSK) is a fully managed service that allows developers to build highly available and scalable applications on Kafka. In addition to enabling developers to migrate their existing Kafka applications to AWS, Amazon MSK handles the provisioning and maintenance of Kafka and ZooKeeper nodes and automatically replicates data across multiple availability zones for high availability.

Kafka Data Pipelines for Machine Learning Enterprise Applications

Traditional enterprise application platforms are usually built with Java Enterprise technologies, and this is the case for OpsRamp as well. In the machine learning (ML) world, however, Python is the most commonly used language, and Java is rarely used. To develop ML components within enterprise platforms, such as the AIOps capabilities in OpsRamp, we have to run them as Python microservices that communicate with the platform's Java microservices.
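
The post describes OpsRamp's actual pipeline; as a generic sketch of the pattern, the Java side can publish a language-neutral payload over Kafka (JSON here, though the article's own serialization choice may differ) that a Python ML microservice consumes from the same topic. The topic name and event fields below are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class FeatureEventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

        // A language-neutral payload: the Python microservice on the other side of the
        // topic only needs a JSON parser, not any Java classes.
        String json = "{\"resource_id\": \"host-17\", \"metric\": \"cpu\", \"value\": 93.5}"; // hypothetical event
        ProducerRecord<String, byte[]> record =
            new ProducerRecord<>("ml-inference-requests", "host-17", json.getBytes(StandardCharsets.UTF_8));

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            producer.send(record);
            producer.flush();
        }
    }
}
```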

Avoiding death by external side effects - a tale of Kafka Streams

At Coralogix, we strive to ensure that our customers get a stable, real-time service at scale. As part of this commitment, we are constantly improving our data ingestion pipeline resiliency and performance. Coralogix ingests messages at extremely high rates — up to tens of billions of messages per day. Every one of these records needs to go through our entire pipeline at near real-time rates: validation, parsing, classification, and ingestion to Elasticsearch.
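
The article focuses on what happens when external side effects enter such a pipeline. Purely as orientation, validation, parsing, and classification stages map naturally onto a small Kafka Streams topology like the sketch below; the topic names and stage bodies are placeholders, not Coralogix's code.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LogPipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "log-pipeline"); // hypothetical application id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("raw-logs"); // hypothetical input topic

        raw.filter((key, value) -> isValid(value))   // validation
           .mapValues(LogPipeline::parse)            // parsing
           .mapValues(LogPipeline::classify)         // classification
           .to("classified-logs");                   // handed off downstream (e.g. to an Elasticsearch sink)

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    private static boolean isValid(String value) { return value != null && !value.isEmpty(); }
    private static String parse(String value)    { return value.trim(); } // stand-in for real parsing
    private static String classify(String value) { return value; }        // stand-in for real classification
}
```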

Lessons learned from running Kafka at Datadog

At Datadog, we operate 40+ Kafka and ZooKeeper clusters that process trillions of datapoints across multiple infrastructure platforms, data centers, and regions every day. Over the course of operating and scaling these clusters to support increasingly diverse and demanding workloads, we’ve learned a lot about Kafka—and what happens when its default behavior doesn’t align with expectations.