Datadog on Data Engineering Pipelines: Apache Spark at Scale
Datadog is an observability and security platform that ingests and processes tens of trillions of data points per day from more than 22,000 customers. Processing that amount of data in a reasonable time stretches the limits of well-known data engines like Apache Spark.
In addition to scale, Datadog's infrastructure is multi-cloud and runs on Kubernetes, and the data engineering platform is used by many different engineering teams, so having a good set of abstractions that make running Spark jobs easier is critical.
In this session, Ara Pulido, Staff Developer Advocate, will chat with Anton Ippolitov, Senior Software Engineer on the Data Engineering Infrastructure team, and Alodie Boissonnet, Software Engineer on the Historical Metrics Query team. They will share their journey building and maintaining their infrastructure and data engineering pipelines, as well as running and optimizing Spark batch jobs, with real-world examples.
By the end of the talk, you will have a better understanding of what value Spark brings to your organization, why Spark continues to be one of the most popular open source data engines, and how to use it at scale.
00:00 - Introduction to the episode
04:00 - Introduction to Apache Spark
12:26 - The Data Engineering Platform at Datadog
22:30 - Optimizing shuffle operations
31:49 - Tungsten format
36:50 - Parameter standardization
39:57 - Kubernetes pod allocation
44:45 - Kubernetes scheduling
49:16 - Q&A