Datadog on Data Engineering Pipelines: Apache Spark at Scale
Datadog is an observability and security platform that ingests and processes tens of trillions of data points per day from more than 22,000 customers. Processing that amount of data in a reasonable time stretches the limits of well-known data engines like Apache Spark.
In addition to scale, Datadog's infrastructure is multi-cloud and runs on Kubernetes, and the data engineering platform is used by many different engineering teams, so having a good set of abstractions that make running Spark jobs easier is critical.
In this session, Ara Pulido, Staff Developer Advocate, will chat with Anton Ippolitov, Senior Software Engineer on the Data Engineering Infrastructure team, and Alodie Boissonnet, Software Engineer on the Historical Metrics Query team. They will share their journey building and maintaining their infrastructure and data engineering pipelines, as well as running and optimizing Spark batch jobs, with real-world examples.
By the end of the talk, you will have a better understanding of what value Spark brings to your organization, why Spark continues to be one of the most popular open source data engines, and how to use it at scale.
00:00 - Introduction to the episode
04:00 - Introduction to Apache Spark
12:26 - The Data Engineering Platform at Datadog
22:30 - Optimizing shuffle operations
31:49 - Tungsten format
36:50 - Parameter standardization
39:57 - Kubernetes pod allocation
44:45 - Kubernetes scheduling
49:16 - Q&A