Training Foundation Models on a Trillion Data Points with Apache Iceberg
Training an AI foundation model on over a trillion data points sounds impossible without hammering your production systems. Here's how Datadog did it with Apache Iceberg for TOTO, their time-series forecasting model.
The key challenge: extracting massive historical observability data (metrics spanning years) and running incremental preprocessing pipelines without overwhelming production services. Iceberg solved this by providing schema governance, snapshot-based consistency guarantees, and integration with ML tools like Ray and PyTorch.
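The incremental part can be sketched as a watermark over table snapshots: each run preprocesses only the data appended since the last snapshot it finished. This is a minimal, self-contained Python sketch with hypothetical names and an in-memory history; a real pipeline would read snapshot metadata from Iceberg itself (e.g. via PyIceberg) rather than a list.

```python
# Sketch of snapshot-watermark incremental preprocessing (hypothetical names).
# In a real Iceberg pipeline the history would come from the table's snapshot
# log, not from this in-memory list.

def incremental_batches(history, watermark):
    """Yield only (snapshot_id, rows) pairs newer than the watermark."""
    for snap_id, rows in history:
        if snap_id > watermark:
            yield snap_id, rows

# Simulated append-only history: (snapshot_id, rows added in that snapshot).
history = [
    (100, ["metric_a", "metric_b"]),
    (200, ["metric_c"]),
    (300, ["metric_d", "metric_e"]),
]

watermark = 100  # snapshot 100 was already preprocessed in a prior run
processed = []
for snap_id, rows in incremental_batches(history, watermark):
    processed.extend(rows)  # stand-in for the real preprocessing step
    watermark = snap_id     # advance the watermark only after success

print(processed)   # rows appended after snapshot 100
print(watermark)   # last snapshot fully processed
```

Because the watermark advances only after a batch succeeds, a failed run simply replays the same snapshots next time instead of re-reading years of history from production stores.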
The future gets even more exciting with GPU-optimized data formats coming to Iceberg. If you're building ML infrastructure or training models at scale, this is essential viewing.