Datadog on Kubernetes Node Management

Datadog

Oct 18, 2023

Datadog, the observability platform used by thousands of companies, runs on dozens of self-managed Kubernetes clusters in a multi-cloud environment, adding up to tens of thousands of nodes and hundreds of thousands of pods. This infrastructure is used by a wide variety of engineering teams at Datadog, each with different feature and capacity needs.

How do we make sure that tens of thousands of nodes, with very different specifications and on different clouds, are healthy, updated with the latest security patches, and running a current version of the kubelet and container runtime? And how do we do that without breaking applications or interrupting the more than a thousand engineers who rely on this infrastructure for their daily work?

In this episode, Adrien Trouillaud, Engineering Manager, and David Benque, Staff Software Engineer, both part of the Compute team, share their strategies, lessons learned, and practical tips on how to successfully manage a huge fleet of Kubernetes nodes.

00:00 - Introduction

03:57 - Life of a Node

09:14 - Provisioning & Autoscaling

14:06 - Scaling Down

19:06 - Eviction API

34:00 - Draining

42:56 - Node Problem Detector

47:14 - Mass Eviction

52:12 - Q&A