The Journey to Achieving Hyperscale Availability with AI-Driven Prediction

Jun 30, 2026

At hyperscale, a regional cloud outage is not merely a technical disruption—for Samsung Account, which serves 2.1 billion users across three global regions, it is an immediate global service crisis. Fragmented, region-siloed monitoring creates blind spots that make early detection nearly impossible, leaving SRE teams perpetually reactive rather than predictive. The path to proactive reliability requires both a philosophical shift and a foundational change in how observability data is collected, unified, and reasoned over.

In this session, Samsung’s Je Min Kim (Dev Lead) and Junhee Kim (DevOps Engineer) share how their team built an agentic AI platform whose AIOps Agent, using Datadog MCP Server, predicted a major regional cloud failure before it happened. They also explain how that event catalyzed the rebuilding of their telemetry strategy around a single source of truth.

They walk through the real outage as a case study, in which AI-assisted analysis surfaced subtle precursor signals across the service, infrastructure, managed database, and DNS layers that no single alert would have caught. They also explain how Observability Pipelines and CloudPrem now serve as the unified telemetry foundation that enables their AI systems to reason with greater confidence at global scale.