Operations | Monitoring | ITSM | DevOps | Cloud

Latest Videos

Building a Multi-Tenant Insurance Platform

In 2020, CoverWallet—a multi-tenant insurance platform—was acquired by Aon, which led to a rapid expansion in both the size and global presence of its engineering organization. In his talk, CoverWallet’s Hylke Alons walks through the changes that were necessary to meet their platform's new expectations, including improving growth and scalability while ensuring reliability, automating security, and reducing maintenance. He also discusses some best practices for scaling up engineering and product teams to handle demand in a complex and highly regulated industry like insurance.

Auditing Your Automation's Access: Using More Automation

Between CI/CD pipelines, container orchestrators, and developer debugging tools, more and more automation is needed to scale your systems. But how do you know if that automation is accessing the right systems at the right time? And how do you ensure that your automation is safe from exploits by unauthorized users?

I've Made a Huge Mistake: Implementing Agile on Infrastructure Teams

Bad planning methods can damage team morale and prevent teams from improving the systems they maintain. In this talk, Sam Handler from Shopify explains how his attempts to fix poor infrastructure planning processes through Agile methods failed. Drawing from this experience, he offers several principles that can help infrastructure teams improve the way they work.

Scaling Up, One Network Bottleneck at a Time

Processing data at scale involves moving packets through a network—but what happens when that network isn't cooperative? Anatole Beuzon, a Software Engineer at Datadog, discusses how he investigated and resolved network issues in Datadog’s larger data-processing apps and how you can apply these same methods to your own production workloads.

Ask a Site Reliability Engineer (SRE)

Site reliability engineering (SRE) can be complicated, and at Datadog, we’ve spent a lot of time thinking about SRE and refining how we implement it. Join Datadog’s Brandon West and Rick Mangi as they provide a brief overview of SRE and its core concepts. This video also contains a Q&A session from the live taping of this panel.

FinOps and Cloud Cost Optimization

As companies scale, it’s become increasingly important to keep cloud cost management and optimization top of mind. In this talk, Yuval Yogev from Sygnia walks you through Sygnia’s optimization journey of cutting their total cloud costs in half. Yogev also shares insights into how you can optimize your own organization’s cloud usage and spend.

Deploying OpenTelemetry Organizationally: From Proof of Concept to In-Production at Scale

Observability involves telling a coherent story about an entire system. Over the years, video streaming service Pluto TV has had to navigate many storytellers in terms of observability vendors, tools, and formats before settling on OpenTelemetry to analyze and compare features across its many destination platforms. During this presentation, you'll see how Bharathi Ramachandran—Engineering Manager at Pluto TV—used OpenTelemetry to implement his initial proof of concept and get his entire organization shipping observability data at scale.

Changing Perspectives: A Deep Dive into the Security Posture of 600+ Real-World AWS Environments

Earlier this year, Datadog released the “State of AWS Security” study, which examined real-world data from more than 600 organizations and AWS accounts to understand the security posture of global AWS users who also leverage the Datadog Cloud Security Platform. Join Datadog’s Christophe Tafani-Dereeper and Andrew Krug as they explore some important insights from this study, such as the top ways organizations are breached on AWS and how tooling like Datadog Cloud Security Posture Management can help.

What is Kafka?

Apache Kafka is a popular open source platform for streaming, storing, and processing high volumes of data. In this video, we break down how Kafka works and how it’s able to provide you with a reliable, scalable, and highly performant service for managing events. We also touch on some key resources for effectively monitoring your Kafka deployments via Datadog.

When Cloud Native Stacks Misbehave - Pitfalls and Lessons Learned | Itiel Shwartz (Komodor)

In this session, Itiel Shwartz will demonstrate common failure scenarios - both app and infra related. We will laugh a little and cry a little, and then cover monitoring, observability & troubleshooting best practices methodologies such as metrics, distributed tracing, logging, network visualization and more. But cheer up! We’ll wrap up by introducing some helpful tools, in order to find and fix issues as fast as possible.