
June 2024

Create Golden Paths for your development teams with Datadog App Builder and Workflow Automation

Improving the developer experience is a chief concern for many organizations that must maintain highly complex software architectures and platforms supported by an intricate web of internal processes. Platform engineering seeks to address this with Golden Paths: self-service tools, capabilities, and processes that help engineers start new projects in a more standardized, less error-prone way.

Optimize PostgreSQL performance with Datadog Database Monitoring

PostgreSQL is a widely used open source relational database that many organizations operate as a core part of their infrastructure stack. Because databases are mission-critical, database-related issues can have outsized downstream impacts on user experience, service performance, and data retention, making it vital to identify and address problems quickly.
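As a rough illustration of the kind of query-level signal involved, the sketch below (assuming psycopg2 is installed and the pg_stat_statements extension is enabled; connection details are placeholders) lists the statements with the highest average execution time. Datadog Database Monitoring surfaces comparable query metrics automatically, without this manual work.

```python
# Minimal sketch: list the slowest PostgreSQL statements by average execution time.
# Assumes psycopg2 and the pg_stat_statements extension; connection parameters are
# placeholders. Column names are for PostgreSQL 13+ (older versions use mean_time
# and total_time instead of mean_exec_time and total_exec_time).
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="appdb", user="readonly", password="...")
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT query, calls, mean_exec_time, total_exec_time
        FROM pg_stat_statements
        ORDER BY mean_exec_time DESC
        LIMIT 10;
        """
    )
    for query, calls, mean_ms, total_ms in cur.fetchall():
        print(f"{mean_ms:9.1f} ms avg | {calls:9d} calls | {query[:80]}")
conn.close()
```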

Datadog on LLMs: From Chatbots to Autonomous Agents

As companies rapidly adopt large language models (LLMs), understanding their unique challenges becomes crucial. Join us for a special episode of "Datadog on LLMs: From Chatbots to Autonomous Agents," streaming directly from DASH 2024 on Wednesday, June 26. In this live session, host Jason Hand will be joined by Othmane Abou-Amal from Datadog’s Data Science team and Conor Branagan from the Bits AI team. Together, they will explore the world of LLMs and their applications at Datadog.

DASH 2024: Guide to Datadog's newest announcements

At this year’s DASH, we announced new products and features that enable your team to observe your environment, secure your infrastructure and workloads, and act to remediate problems before they affect customers. LLM Observability, which enables you to get deep visibility into your generative AI applications, is now generally available. The Datadog Agent now includes an embedded OTel Collector to provide native support for OpenTelemetry.

Unify your OpenTelemetry and Datadog experience with the embedded OTel Collector in the Agent

OpenTelemetry (OTel) is an open source, vendor-neutral observability solution that consists of a suite of components—including APIs, SDKs, and the OTel Collector—that allow teams to monitor their applications and services in a standardized format. OTel transmits this data via the OpenTelemetry Protocol (OTLP), a standard for encoding and transferring telemetry that organizations can use to collect, process, and route data to observability backends such as Datadog.
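For example, a service instrumented with the OTel SDK can export traces over OTLP to any OTLP-capable endpoint, whether that is a standalone Collector or an Agent configured to accept OTLP. The Python sketch below is a minimal example; the endpoint and service name are illustrative placeholders.

```python
# Minimal sketch: emit a trace over OTLP/gRPC to a local OTLP endpoint.
# Requires opentelemetry-sdk and opentelemetry-exporter-otlp; the endpoint and
# service name are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("process-order"):
    # Application work happens here; the span is batched and exported via OTLP.
    pass
```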

Monitor, troubleshoot, improve, and secure your LLM applications with Datadog LLM Observability

Organizations across all industries are racing to adopt LLMs and integrate generative AI into their offerings. LLMs have been demonstrably useful for intelligent assistants, AIOps, and natural language query interfaces, among many other use cases. However, running them in production and at an enterprise scale presents many challenges.

Track the status of all your SLOs in Datadog

Service level objectives, or SLOs, are a key part of the site reliability engineering toolkit. SLOs provide a framework for defining clear targets around application performance, which ultimately help teams provide a consistent customer experience, balance feature development with platform stability, and improve communication with internal and external users.

Best practices for managing your SLOs with Datadog

Collaboration and communication are critical to the successful implementation of service level objectives. Development and operational teams need to evaluate the impact of their work against established service reliability targets in order to improve their end user experience. Datadog simplifies cross-team collaboration by enabling everyone in your organization to track, manage, and monitor the status of all of their SLOs and error budgets in one place.

SLOs 101: How to establish and define service level objectives

In recent years, organizations have increasingly adopted service level objectives, or SLOs, as a fundamental part of their site reliability engineering (SRE) practice. Best practices around SLOs have been pioneered by Google—the Google SRE book and a webinar that we jointly hosted with Google both provide great introductions to this concept. In essence, SLOs are rooted in the idea that service reliability and user happiness go hand in hand.
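As a concrete illustration of the arithmetic behind an SLO, the sketch below uses hypothetical numbers to compute a request-based SLI against a 99.9 percent target and the share of the monthly error budget already consumed.

```python
# Minimal sketch of SLO arithmetic with hypothetical numbers: a request-based SLI,
# a 99.9% availability target, and the error budget consumed over a 30-day window.
total_requests = 12_500_000
failed_requests = 9_800

sli = (total_requests - failed_requests) / total_requests  # observed success ratio
slo_target = 0.999                                         # 99.9% objective

error_budget = (1 - slo_target) * total_requests           # failures allowed this window
budget_consumed = failed_requests / error_budget           # fraction of the budget spent

print(f"SLI: {sli:.4%}")                                   # 99.9216%
print(f"Error budget: {error_budget:.0f} failed requests allowed")
print(f"Budget consumed: {budget_consumed:.1%}")            # 78.4%
```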

Troubleshoot infrastructure faster with Recent Changes

Infrastructure changes often trigger incidents, but troubleshooting these incidents is challenging when responders have to navigate through multiple tools to correlate telemetry with configuration changes. This lack of unified observability leads to longer mean time to resolution (MTTR), greater operational stress, and ultimately, negative business outcomes.

Diagnose runtime and code inefficiencies in production by using Continuous Profiler's timeline view

When you face issues like reduced throughput or latency spikes in your production applications, determining the cause isn’t always straightforward. These performance problems don’t always stem from simple causes such as under-provisioned resources; often, the root of the problem lies deep within an application’s runtime execution.

Troubleshoot and optimize data processing workloads with Data Jobs Monitoring

Data is central to any business: it powers mission-critical applications, informs business decisions, and supports the growing adoption of AI/ML models. As a result, data volumes are only increasing, and teams rely on engines like Apache Spark and managed platforms like Databricks or Amazon EMR to process this data at scale.

Monitor your AWS generative AI stack with Datadog

As organizations increasingly leverage generative AI in their applications, ensuring end-to-end observability throughout the development and deployment lifecycle becomes crucial. This webinar showcases how to achieve comprehensive observability when deploying generative AI applications on AWS using Amazon Bedrock and Datadog.

Remediate Google Cloud issues with new actions in Workflow Automation and App Builder

Datadog Actions help you respond to alerts and manage your infrastructure directly from within Datadog. This can be done by creating workflows that automate end-to-end processes or by using App Builder to build resource management tools and self-serve developer platforms. With more than 550 available actions, Datadog Actions offers capabilities such as creating Jira tickets, resizing autoscaling groups, and triggering GitHub pipelines.

Build custom monitoring and remediation tools with Datadog App Builder

When you’re responding to an application issue in the heat of an on-call shift, you need reliable, well-maintained tooling that’s painless to use. Otherwise, the time you spend combing through monitoring data for context, connecting to hosts and other infrastructure resources, and pivoting between consoles for various managed services can add up quickly and slow your response.

Focus on code that matters with source code previews in Continuous Profiler

The use of code profiling to troubleshoot application performance can appear daunting to the uninitiated, and many software engineers even assume that this domain is reserved for niche specialists. But here at Datadog, one of the key goals for our Continuous Profiler product has been to take this seemingly intimidating practice of code profiling and make it more accessible to engineers at all levels.

State of Cloud Costs

Organizations face significant challenges in increasing the efficiency of their growing cloud spending, even as the flexibility and variety of available cloud services offer many opportunities for optimization. Cloud environments are complex and dynamic due to the breadth of services and the drive to adopt new technologies, such as Arm-based processors and GPUs that enable AI capabilities.

Monitor AWS Batch on Fargate with Datadog

AWS Batch on Fargate is an AWS offering that combines the benefits of AWS Fargate—a serverless compute engine for deploying and managing containers—with AWS Batch, a fully managed service for running batch workloads. Leveraging a pay-per-use pricing model and automatic scaling, AWS Batch on Fargate provides you with a cost-effective and scalable solution for running batch computing workloads without needing to worry about managing any underlying infrastructure.
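As a rough sketch of how work reaches such an environment, the boto3 example below submits a job to a hypothetical Fargate-backed job queue and job definition; the names and command are placeholders.

```python
# Minimal sketch: submit a job to AWS Batch with boto3. The queue, job definition,
# and command are hypothetical placeholders assumed to be registered for Fargate.
import boto3

batch = boto3.client("batch", region_name="us-east-1")

response = batch.submit_job(
    jobName="nightly-report-2024-06-30",
    jobQueue="fargate-batch-queue",       # queue attached to a Fargate compute environment
    jobDefinition="report-generator:3",   # job definition revision configured for Fargate
    containerOverrides={
        "command": ["python", "generate_report.py", "--date", "2024-06-30"],
    },
)
print("Submitted job:", response["jobId"])
```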

Monitor Snowflake Snowpark with Datadog

Snowflake is an AI data cloud platform that breaks down silos within an organization to enable wider collaboration with partners and customers for storing, managing, and analyzing data. With Snowpark and Snowpark Container Services (SPCS), organizations can leverage a set of libraries and execution environments directly in Snowflake to build applications and pipelines with familiar programming languages like Python and Java, all without having to move data across tools or platforms.
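As a small illustration of that model, the sketch below uses a Snowpark Python session to run a DataFrame-style aggregation that executes inside Snowflake; the connection parameters and ORDERS table are placeholders.

```python
# Minimal sketch: run a DataFrame aggregation in Snowflake via Snowpark Python.
# Connection parameters and the ORDERS table are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}

session = Session.builder.configs(connection_parameters).create()

# The query is pushed down and executed inside Snowflake, not on the client.
daily_revenue = (
    session.table("ORDERS")
    .filter(col("STATUS") == "COMPLETED")
    .group_by(col("ORDER_DATE"))
    .agg(sum_(col("AMOUNT")).alias("REVENUE"))
)
daily_revenue.show()
session.close()
```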

Getting started with the Datadog mobile app

The Datadog mobile app can help you make the most of the deep visibility Datadog gives you into your applications and infrastructure. In addition to helping you monitor key metrics, facilitating alerting, and smoothing the way for coordination among teams, the mobile app gives you the resources and context to investigate issues and respond to incidents from anywhere.

How Datadog's Infrastructure team manages internal deployments using the Service Catalog and CI/CD Visibility

Managing the software development lifecycle of your applications is a complex task. Releasing software updates in a large and ever-changing ecosystem requires visibility into the state of your services and insight into how changes to these services impact the reliability, performance, security, and cost of your application. The stages of software delivery are often spread across multiple tools, each purpose-built for a specific slice of your application lifecycle.