Latest Posts

Monitor, troubleshoot, improve, and secure your LLM applications with Datadog LLM Observability

Jun 26, 2024 By Thomas Sobolik In Datadog

Organizations across all industries are racing to adopt LLMs and integrate generative AI into their offerings. LLMs have been demonstrably useful for intelligent assistants, AIOps, and natural language query interfaces, among many other use cases. However, running them in production at enterprise scale presents many challenges.

Track the status of all your SLOs in Datadog

Jun 24, 2024 By Meghan Jordan In Datadog

Service level objectives, or SLOs, are a key part of the site reliability engineering toolkit. SLOs provide a framework for defining clear targets around application performance, which ultimately help teams provide a consistent customer experience, balance feature development with platform stability, and improve communication with internal and external users.

Best practices for managing your SLOs with Datadog

Jun 24, 2024 By Mark Azer In Datadog

Collaboration and communication are critical to the successful implementation of service level objectives. Development and operations teams need to evaluate the impact of their work against established service reliability targets in order to improve the end-user experience. Datadog simplifies cross-team collaboration by enabling everyone in your organization to track, manage, and monitor the status of all of their SLOs and error budgets in one place.

SLOs 101: How to establish and define service level objectives

Jun 24, 2024 By Mark Azer In Datadog

In recent years, organizations have increasingly adopted service level objectives, or SLOs, as a fundamental part of their site reliability engineering (SRE) practice. Best practices around SLOs have been pioneered by Google—the Google SRE book and a webinar that we jointly hosted with Google both provide great introductions to this concept. In essence, SLOs are rooted in the idea that service reliability and user happiness go hand in hand.

Troubleshoot infrastructure faster with Recent Changes

Jun 21, 2024 By Sriram Raman In Datadog

Infrastructure changes often trigger incidents, but troubleshooting these incidents is challenging when responders have to navigate through multiple tools to correlate telemetry with configuration changes. This lack of unified observability leads to longer mean time to resolution (MTTR), greater operational stress, and ultimately, negative business outcomes.

Diagnose runtime and code inefficiencies in production by using Continuous Profiler's timeline view

Jun 20, 2024 By Guillaume Turbat In Datadog

When you face issues like reduced throughput or latency spikes in your production applications, determining the cause isn’t always straightforward. These kinds of performance problems might not arise for simple reasons such as under-provisioned resources; often, the root of the problem lies deep within an application’s runtime execution.

Troubleshoot and optimize data processing workloads with Data Jobs Monitoring

Jun 20, 2024 By Fionce Siow In Datadog

Data is central to any business: it powers mission-critical applications, informs business decisions, and supports the growing adoption of AI/ML models. As a result, data volumes are only increasing, and teams rely on engines like Apache Spark and managed platforms like Databricks or Amazon EMR to process this data at scale.

Remediate Google Cloud issues with new actions in Workflow Automation and App Builder

Jun 18, 2024 By Syed Sarjeel Yusuf In Datadog

Datadog Actions help you respond to alerts and manage your infrastructure directly from within Datadog. This can be done by creating workflows that automate end-to-end processes or by using App Builder to build resource management tools and self-serve developer platforms. With more than 550 available actions, Datadog Actions offers capabilities such as creating Jira tickets, resizing autoscaling groups, and triggering GitHub pipelines.

Build custom monitoring and remediation tools with Datadog App Builder

Jun 17, 2024 By Thomas Sobolik In Datadog

When you’re responding to an issue with your application in the heat of on-call, you need reliable, well-maintained tooling that’s painless to use. Otherwise, the time you’ll spend combing through monitoring data for context, connecting to hosts and other infrastructure resources, and pivoting between consoles for various managed services can add up quickly and slow your response.

Remediate faster with apps built using Datadog App Builder

Jun 17, 2024 By Alex Flinois In Datadog

When troubleshooting an issue or remediating an outage, engineers need tools that are accessible and easy to use, closely integrated with their services, and tailored to their teams’ specific requirements.

Operations | Monitoring | ITSM | DevOps | Cloud
