Track the performance of your HPC workloads with Datadog's AWS PCS integration

AWS Parallel Computing Service (AWS PCS) is a managed service that helps users run and scale their high performance computing (HPC) workloads. AWS PCS uses Slurm, an open source workload manager, for scheduling and orchestrating simulations, which enables users to build their scientific and engineering models in a familiar HPC environment.

Monitor Windows Certificate Store with Datadog

The Windows Certificate Store is a critical component of any modern Windows environment. Certificates enable TLS encryption for Internet Information Services (IIS)-hosted applications, support certificate-based authentication in Active Directory, and help validate the identity of trusted Windows services. But if a certificate in your store expires, is revoked, or is part of a broken certificate chain, you risk instability and security gaps in your Windows environment.

Visually identify observability gaps with Cloudcraft in Datadog

Modern cloud environments are highly complex and dynamic, with critical services relying on large numbers of ephemeral resources. Ensuring observability coverage across this landscape is essential for troubleshooting, maintaining reliability, optimizing performance, and enforcing security standards. But as environments grow more elaborate and their ownership more dispersed, tracking observability coverage becomes increasingly challenging.

A practical guide to error handling in Go

When you first start coding in Go, you quickly learn how error handling in the language differs from error handling in languages such as Java, Python, JavaScript, or Ruby. In those languages, throwing an exception automatically generates a stack trace. Go, by contrast, provides no built-in error tracing to reveal an error’s origin.

Understanding dbt: basics and best practices

Data Build Tool (dbt) is an open source analytics engineering framework that enables teams to use SQL-based workflows to transform raw data that has already been loaded into a warehouse such as Snowflake, BigQuery, Redshift, or Databricks. dbt is available in two main forms: dbt Core, the free and open source CLI tool, and dbt Cloud, a managed platform that adds scheduling, UI support, collaboration tools, and native integrations.

Visually identify and prioritize security risks using Cloudcraft

As cloud infrastructure becomes more dynamic and distributed, DevOps and security teams need to quickly detect risks and understand their context: where those risks live, how critical they are, and how to respond effectively. By surfacing misconfigurations, vulnerabilities, sensitive data risks, and identity threats directly on a real-time diagram of your infrastructure, Cloudcraft helps teams identify, prioritize, and remediate security issues at scale.

This Month in Datadog - August 2025

In the August episode of This Month in Datadog, Jeremy shares how you can make more informed cloud cost decisions, gain insights into your LiteLLM-powered applications, and secure Kubernetes infrastructure with Datadog Workload Protection. Later in the episode, Danny puts the spotlight on Datadog Kubernetes Autoscaling, which helps you deliver cost savings without sacrificing performance.

Eliminate cloud waste across AWS, Azure, and Google Cloud with Cloud Cost Recommendations

As organizations increasingly adopt multi-cloud strategies, identifying areas to reduce cloud spend has become highly complex and time-consuming. Organizations choose to run their infrastructure in a multi-cloud environment for a range of reasons, most commonly to comply with regional data requirements, take advantage of best-of-breed offerings, or avoid vendor lock-in.

We vibe coded a path tracer: Here's how we used static and dynamic analysis to fix it

When developing software, the longer you intend to keep a system around, the more important it becomes to prioritize its code quality. But as more organizations move toward microservice architectures and bring agentic AI and LLMs into their development workflows, many engineering teams have increased their emphasis on accelerating developer velocity, often at the expense of code quality. The result can be code that fails to meet standards for performance, reliability, and security.

What's new for scheduling and resource management in Kubernetes v1.34?

Kubernetes v1.34, which is scheduled for release August 27, 2025, focuses on improved scheduler visibility, deeper life cycle observability, and enhanced resource management. As always, the list of changes and improvements in the official changelog is extensive, and cluster operators may be wondering which changes are most important. If you're operating a monitoring platform or depend on deep Kubernetes observability, here's how a number of new features will affect your workflows.