Improve your shift-left observability with the Datadog Service Catalog

Your applications are only as powerful as they are iterable. To keep up with their rapidly changing production environments, your teams need reliable CI/CD systems that implement best practices—including build and test automation, flaky test management, and deployment management. By optimizing their CI/CD pipelines, your teams can build their apps more efficiently, deploy them more safely, and catch bugs and security vulnerabilities before they make it to production.

Investigate your log processing with the Datadog Log Pipeline Scanner

Large-scale organizations typically collect and manage millions of logs a day from various services. Within these orgs, many different teams may set up processing pipelines to modify and enrich logs for security monitoring, compliance audits, and DevOps. Datadog Log Pipelines let you ingest logs from your entire stack, parse and enrich them with contextual information, add tags for usage attribution, generate metrics, and quickly identify log anomalies.

Monitor Ray applications and clusters with Datadog

Ray is an open source compute framework that simplifies the scaling of AI and Python workloads for on-premise and cloud clusters. Ray integrates with popular libraries, data stores, and tools within the machine learning (ML) ecosystem, including Scikit-learn, PyTorch, and TensorFlow. This gives developers the flexibility to scale complex AI applications without making changes to their existing workflows or AI stack.

Track service provider outages with IsDown and Datadog

When your apps and infrastructure rely on dozens of third-party providers for key functionality, it’s important to closely track their outages. If a service you rely on goes down, you need to move quickly to limit the outage’s impact on your users. IsDown provides a detailed status page aggregator and uptime monitoring for all your third-party dependencies.

Monitor your chaos engineering experiments with Steadybit's offering in the Datadog Marketplace

Steadybit is a software reliability platform that uses chaos engineering and fault injection to help organizations improve the stability and performance of their applications. By letting you simulate turbulent scenarios in a controlled environment, Steadybit helps you identify and mitigate potential system issues, reducing downtime and improving resilience.

A deep dive into CPU requests and limits in Kubernetes

In a previous blog post, we explained how containers’ CPU and memory requests can affect how they are scheduled. We also introduced some of the effects CPU and memory limits can have on applications, assuming that CPU limits were enforced by the Completely Fair Scheduler (CFS) quota. In this post, we are going to dive a bit deeper into CPU and share some general recommendations for specifying CPU requests and limits.
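
As a rough illustration of the CFS quota mentioned above, the sketch below shows how a CPU limit and a CPU request are commonly translated into scheduler settings. It assumes cgroup v1 conventions and the default 100 ms CFS period; the function names are hypothetical and are not taken from the post.

```python
# Minimal sketch, assuming cgroup v1 conventions and the default CFS period.
# Function names are illustrative, not part of any Kubernetes or Datadog API.

CFS_PERIOD_US = 100_000  # assumed default CFS period: 100 ms, in microseconds


def cfs_quota_us(cpu_limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time, in microseconds, a container may consume per CFS period."""
    return cpu_limit_millicores * period_us // 1000


def cpu_shares(cpu_request_millicores: int) -> int:
    """Relative CPU weight derived from the request (1 CPU maps to 1024 shares)."""
    return max(2, cpu_request_millicores * 1024 // 1000)


if __name__ == "__main__":
    # A 500m limit allows 50 ms of CPU time in every 100 ms period.
    print(cfs_quota_us(500))
    # A 250m request maps to a quarter of one CPU's share weight.
    print(cpu_shares(250))
```

Under these assumptions, a container that uses up its quota before the period ends is throttled until the next period begins, which is why a low limit can add latency even when the node has idle CPU.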

Highlights from AWS re:Invent 2023

Whether or not you made the journey to this year’s re:Invent, there’s always a variety of great announcements lost amid an action-packed week of keynotes, breakouts, expo hall demos, and networking sessions. No need to worry—we’re always happy to be a big part of the re:Invent experience and share our observations with you.

Introducing CoTerm, your collaborative terminal for pair programming and debugging

For too long, engineers have had to piece together an unwieldy combination of tools to collaboratively debug and resolve incidents while pair programming in real time. These activities normally require developers to work individually through a terminal, but the patchwork solutions that allow teams to work together in terminals all have significant drawbacks.

Monitor Amazon S3 Express One Zone with Datadog

Amazon Simple Storage Service (S3) now offers a high-performance storage class, S3 Express One Zone, that delivers consistent single-digit millisecond data access for your most latency-sensitive applications. Designed for your most frequently accessed datasets, S3 Express One Zone replicates and stores your data within a single AWS Availability Zone, scales to process millions of requests per minute, and uses hardware and software optimized for low latency.

Govern your infrastructure resources with the Datadog Resource Catalog

As an administrator of an expanding, highly distributed infrastructure, you may be responsible for overseeing thousands of on-premise and cloud resources from multiple providers—governed under dozens of accounts by a complex nest of RBAC rules. To query all these resources for purposes such as compliance audits and access management, you may be required to write custom scripts and painstakingly sift through data across disparate tools.