Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

Troubleshoot anomalies in workload performance with Watchdog Insights and Alerts for Live Processes

Processes—the service workloads that run on your infrastructure—are the building blocks of your application, and it’s critical to know how well they operate at every level of the stack. Degraded process performance can lead to downtime for your mission-critical services, resulting in loss of customer trust and potentially impacting revenue for the business.

How to monitor etcd with Datadog

So far in this series, we’ve walked through key etcd metrics and the tools you can use to collect etcd metrics and logs. In this post, we’ll show you how you can monitor etcd with Datadog. But first, we’ll show you how to set up and configure the Datadog Agent and Cluster Agent to send etcd monitoring data to your Datadog account.

Tools for collecting etcd metrics and logs

In Part 1 of this series, we looked at how etcd works and the role it plays in managing the state of a Kubernetes cluster. We also explored key etcd metrics you should monitor to ensure the health and performance of your etcd cluster. In this post, we’ll show you how you can use tools like Prometheus, Grafana, and etcdctl to collect and visualize etcd metrics. We’ll also show you how to collect etcd logs that provide context for those metrics.

Key metrics for monitoring etcd

Etcd is a distributed key-value data store that provides highly available, durable storage for distributed applications. In Kubernetes, etcd functions as part of the control plane, storing data about the actual and desired state of the resources in a cluster. Kubernetes controllers use etcd’s data to reconcile the cluster’s actual state to its desired state. This series focuses on monitoring etcd in Kubernetes.
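
To make the reconciliation idea concrete, here is a minimal sketch in Python. It is not Kubernetes or etcd code: plain dictionaries stand in for the desired and actual state that would be stored in etcd, and the `reconcile` function, the action names, and the workload names are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical action record: (action name, workload name, replica count delta)
Action = Tuple[str, str, int]


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Compare desired and actual replica counts and return the actions
    needed to move the actual state toward the desired state."""
    actions: List[Action] = []
    # Visit every workload that appears in either state, in a stable order.
    for name in sorted(desired.keys() | actual.keys()):
        want = desired.get(name, 0)  # a workload missing from desired should not run
        have = actual.get(name, 0)   # a workload missing from actual is not running
        if want > have:
            actions.append(("scale_up", name, want - have))
        elif want < have:
            actions.append(("scale_down", name, have - want))
    return actions


if __name__ == "__main__":
    # In a real cluster these values would be read from etcd through the
    # Kubernetes API server; here they are hard-coded assumptions.
    desired_state = {"web": 3, "worker": 2}
    actual_state = {"web": 1, "worker": 2, "legacy": 1}
    for action in reconcile(desired_state, actual_state):
        print(action)
```

A real controller runs this kind of comparison continuously as state changes, rather than once, and applies the resulting actions through the API server.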

Monitor the Windows Registry with Datadog

The Windows Registry is a centralized key-value database that stores permissions, user data, and configuration settings for the Windows operating system and many Windows native applications. The keys stored in the registry provide a granular view into the processes occurring on a Windows host, such as certificate expirations, security checks, and pending reboots.

Measure long-term user engagement with Datadog Retention Analysis

It’s relatively easy to study the immediate impact of new releases by analyzing short-term changes in user behavior or system activity. However, this information doesn’t tell you much about the long-term viability of your application, which depends less on the novelty of major application updates and more on sustained usability.

Centralize, triage, and track tickets with Datadog Case Management

Complex systems require many different monitors to assess the health of their infrastructure and applications, creating a wealth of alerts that can be hard to track. Due to a lack of effective triage processes, many organizations page engineers for every alert that comes in, making it difficult to separate false positives from issues that actually require immediate attention.

Analyze the root causes and business impact of production issues with Trace Queries

Tracing provides indispensable insights into the state and performance of distributed applications, but it can often be difficult to determine the root cause or ultimate business impact of issues indicated by traces. Translating visibility of individual microservices into broader performance insights often requires drawing complex correlations between spans. This can be a laborious process, which can complicate everything from troubleshooting and triage to tracking KPIs and managing costs.

Quickly spot and revert faulty deployments with Change Overlays

Faulty deployments and other types of erroneous changes may account for around 70% of all application outages. With the prevalence of CI/CD workflows, engineering teams make changes to their applications, services, and infrastructure all the time, which can make it difficult to trace issues to specific changes.

Monitor Windows Performance Counters with Datadog

The Windows operating system exposes metrics such as CPU, memory, and disk usage as built-in performance counters, which provide a unified way to observe performance, state, and other high-level facets of Windows subsystems, components, and native or third-party applications. As such, Windows Performance Counters can be invaluable for monitoring resource usage and the health of your infrastructure, as well as the systems your services rely on.