Synthetic Monitoring for GraphQL Endpoints: Beyond the Query

GraphQL isn’t just another API protocol—it’s a new layer of abstraction. It collapsed dozens of REST endpoints into one flexible interface where clients decide what data to fetch and how deep to go. That freedom is a gift for front-end teams and a headache for anyone tasked with reliability. Traditional monitoring doesn’t translate cleanly here: a REST endpoint can be pinged for uptime, but a GraphQL endpoint commonly returns HTTP 200 even when the query behind it fails.
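That last point is what makes GraphQL checks different in practice: per the GraphQL spec, failures surface as an `errors` array in the response body, not as an HTTP status code. The sketch below shows one hypothetical way a synthetic probe could evaluate a response; the probe query and function names are illustrative, not from any particular monitoring tool.

```python
import json

# Minimal introspection query that most GraphQL servers accept.
PROBE_QUERY = "{ __typename }"

def evaluate_graphql_response(status_code: int, body: str) -> bool:
    """Return True only if the endpoint answered with usable data.

    A status-code check alone is not enough: GraphQL servers typically
    respond 200 and report failures in a spec-defined "errors" array.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if payload.get("errors"):
        return False
    return payload.get("data") is not None
```

A probe built this way would flag `{"errors": [...]}` responses as failures even though a naive HTTP uptime check would mark them healthy.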

Grafana Mimir 3.0 release: performance improvements, a new query engine, and more

In 2022, we introduced Grafana Mimir, our open source, horizontally scalable, multi-tenant time series database (TSDB) designed for long-term storage of Prometheus and OpenTelemetry metrics. Over the years, Mimir has become a go-to metrics backend within the open source community, with 30 project maintainers and more than 4.7k GitHub stars.

Stop the guesswork: Troubleshoot with confidence with process monitoring

IT infrastructure is vast, complex, and interdependent. At any point in time, businesses rely on thousands of servers running thousands of processes. Detecting server downtime is fairly easy—but true observability is when you know precisely which processes are working as intended and which are silently contributing to performance degradation. A failed database worker or a memory-leaking background service can silently drain resources until your most critical apps grind to a halt.
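One concrete signal for that “silently draining resources” case is a process whose memory rises on nearly every sample. The heuristic below is a hypothetical sketch of such a check; the function name and the growth threshold are illustrative assumptions, not taken from any specific product.

```python
# Hypothetical leak heuristic: a background worker whose resident memory
# grows between most consecutive samples is a leak candidate, even though
# the host itself still reports "up". The 0.9 threshold is illustrative.

def looks_like_leak(samples_mb: list[float], min_growth_ratio: float = 0.9) -> bool:
    """Flag a process if memory rose between most consecutive samples."""
    if len(samples_mb) < 2:
        return False
    rises = sum(1 for a, b in zip(samples_mb, samples_mb[1:]) if b > a)
    return rises / (len(samples_mb) - 1) >= min_growth_ratio
```

In practice a tool would feed this with per-process RSS samples collected over hours, and pair it with restart counts and CPU data before alerting.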

AI And Sustainability: Measuring The Impact Of The Generative AI Boom

Before 2022, Alex Hanna worked on Google’s Ethical AI team. Today, she’s the director of research at the Distributed AI Research Institute, a transition sparked by Google’s handling of a paper exposing AI’s growing environmental footprint. So, how bad is it, really? That depends on who you ask. Take Jesse Dodge, a senior research analyst at the Allen Institute for AI. Jesse told NPR that a single ChatGPT query can use as much electricity as keeping a light bulb on for 20 minutes.

The Outage Anxiety Test: Can You Answer These 3 Questions In Under 10 Minutes?

On Oct. 20, the Internet woke up and seemingly chose violence. For more than 12 hours, Amazon Web Services (AWS) went down. From banking platforms to hospital communications to mobile ordering apps, digital services came to a screeching halt. The cause? Two programs tried to write the same DNS entry simultaneously, both failed, and the entry was left blank. Thus began an incredibly costly failure cascade.
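The failure mode described above is a classic lost-update race. The snippet below is a deliberately simplified, single-threaded re-enactment of that shape of bug; the record names and conflict-handling behavior are illustrative assumptions, and the real AWS mechanism was more involved.

```python
# Two writers each snapshot a DNS-like record, then one commits first.
# The second writer sees a value it did not expect, treats it as a
# conflict, and clears the entry instead of committing its own value.

def concurrent_write(record: dict, key: str, value_a: str, value_b: str) -> None:
    # Both writers read before either commits (the racy interleaving).
    snapshot_a = record.get(key)
    snapshot_b = record.get(key)
    # Writer A validates against its snapshot (still matches) and commits.
    if record.get(key) == snapshot_a:
        record[key] = value_a
    # Writer B's snapshot is now stale; it aborts as a "conflict"
    # and leaves the entry blank instead of writing value_b.
    if record.get(key) != snapshot_b:
        record[key] = ""
```

The fix for this class of bug is well known: serialize the writers, or use an atomic compare-and-swap so a stale writer retries rather than clobbering the record.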

Accelerate your Azure integration setup with guided onboarding

Getting started with monitoring for Microsoft Azure environments can be a lengthy and manual process. Many tools require users to create app registrations, assign permissions, and enable log forwarding or telemetry data collection across multiple portals and scripts. These fragmented steps slow down onboarding and introduce opportunities for misconfiguration, making it harder for teams to quickly achieve full visibility.

Understand user experience through network performance with Datadog Synthetic Monitoring

When an application slows down or fails, pinpointing the cause isn’t always simple. Is it a backend regression, a misbehaving API, or a bottleneck somewhere deep in the network? Without full visibility, teams waste precious time troubleshooting across disconnected tools and layers. Datadog Synthetic Monitoring now supports Network Path to help you proactively identify whether user-facing issues stem from your code or from the underlying network.

Managing Alerts: Car Alarms and Smoke Alarms

Building and shipping an application is exciting: you watch your idea come alive and reach users. But once it’s out there, your real job begins: keeping it alive. An app in production isn’t just code running; it’s a living system. It needs monitoring to stay healthy and alerting to warn when something’s off. But there’s a catch: too few alerts, and you’ll miss real issues; too many, and you’ll drown in noise.

OTel Updates: Declarative Config - A Steadier Way to Configure OpenTelemetry SDKs

Application configs change over time, often in small ways that are easy to miss. They may start simple — a few environment variables, one exporter, nothing unexpected. As your instrumentation grows, you add rules for filtering health check spans, adjust sampling based on attributes, or introduce environment-specific resource settings. Each change makes sense on its own. But months later, the picture can look different across dev, staging, and production.
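The declarative approach replaces that drift-prone pile of environment variables with a single versioned file. The fragment below is a sketch of what such a file can look like; the schema version, field names, and endpoint are assumptions based on the evolving OpenTelemetry file-configuration schema, so check the schema your SDK release actually supports.

```yaml
# Hypothetical config.yaml -- one reviewable file instead of scattered
# OTEL_* environment variables. Field names may differ by SDK version.
file_format: "0.3"
tracer_provider:
  processors:
    - batch:
        exporter:
          otlp:
            endpoint: http://collector.internal:4317   # illustrative endpoint
```

Because the file lives in version control, a diff between dev, staging, and production configs is a code review away rather than an archaeology exercise across deployment scripts.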