Trace Google Pub/Sub workloads in Cloud Run with Datadog

Event-driven systems are great at decoupling services, but they also make incidents harder to untangle. A single user request can turn into dozens (or thousands) of messages, multiple consumers, retries, and delayed acknowledgments. If your tracing only tells you that a message was sent or received, you still have to guess which upstream request produced the message, whether a batch publish fanned out cleanly, and where queue time is accumulating.

How to optimize JavaScript code with CSS

When to use JavaScript or CSS in frontend projects is a matter of continued debate among many frontend developers. JavaScript is often the default choice for frontend development, as it offers a robust collection of libraries custom-made for creating advanced UI features, such as data-based visualizations or complex animations. But JavaScript also comes with tradeoffs, particularly when it comes to performance, accessibility, and code complexity.

Protect agentic AI applications with Datadog AI Guard

Organizations are increasingly using agentic AI applications powered by large language models (LLMs) to automate analysis, decision-making, and operational workflows. As these AI agents take on more responsibility, they gain access to internal tools and services and can interact with them in unintended ways.

How to choose the right on-call rotation

Choosing an on-call rotation is about finding a rhythm that balances your team’s well-being and your system’s reliability. The right on-call rotation helps prevent burnout and makes on-call duties sustainable over the long run. This guide walks you through different on-call rotation patterns, from daily rotations to after-hours rotations. We’ll look at why you might choose a particular rotation and the challenges that often come with it.

Why a month is too long to be on-call

There is often a temptation to stretch on-call shifts to a month or longer, especially when incident volume is low. The logic seems sound: if the phone rarely rings, it feels unnecessary to hand off on-call duties every week. But looking strictly at incident volume often misses the human side of the equation. Being on call isn’t just about answering pages. It is also a state of mind. Even when it is quiet, simply being on call can create fatigue of its own.

EasyVista Service Manager + SIGNL4

Modern IT service management platforms excel at structuring work: tickets, workflows, approvals, SLAs, and reporting. But when a major incident occurs, success depends on more than clean processes – it depends on how fast the right people are reached and respond. This is where EasyVista Service Manager (EVSM) and SIGNL4 work exceptionally well together.

OpenTelemetry Instrumentation Best Practices for Microservices Observability

OpenTelemetry instrumentation is the foundation of modern microservices observability, but getting it right in production requires more than just enabling auto-instrumentation. This guide covers production-tested OpenTelemetry best practices that help engineering teams achieve reliable distributed tracing, control observability costs, and extract maximum value from their telemetry data.

How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation

This guide shows you how to implement OpenTelemetry’s auto-instrumentation for complete distributed tracing across your microservices, from initial setup through production optimization and troubleshooting.

How HVAC Companies, Contractors and Property Management Firms Use OnPage for Emergency Response

Over the past couple of weeks, as snowstorms and extreme cold swept across much of the Northeast, something interesting started happening on our end at OnPage. Our phones lit up. Not from healthcare or IT operations teams, where many people expect us to be used, but from HVAC companies, contractors, and property management firms scrambling to prepare for what they knew was coming.