Mastering waits and timeouts in Playwright

If you have written any kind of end-to-end or UI tests, you probably know that the greatest headache is test flakiness caused by browser actions not behaving the way you expect. This flakiness can be a major bottleneck, especially in CI/CD pipelines, where it leads to constant failures.

Reducing Alert Noise with Composite Alerts in Hosted Graphite

Traditional alerts are simple by design: if a metric crosses a threshold, fire an alert. While that simplicity makes alerts easy to configure, it also leads to alert noise, because single metrics rarely tell the full story and often trigger during non-actionable conditions. Hosted Graphite Composite Alerts solve this by allowing you to combine multiple alert conditions using logical expressions like AND (&&) and OR (||).
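To make the idea concrete, here is a minimal Python sketch of how combining threshold conditions with `&&` and `||` cuts noise. It is not Hosted Graphite's API or expression syntax: the `Condition` class, the `evaluate` helper, the metric names, and the thresholds are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One traditional threshold check (hypothetical): fires when value > threshold."""
    metric: str
    threshold: float

    def firing(self, values: dict[str, float]) -> bool:
        return values[self.metric] > self.threshold


def evaluate(expression: str, conditions: dict[str, Condition],
             values: dict[str, float]) -> bool:
    """Evaluate a flat expression such as 'a && b || c'.

    '&&' binds tighter than '||'. Parentheses are not supported in this sketch.
    """
    return any(
        all(conditions[name.strip()].firing(values) for name in group.split("&&"))
        for group in expression.split("||")
    )


# Assumed example conditions; metric names and limits are made up.
conditions = {
    "a": Condition("cpu.percent", 90.0),
    "b": Condition("requests.latency_ms", 500.0),
}

# CPU is high but latency is healthy: the composite alert stays quiet.
quiet = evaluate("a && b", conditions,
                 {"cpu.percent": 97.0, "requests.latency_ms": 120.0})

# Both conditions hold: the composite alert fires.
noisy = evaluate("a && b", conditions,
                 {"cpu.percent": 97.0, "requests.latency_ms": 820.0})

print(quiet, noisy)
```

A single-metric alert on CPU alone would have fired in both cases. Requiring both conditions limits the alert to the case where high CPU coincides with a user-visible symptom.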

Why AI Automation for ITOps Needs Context Graphs

AI automation in ITOps fails because execution loses decision context; context graphs turn incident history into durable execution memory that systems can actually reuse. The automation remembers what it did, but not why. Fixing an issue depends on what was tried last time, what failed, what worked, which exceptions were approved, and under what conditions. That information rarely lives in the system.

Green dashboards, red flags

A VP of Engineering (from a company I’m not allowed to name) told me recently: "You helped us find and fix real user-facing issues. Now we need to convince our CTO why that matters more than the standard SLOs and systems." Here's the thing: your CTO is not wrong in measuring the systems and basic uptime. That’s the baseline though. They’re all trying to watch everything, but they’re seeing nothing as it relates to users.

The Hidden Cost of Idle Assets: How Poor Asset Performance Leaves 30% of Enterprise Assets Unused

Most enterprises believe that once an asset is purchased and recorded, its value is automatically realized. In reality, the opposite is often true. Poor asset performance silently erodes budgets, reduces operational agility, and creates long-term inefficiencies that remain hidden for years. Studies consistently show that nearly 30 percent of assets owned by large organizations remain unused or severely underutilized.

What is HEAL Monitoring Tool? A Comprehensive Guide for IT Leaders

Your organization has invested heavily in monitoring tools for application performance, infrastructure monitoring tools for servers and databases, log monitoring tools, network monitoring tools, and third-party monitoring tools for specific services. But the actual problem is that your IT team is drowning in the data those tools produce. A single production issue generates 30+ alerts across applications, databases, servers, and monitoring tools, creating an alert flood that buries the underlying issue.

When Things Go Wrong, Systems Should Help Humans - Not Fight Them

In the previous post, we explored how AI accelerates delivery and compresses the time between change and user impact. As velocity increases, knowing that something has gone wrong before users do becomes a critical capability. But detection is only the beginning. Once alerts fire and dashboards light up, humans still have to interpret what’s happening, make decisions under pressure, and act.

Easily Map Logs to OCSF with Datadog Observability Pipelines

Normalizing security logs into the Open Cybersecurity Schema Framework (OCSF) is often complex, manual, and time-consuming. With Datadog Observability Pipelines, you can easily transform logs into OCSF format, right in your own environment, before routing them to destinations like Splunk, CrowdStrike, and AWS Security Lake. This video shows how security teams can use Observability Pipelines to: Collect, process, and transform logs into OCSF format automatically.