
Optimize HPC jobs and cluster utilization with Datadog

High-performance computing (HPC) environments support some of the most critical workloads in the world—from asset pricing models in financial institutions to molecular simulations in drug discovery. These workloads often span hundreds of thousands of cores, depend on specialized infrastructure such as GPUs, and run for extended periods. As a result, performance and efficiency are critical.

Detect and map third-party outages with Datadog External Provider Status

Modern applications depend on dozens of external cloud platforms, APIs, and SaaS services to function. But when those providers experience issues, engineers often spend valuable time asking a basic question: Is the problem with us or with them? Provider-maintained status pages are often slow to update, leaving teams waiting for confirmation while incidents escalate. This delay wastes valuable time, prolongs investigations, and risks customer trust.

Building Intelligent Search: A Tutorial on Aiven for OpenSearch and Vertex AI

Aiven for OpenSearch is a fully managed service for running OpenSearch on Google Cloud. It is designed for companies that want to operate search applications without taking on the burden and complexity of managing the underlying cloud infrastructure themselves. On Google Cloud, the service is built on core infrastructure such as Google Compute Engine, Google Cloud Storage, and Private Service Connect.

The Hidden Risk of DNS - Lessons from the AWS Outage & Why You Need DNS Spy Monitoring NOW

On October 20, 2025, much of the internet came to a halt. Apps wouldn’t load. Payments failed. Cloud dashboards went dark. From Fortnite to Alexa, Snapchat, and countless business platforms, users across the world were suddenly offline — all because DNS broke inside Amazon Web Services’ (AWS) US-East-1 region.

4 Everyday IT Headaches You Can Eliminate with Enterprise IT Automation

Every IT operator, anywhere on the team ladder, dreads this feeling: another day, another flood of service desk tickets. Like cockroaches, they come in waves and they’re repetitive. Worse still, they distract your teams from higher-value work. Ironically, given how much disruption they cause, most of these tickets are not complex incidents or novel challenges. They’re the same everyday IT headaches your enterprise has been dealing with for years.

Build Vs. Buy? Why Creating Your Own Cost Management Platform Is Futile

The siren song of building a custom, internal cloud cost management platform is enticing. Many brilliant engineering teams are convinced they can come up with a bespoke solution that perfectly fits their needs. They look at their company’s unique infrastructure and decide they can DIY cost management without having to rely on an external vendor. Believe me, I get the temptation.

Amazon Isn't Eating Its Own DNS Dog Food

On October 19-20, 2025, Amazon Web Services (AWS) experienced a significant outage (AWS status) affecting its US-EAST-1 region in northern Virginia. The root cause was DNS resolution failures for DynamoDB’s API endpoints, which cascaded across AWS’s interconnected services, disrupting major platforms including Snapchat, McDonald’s, Disney+, Roblox, Coinbase, Reddit, and Amazon’s own services.

How WWT Proves the Value of Agentic AIOps with LogicMonitor's Edwin AI

Agentic AI has entered day-to-day operations. Systems with the ability to act, learn, and adjust are already cutting noise, speeding remediation, and giving engineers time back for work that moves the business. In a recent webinar, Karthik SJ, General Manager, AI at LogicMonitor, and Mike Cervasio, Global Practice Manager, AIOps at World Wide Technology, explored what makes this new phase of AIOps actionable.