Operations | Monitoring | ITSM | DevOps | Cloud

Optimize and troubleshoot AI infrastructure with Datadog GPU Monitoring

As organizations bring more AI and LLM workloads into production, the underlying GPU infrastructure that supports these workloads becomes even more critical in ensuring these workloads remain fast, reliable, and scalable. Inefficient GPU resource usage, for instance, can lead to longer runtimes and reduced throughput, negatively impacting overall model performance. Additionally, idle and underutilized GPUs can quickly drive up costs and lead to needless spending.

How to Monitor Kafka Producer Metrics

Your Kafka producer pushed a million messages yesterday. Nice. But can you tell if they all made it? Or why did latency spike at 2 PM? Producer metrics help you determine that. They expose how long messages take to send, whether messages are getting stuck, and whether retries are piling up. Let’s go over which ones help while debugging and how to monitor them.

The Future of IT Is Human + Agentic: How Zero Ticket IT Is Reshaping Tech Careers

Automation has always stirred up fears of job loss. For IT professionals, the conversation has only grown louder with the rise of AI. But the truth is that the future of IT is not about replacement—it’s about reinvention. For decades, IT has been defined by its firefighting: manually resolving tickets, managing endless alerts, and fielding repetitive service requests. These tasks are ripe for automation, but automation doesn’t eliminate the need for IT talent.

Apache Spark security: start with a solid foundation

Everyone agrees security matters – yet when it comes to big data analytics with Apache Spark, it’s not just another checkbox. Spark’s open source Java architecture introduces special security concerns that, if neglected, can quietly reveal sensitive information and interrupt vital functions.

Implementing Grafana Play privacy policies with Grafana k6: A behind-the-scenes look

Grafana Play is a free and publicly accessible sandbox environment that allows users to explore and learn Grafana without setting up their own instance. Grafana Play comes preloaded with ready-made sample dashboards, and showcases how to work with different data sources, create visualizations, and use advanced Grafana features.

Yes, Sentry has an MCP Server (...and it's pretty good)

Unless you’ve been living under a rock, “MCP” is probably a term you’ve heard thrown around in the AI space. Each of the editors and LLM providers have been racing to add and enhance their MCP support. Sentry was fortunate enough to be included in Anthropics release announcements for MCP.

Cisco and Splunk Strengthen Enterprise Digital Resilience in the AI Era

In an era where hybrid environments and AI-driven innovations redefine enterprise operations, organizations face increasing complexity, disruption, and vulnerability in their systems. To overcome this growing challenge, Cisco and Splunk are working together to harness the power of AI to help customers ensure that digital resilience is an inherent part of their systems.