Operations | Monitoring | ITSM | DevOps | Cloud

Actionable Network Device Monitoring with Automated Anomaly Detection and AI Troubleshooting

Network device monitoring is often a mess of polling, graphs, and alerts that don't lead to answers. In this webinar, we'll show how to monitor routers, switches, and firewalls in a way that quickly surfaces what matters: interface health, errors, drops, saturation, latency signals, and performance regressions—without drowning in noise. You'll learn how Netdata turns raw SNMP metrics into high-signal insights using automated anomaly detection and AI-assisted troubleshooting, so your team can move from 'something is wrong' to 'here's the root cause' faster.

Can We Still Trust the Code?

The "Velocity Gap" is real. AI tools like Claude and GitHub Copilot are pumping out code faster than ever, but there’s a catch: engineers don't trust it yet. We’re moving away from the old days of "clicking around" in a test environment, but how do we verify code at the speed of light? Ken breaks down why the future of QA isn't just "testing"; it’s simulation. Video collab with @ScottMooreConsultingLLC. Learn more: speedscale.com.

AI SRE in Practice: Resolving Node Termination Events at Scale

When a node terminates unexpectedly in a Kubernetes cluster, the immediate symptoms are obvious. Workloads restart elsewhere, services experience partial outages, and alerts fire across multiple systems. The harder question is why it happened and how to prevent it from recurring. This scenario walks through a node termination event where the entire node pool was affected, requiring investigation across infrastructure layers to identify root cause and implement lasting remediation.

GenAI Observability in Grafana Cloud: End-to-End Agent Debugging (Demo)

From Observability for GenAI Applications (Grafana OpenTelemetry Community Call): we drill into traces to see which agents called which tools, where errors occurred, how long each LLM call took, and how costs and tokens are distributed. The walkthrough also covers using AI assistance to summarize long traces and identify optimization opportunities in real time.

AI Hosting: The Colocation vs. Cloud Dilemma for Your Next Project

Organisations running AI workloads, like banks training fraud detection models, hospitals testing diagnostic tools, or manufacturers using predictive analytics, all face the same problem: hosting them is costly and resource-intensive. These workloads require dedicated GPUs running non-stop, vast amounts of data moving in and out, and far more power and cooling than a typical IT system.

AI in Production Is Growing Faster Than We Can Trust It

Enterprise software has moved past the generative AI testing phase. Businesses with millions of daily users or workloads are no longer just prototyping LLMs in a vacuum. They’re directly wiring agentic efficiency into product interfaces and infrastructure to stay competitive. This wave is often compared to the spread of microservices in the past, but we aren’t just adding new dependencies and complexity.

Engineering reliable AI agents: The prompt structure guide

The difference between an AI assistant that "almost" works and one that consistently delivers high-value results is rarely a matter of raw model capability. Instead, the bottleneck is typically the quality and structure of the instructions provided. For DevOps and SRE teams building automated workflows, "magical prompt tricks" are no substitute for a repeatable, engineered structure.

The Invisible Million Dollars and How AI Prevents Revenue Leakage

We have spent the last decade engineering our organizations for velocity. We optimized for "Land and Expand." We celebrated bookings. We built commercial architectures designed to intake revenue faster than we could operationalize it. In that era, operational friction was accepted as the cost of doing business. That era is over. The mandate has shifted from growth at all costs to efficient growth.