
Designing for Failure: Choosing the Right Level of Redundancy, Resilience, and Control

Outages don't care how many zones you have. Power failures, software updates, and backbone disruptions all have one thing in common: they do not respect architecture diagrams. Redundancy only works if it is designed at the correct layer. Every team believes they are covered, and yet, when something breaks, the failure reveals that what looked like protection was only an illusion.

Supercharge Developer Productivity with the New Harness Code Experience

Smarter PR Reviews: Inline comments, keyboard shortcuts, and faster diffs reduce context switching.
Optimized for Scale: Instant file tree and change listing performance even in large monorepos.
Seamless Navigation: Effortlessly move between branches, commits, and repos without losing context.
Unified Design System: A consistent, intuitive UI across the entire Harness platform.

At Harness, we know developer velocity depends on everyday workflow.

What Are Kubernetes Nodes? Everything You Need To Know

A key advantage of Kubernetes for container management is its high scalability. Kubernetes nodes are directly involved in this, and they can significantly impact your efficiency, cost-effectiveness, and service availability. This guide provides an in-depth look at Kubernetes nodes, including types of nodes and operational best practices.

The Anti-Zombie, Battle-Tested Guide To AI FinOps: 10 Insights

When CloudZero’s CTO Erik Peterson joined the FinOps Weekly podcast in October 2025, he didn’t hold back. Instead of repeating the usual best practices for AI cost optimization, he challenged how we approach AI spending. From “zombie AI experiments” eating your budget to why you should stop apologizing for using AI, these 10 insights from the podcast are worth considering.

How NRP Scales Global Scientific Research with Calico

The National Research Platform (NRP) operates a globally distributed, high-performance computing and networking environment, with an average of 15,000 pods across 450 nodes supporting more than 3,000 scientific project namespaces. With its head node in San Diego, NRP connects research institutions and data centers worldwide via links ranging from 10 to 400 Gbps, serving more than 5,000 users in 70+ locations.

Towards specialized efficient LLMs: Data Scaling Laws and Sparse Adapters

Welcome to AI Research Bites. This series of short, informative talks showcases cutting-edge research from the ServiceNow AI Research team. AI Research Bites are open to all, especially those interested in keeping up with the fast-paced AI research community.

Kentik in Motion: Unlocking the Power of Data Explorer

Kentik Data Explorer is the heart of Kentik, where raw network telemetry is transformed into actionable insights. Yet many users don’t realize just how much they can do with it, or how Data Explorer connects to other parts of the Kentik platform. In this session, we walk through the fundamentals of using Data Explorer effectively, provide real-world examples, and highlight how it ties into workflows such as alerting, dashboards, and troubleshooting.

Baking in site reliability with observability and AI: How SpotOn uses Grafana Assistant to keep restaurants running

When you operate a restaurant, the last thing you want to do is shut your doors and turn away guests and staff because of some technology failure. And if you’re the one providing that tech, it’s your job to make sure that doesn’t happen. “For us, observability is about a lot more than just dashboards and alerts.