The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Fixing a production error with the Flare CLI and AI, from discovery to deploy

Using the Flare CLI and its agent skill to find, fix, and resolve a production error without leaving the terminal. The AI agent looks up the latest error on freek.dev via the Flare CLI, analyzes the stack trace against the local source code, generates a fix, deploys it using bash mode, and marks the error as resolved in Flare.

Observability Self-Hosted 2026.1 - Server Configuration Comparisons

In this video, SolarWinds Evangelist Chrystal Taylor introduces server configuration comparisons, a new feature in Observability Self-Hosted 2026.1 and Server Configuration Monitor 2026.1. The key highlight is the ability to compare server configurations side by side, enabling users to identify differences in configuration files between nodes or against a defined ideal state. This new functionality aims to help users monitor configuration drift.

Incident Report: Exercises, Cleanups, and Evacuations

Every year, Honeycomb runs disaster recovery scenarios in multiple environments, including in production. Although each of our instances runs in a single region, across at least three Availability Zones (AZs), we have multiple plans for partial regional failures, particularly zonal failures. One of these tests was run on December 5th, and its cleanup steps followed its successful completion.

Alerting Is a Socio-Technical System

In the previous posts, we’ve looked at how alert noise emerges from design decisions, why notification lists fail to create accountability, and why alerts only work when they’re designed around a clear outcome. Taken together, these ideas point to a broader conclusion: alerting is not just a technical system; it’s a socio-technical one. Alerting systems encode assumptions about how people behave, how responsibility is distributed, and how decisions are made under pressure.

Catch Every Moment in Kubernetes: Splunk's Observability Advantage

Discover why real-time, unsampled observability is critical for Kubernetes environments with Stephane Estevez from Splunk at KubeCon Europe 2026. Learn how Splunk’s unique approach helps you catch every important moment—even when containers vanish in milliseconds. Watch now for expert insights on cloud-native monitoring, observability, and Kubernetes best practices!

Cut Costs, Not Visibility. Use S3 for Low-Cost Log Retention and Faster Response.

Why pay for continuous ingestion of data you rarely use? Learn how to maintain a lean data strategy by keeping long-term logs in cheap S3 storage, while retaining the power to "promote" specific slices into Splunk whenever an audit or investigation arises. See how Promote for Amazon S3 gives promoted data the speed of local indexing, so investigations stay fast.

AlphaFold, Office Politics, and Mustafa Suleyman's Two Futures (w/Benedict Lelijveld)

In this episode, Benedict Lelijveld joins us to unpack what it feels like to start a career in an era shaped by COVID disruption, hybrid work, and accelerating AI. We dig into his writing on Mustafa Suleyman and the idea of “pessimism aversion”: holding genuine hope for breakthroughs (from personal AI to advances in biology) while staying clear-eyed about risks like misuse, weak regulation, and who really benefits. Benedict also reflects on what early-career professionals lose when work becomes too remote—and why protecting your voice, curiosity, and craft matters more than ever as automation spreads.

Case Study - Troubleshooting Storage Failures in a VMware ESXi Infrastructure

IT problems happen even in the best-architected infrastructure due to configuration changes, failures, upgrades and such. How quickly and effectively you can detect and resolve such problems dictates how efficient your IT operation is. Today, I’ll cover how eG Enterprise helped us troubleshoot a hardware failure (a storage battery failure) that caused a cascade of failures in a VMware ESXi infrastructure.

Notes from the Field: XenServer falling back to file-based licensing when using LAS

Citrix has been transitioning products toward License Access Service (LAS) as the modern licensing method. Unlike traditional file-based licensing, LAS introduces service-based communication between products and the Citrix License Server. As of 15 April 2026, LAS becomes the mandatory licensing method for supported products. Environments still relying on file-based licensing will need to transition before that date.

Microsoft SCOM Tips & Tricks

This one is for all the Microsoft SCOM geeks out there — 99 practical tips & tricks to make managing SCOM way easier. The tips compiled here draw from community experts, SCOM-focused blogs, Microsoft’s official documentation, and the hands-on experience at NiCE. You may already know some of them, but having them all organized in one place makes it easy to reference and put them into practice.