The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Error Budget in SRE: The Complete Guide (2026)

An error budget is the acceptable amount of unreliability permitted by your SLO over a defined time window. It is not a target. It is not a stretch goal. It is a hard ceiling that, when breached, should trigger a pre-agreed organizational response: feature freezes, postmortems, or infrastructure investment. The formula is blunt:

Error Budget = 1 - SLO Target
Error Budget (time) = (1 - SLO Target) × Window Duration

For a 30-day window:

That last number should make you uncomfortable.
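The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not code from the original guide: the function names `error_budget_fraction` and `error_budget_time` and the sample SLO targets are assumptions chosen for the example.

```python
from datetime import timedelta


def error_budget_fraction(slo_target: float) -> float:
    """Error Budget = 1 - SLO Target, with the target given as a fraction (e.g. 0.999)."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    return 1.0 - slo_target


def error_budget_time(slo_target: float, window: timedelta) -> timedelta:
    """Error Budget (time) = (1 - SLO Target) x Window Duration."""
    budget_seconds = error_budget_fraction(slo_target) * window.total_seconds()
    return timedelta(seconds=budget_seconds)


if __name__ == "__main__":
    window = timedelta(days=30)
    # Illustrative targets only; substitute your own SLOs.
    for slo in (0.99, 0.999, 0.9999):
        budget = error_budget_time(slo, window)
        minutes = budget.total_seconds() / 60
        print(f"SLO {slo:.2%}: {minutes:.1f} minutes of budget per 30 days")
```

With a 99.9% target over a 30-day window, this works out to roughly 43.2 minutes of permitted unreliability.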

Honeycomb Canvas: The Multiplayer Workspace for the Agentic Era

Last week, we launched a major update to Canvas, our investigation workspace. The new Canvas has evolved from an AI co-pilot you chat with to a place where your whole team, human and agent, can work the same problem on the same surface. Auto-investigations begin the moment a trigger, SLO, or anomaly fires. Custom skills encode your team's runbooks so every agent investigates with your team's expertise built in.

Introducing Atatus Sensitive Data Classifier

Your logs know too much. Every debug statement, every traced request, every APM span can carry the risk of capturing something it shouldn't. A customer email. A JWT token. A credit card number. An API key that was never meant to leave your payment service. It doesn't look like a breach. There's no alert. Your observability platform just quietly accumulates sensitive data: indexed, replicated, and accessible to every engineer with log query access.

How we made a SQL query optimization agent 59% more accurate using autoresearch and LLM Observability

Without experiment infrastructure to help you test your LLM applications, every research session starts with the same questions: What have we tried previously? What were the numbers? Which prompt version produced that result? Why did we discard that approach? The answers live in scattered notes, terminal history, and half-remembered conversations. Each handoff between sessions loses context. In practice, iteration can slow down as teams get bogged down in testing and analysis.

How to audit and clean up monitors effectively

Alert fatigue and blind spots develop together. Monitoring stacks that generate noise while missing critical issues may have incomplete coverage or poorly configured alerts. As these stacks grow reactively, without structured coverage assessment, both issues worsen. Teams will often add monitors when something breaks and tune thresholds when alerts become unbearable, but rarely audit their overall setup to see if it works.

AI Powered IT Operations & Autonomous Resilience | Full SolarWinds Day Q2 2026 Event Replay

Watch the full SolarWinds Day 2026 event on-demand and discover how AI is transforming IT operations, observability, and incident response. In this exclusive event, SolarWinds CEO Sudhakar Ramakrishna and product leaders unveil the company’s vision for Autonomous Operational Resilience—powered by AI, automation, and unified visibility across hybrid and multi-cloud environments.

The $600 billion wake-up call: New Splunk research reveals downtime is a systemic business crisis

$600 billion annual impact: Aggregate downtime costs for the Global 2000 have soared 50% in two years. $15,000 per minute: The average cost of downtime for organisations, highlighting the immediate financial impact of service disruptions. 3.4% stock price drop: The average decline in shareholder value following a single downtime incident.

12 IT Infrastructure Best Practices Every IT Leader Should Follow

Why do IT infrastructure issues continue to slow down teams even when tools keep improving? In most IT environments, the challenge is not a single failure. It is a set of ongoing operational gaps that are easy to overlook but difficult to control over time. A few of the common challenges include: In 2026, IT environments are more distributed and change faster than before. Hybrid infrastructure, cloud adoption, and strict compliance requirements make consistency harder to maintain.