Operations | Monitoring | ITSM | DevOps | Cloud

Reports just got smarter

We’ve upgraded the Reports page in StatusGator to give you more insight directly inside the StatusGator dashboard. Previously, reporting was limited to exports you could use to calculate your own uptime percentages and trends. Now, in addition to exported reports, you can view key reports and metrics without needing to download anything. We’ve also added a one-click download of the most commonly requested report: Uptime percentage by monitor.

Improved Microsoft 365 private status integration

Keeping track of your Microsoft 365 services just got easier. We’ve rolled out an update to the Microsoft 365 integration that removes manual setup and improves visibility. All services in your account can now automatically appear as components, so you can monitor them right away.

Top tips: When "sounds right" isn't right

Top Tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re looking at why convincing AI answers can still be wrong and how to catch them before they slip through. AI doesn’t fail the way it used to. It doesn’t give obviously wrong answers. It gives answers that are just right enough to trust. And that’s exactly why we stop questioning it. It fits into our workflow so easily.
Sponsored Post

Understanding the Three Pillars of Observability: Logs, Metrics and Traces

Many people wonder what the difference is between monitoring and observability. While monitoring is simply watching a system, observability means truly understanding a system's state. DevOps teams leverage observability to debug their applications or troubleshoot the root cause of system issues. Peak visibility is achieved by analyzing the three pillars of observability: logs, metrics and traces. Depending on who you ask, some use MELT (metrics, events, logs and traces) as the four pillars of essential telemetry data, but we'll stick with the three core pillars for this piece.

Notes from the Field: Keyboard mapping issues with IGEL Linux endpoints on Windows Server 2025 VDAs

New Windows Server versions often introduce subtle behavioral changes that only surface when interacting with different endpoint types. In mixed environments where both Windows and Linux-based endpoints are used, these differences can become more apparent. The following case highlights an issue encountered when using IGEL Linux thin clients against Windows Server 2025 VDAs, where keyboard input behaved differently compared to Windows endpoints.

Setting Up Server Monitoring for a Rails App on Hatchbox

Owning your server stack shouldn't be a source of anxiety. Unfortunately, it often is, especially if you only pay attention to the problems you can feel in your gut: Is the app running? Is it throwing exceptions? Does it seem fast enough? These are great intuitive measurements, but just as a doctor uses diagnostics to catch high blood pressure before it becomes a crisis, you need deeper visibility to detect memory leaks, CPU spikes, and disk consumption before they bring your project to a halt.

Bindplane Now Ships With a Native AI Skill - Bring Your Own Agent

Today we're rolling out the Bindplane AI Skill, a built-in capability of the Bindplane CLI (v1.98+) that teaches your favorite AI coding tool how to work with Bindplane — natively, accurately, and without the setup headaches of traditional integrations. Read Part 2 of the Bindplane AI Skill series to learn more about how we built it and how it works with real-life examples.

Moving On From MCP: How We Built the Bindplane AI Skill

If you've spent any time wiring AI coding agents into developer platforms over the last year, you've probably reached for MCP. We did too. And after enough sessions watching context windows balloon and tool calls misfire, we started looking for something different. This is the story of what we built instead — a native AI skill for the Bindplane CLI — and the engineering decisions behind it.

From Context to Commitment

If service-centric observability provides the control layer, the next question becomes more urgent. What happens when organizations pair context with automation that operates inside clearly defined boundaries? During conversations at Nexus Live 2025, leaders did not describe automation as a futuristic aspiration. They described it as a necessary progression. However, the distinction they drew was important. Automation without context accelerates activity.

How to Test SQS Workflows Locally with LocalStack and OpenTelemetry

LocalStack lets you run SQS, Lambda, and S3 locally in Docker — but there's a hidden trap: OpenTelemetry's default AWS propagator doesn't work with free LocalStack. Here's how to set up end-to-end local testing with working trace propagation. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

VictoriaMetrics Virtual Meetup Q1 2026 - VictoriaMetrics Cloud Updates

VictoriaMetrics Cloud continues to mature as a secure, reliable, and cost-efficient observability platform. With PrivateLink now available across all regions, including Frankfurt, users can operate entirely without exposure to the public internet. Blue-green cluster deployments enable seamless, zero-downtime updates, while incremental backups ensure storage efficiency by capturing only what has changed. Operational visibility is improved with clearer alert states, showing Firing and Resolved conditions upfront. Security enhancements include stronger password policies and expanded authentication safeguards.

ActiveMQ MQTT Protocol Setup Guide: QoS, SSL, and IoT Scale

Modern enterprise architectures increasingly need to bridge the gap between resource-constrained IoT devices and heavyweight enterprise backend systems. ActiveMQ MQTT support makes this possible: devices running the MQTT protocol (sensors, actuators, edge nodes) publish telemetry on standard topics, while JMS-based backend services consume and process the data without any client-code changes.

Detect, Communicate, Resolve: Checkly's Agentic Workflow End-to-End

Coding agents are the fastest-growing audience for the Checkly CLI, and we're doubling down on them. In this session, Stefan hands Claude a real e-commerce app, lets it set up monitoring with `npx checkly init`, generate Playwright tests through MCP, and walk an actual alert end-to-end with Rocky AI in the loop.

What's New in InfluxDB 3 Explorer 1.8: Streaming Subscriptions, Smarter Sample Data, Line Protocol Validation, and Retention Controls

InfluxDB 3 Explorer 1.8 is all about writing data and keeping it under control. You can now subscribe to MQTT, Kafka, and AMQP streams directly from Explorer, generate custom sample datasets, stream live sample data continuously into your database, and validate your line protocol and preview the resulting schema before you write it. You can now also view and edit retention periods on both databases and individual tables.

Rollbar Pricing Explained: Plans, Features, and What You Actually Pay

You’re comparing error monitoring tools. You’ve narrowed it down to two or three options. Now you need to know what this actually costs before you bring it to your team. Here’s what Rollbar costs, what’s included at each tier, and how it compares to Sentry and Datadog on pricing. No sales pitch, just the math.

Faster fixes, less context sharing: how Grafana Assistant learns your infrastructure before you even ask

When an unexpected alert fires these days, most engineers' first move is to ask their AI assistant for help. You ask why your checkout service is slow and the assistant gets to work, but it can't get any meaningful insights, at least not quickly, without the proper guidance. So, the next thing you know you're sharing details about your existing data sources, the services you have running, how they connect, which labels and metrics matter, and on and on.

Why dashboards still matter in the age of AI

I recently gave a talk at Experts Live India 2026 about SquaredUp, and even before getting into the demo, there was one question I knew I had to address: Is the dashboard era over? It's something we're all hearing more. "Just ask AI." "Agentic AI will build your dashboards automatically." "Why bother with static views when a chatbot can answer anything?" It's a fair question. Answering it requires a clear understanding of what a dashboard represents.

Your Team is Using Claude Code. Do You Know What It's Costing You?

The first two weeks of Claude Code are exciting. The third week is when you realize you don’t have visibility into what it’s doing or what it’s costing you. You would not run a production service without metrics, logs, and dashboards or deploy an API without knowing its latency, error rate, or cost per request.

GitHub Outages 2025 - 2026: Reliability Analysis and Outage History

HashiCorp's co-founder Mitchell Hashimoto decided to move his Ghostty project off GitHub in April 2026 due to GitHub's reliability issues. He did this after 18 years of using GitHub, saying that GitHub "is no longer a place for serious work". GitHub has experienced a significant decline in reliability over the past 6 months, and Hashimoto is not alone in expressing this sentiment.

Coralogix and Atlassian: Full-Stack Observability Inside the Incident Workflow

Incident response has a well-known efficiency problem. The tools teams use to detect and investigate issues are often disconnected from the tools they use to manage and resolve them. Engineers spend a significant portion of each incident switching between platforms, assembling context that should already be at hand. Even when the data is available, correlating signals across user, app, infrastructure, and security events to pinpoint a root cause remains manual and slow.

Digitate Named a Leader in the IDC MarketScape: Worldwide AIOps 2026 Vendor Assessment

SANTA CLARA, Calif. - April 29, 2026 - Digitate, a global provider of agentic AI platforms that enable autonomous IT operations, today announced its recognition as a Leader in the IDC MarketScape: Worldwide AIOps 2026 Vendor Assessment (#US54116226, March 2026). The evaluation assessed vendors across the global AIOps market based on both current capabilities and forward-looking strategy.

How to monitor external SaaS service outages

Modern infrastructure is no longer just about what you build and run internally. Most DevOps and system administration teams rely on a growing number of external SaaS services, including cloud providers, monitoring tools, authentication systems, CI/CD platforms, communication tools, and more. When one of these services fails, your application may still look healthy internally, while users are already experiencing issues.

End-to-End Trace Propagation Across SQS and Lambda with OpenTelemetry

SQS doesn't propagate trace context automatically. You instrument both sides, deploy, and get two disconnected traces. This post shows how to wire them into one waterfall, and covers the ESM format gotcha that silently breaks it every time.
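As a rough illustration of the idea in this teaser, the trace context can be carried across the queue by hand in the message attributes. The sketch below is not the post's code: it uses only the standard library, builds a W3C `traceparent` value itself rather than calling the OpenTelemetry SDK, and the helper names (`inject`, `extract`, `child_traceparent`) are hypothetical. It accepts both the `StringValue` key used by the SQS API and the camelCase `stringValue` key found in Lambda SQS event records; the teaser does not say whether that difference is the gotcha it mentions.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Producer side: start a new sampled trace (illustrative helper)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def inject(traceparent: str, message_attributes: dict) -> dict:
    """Return a copy of SQS-style message attributes carrying the trace context."""
    attrs = dict(message_attributes)
    attrs["traceparent"] = {"DataType": "String", "StringValue": traceparent}
    return attrs


def extract(message_attributes: dict):
    """Consumer side: return (trace_id, parent_span_id), or None if absent/invalid."""
    entry = message_attributes.get("traceparent", {})
    # SQS API responses use "StringValue"; Lambda event records use "stringValue".
    value = entry.get("StringValue") or entry.get("stringValue") or ""
    match = TRACEPARENT_RE.match(value)
    return (match.group(1), match.group(2)) if match else None


def child_traceparent(trace_id: str) -> str:
    """Consumer span: keep the trace ID, mint a fresh span ID."""
    return f"00-{trace_id}-{secrets.token_hex(8)}-01"
```

Because the consumer reuses the producer's trace ID, both spans would land in the same trace; in a real service the OpenTelemetry propagator API would normally do the formatting and parsing.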

How to run a proof of concept that de-risks your monitoring decision

Part 3, key insights from a fireside chat with Chris Yates. Most database monitoring proofs of concept (POCs) answer the wrong questions. Here's how to structure a proof of concept that genuinely de-risks your vendor decision, with the questions to ask during the process. A POC is often treated as the final hurdle in vendor evaluation, but too often, it becomes theatre. A guided tour of the flashiest features, run by one person, under unrealistic conditions.

Two AI agents, one incident: Rocky AI comes to the terminal

A Playwright Check fails at 2 am. The login flow is broken. Until today, that alert triggered a human to get up, open the Checkly dashboard, copy Rocky AI's root cause analysis (RCA), and then tell an agent to get to work. There were two AI agents, one incident, and no way for them to talk to each other. The extended `checkly checks` and new `checkly rca` CLI commands close that gap. Your coding agent can now pull Rocky AI's analysis into its ongoing work, read the diagnosis, and go fix the code.

New in the Honeycomb Academy: Learn to Use the Honeycomb MCP

Two things happen when engineers first connect the Honeycomb MCP to their AI assistant. The first is the blank page problem. The Honeycomb UI gives you something to react to: a heatmap, a query builder, a trace to click into. An AI assistant gives you a cursor and nothing else. When you don't know where to start, that's a hard place to be. The second shows up right after you get past the first one. You ask a question, you get a confident-sounding answer, and you're not sure whether to trust it.

Sentry + Stripe Projects: From Zero to Error Monitoring in Two Commands

No signup form. No dashboard. No copy-pasting DSNs. Sentry is now a provider on Stripe Projects, which means you can provision a fully configured Sentry project — error monitoring, tracing, and session replay — straight from the CLI in two commands. In this demo, we walk through the full workflow: initializing a project, provisioning Sentry, upgrading and downgrading plans, using magic login to jump straight into your dashboard, and letting a coding agent (Claude Code) handle it all for you.

From Vibes to Signals: Observing Your AI Coding Workflow

Agentic coding tools like Claude Code and Codex have taken centre stage and inserted themselves into the critical path of software development. This shift has happened fast, and for most teams, the visibility hasn’t caught up. Until now we’ve been evaluating our vibe coding the same way – on vibes. You might say “this feels faster” or “that seems like a better approach”. That’s not going to scale.

Connecting Agents for Real-Time Root Cause Analysis with Checkly's Rocky AI

Rocky, Checkly's AI agent, monitors production sites and provides an analysis for every failing check. Previously, a coding agent couldn't access this analysis, leaving incidents and agents disconnected. Now, you can access all the analyses via the Checkly CLI (or API) and tell your coding agent, "Hey, I got a Checkly alert. Please investigate!" With Rocky's structured analysis delivered inline, the coding agent can start with a strong hypothesis, fix issues, and propose a PR in one session.

LiveTail: Real-Time Visibility for Active Telemetry

See how Mezmo LiveTail helps teams move from passive log search to active, real-time investigation. In this demo, you'll watch live telemetry stream across services and environments, identify emerging issues as they happen, and use real-time context to troubleshoot faster before signals are delayed, buried, or lost in the noise. LiveTail is part of Mezmo's Active Telemetry platform — built for platform engineers and SREs who need immediate visibility into what's happening across their stack right now, not after the fact.

How Mezmo Uses Active Telemetry for Faster AI Root Cause Analysis

AI-powered root cause analysis only works when the data going into the model is clean, relevant, and structured. In this demo, we show how Mezmo's Active Telemetry approach helps engineers and SREs move from noisy application errors to immediate clarity. Using a restaurant ordering application running in Kubernetes, we trigger a database connection pool exhaustion issue and walk through two ways to investigate it with Mezmo.

See how Mezmo's AI Assistant instantly pinpoints root causes

This video shows how Mezmo's AI Assistant turns noisy telemetry into clear answers when errors spike. By preprocessing data and surfacing only the most relevant patterns, Mezmo quickly identifies issues like database connection failures or resource shortages and delivers actionable recommendations. Watch how AI-powered root cause analysis helps teams troubleshoot faster and with confidence. Mezmo's AI Assistant is built for platform engineers and SREs who need fast, reliable root cause analysis across high-volume telemetry pipelines — without manually sifting through noise.

Meet AURA: The Open-Source Agent Harness for Production AI: Autonomous Incident Response Demo

Watch AURA autonomously respond to a production incident in real time—from building its reasoning context and querying PagerDuty and ClickHouse, to triggering a human-in-the-loop approval with the on-call SRE, to removing the stuck pod and validating remediation. Every behavior is defined in a simple config. AURA is Mezmo's AI-powered incident response agent built for platform engineers and SREs managing high-volume telemetry pipelines.

How Kotak811 Revolutionized Digital Banking Observability with Coralogix

Kotak811, the digital-first engine of Kotak Mahindra Bank, is a banking platform serving over 23 million users across India. Since its launch in 2017, Kotak811 has transformed into the bank’s primary growth driver, now accounting for 70% of all new customer acquisitions. The platform is widely recognized for offering a paperless, mobile-first experience, providing everything from instant zero-balance accounts to seamless UPI payments and investment tools.

Meet Auvik AI: Bringing Practical Intelligence to IT Operations

Across the IT industry, AI is being positioned as the next evolution of operations. But for many IT teams, AI still feels disconnected from the tools they rely on every day. Dashboards get smarter. Reports get faster. But workflows stay the same. Stuck in vendor silos or a CLI, IT teams have been looking for ways to bolt AI into workflows, but what often comes out is a Frankenstein-like web of APIs and MCP hosts. AI is meant to make life easier for IT teams – not make it more difficult.

How Auvik AI Solves the Biggest Challenges in IT Operations

Modern IT operations aren’t short on tools. Monitoring tools. Ticketing systems. Alerting platforms. Documentation repositories. Dashboards. Scripts. Runbooks. And yet, when something breaks, the workflow still looks strangely familiar: Somewhere along the way you’re asking yourself: Is the problem even here? This is the everyday friction of IT operations. Not the big outages. It’s the constant small mysteries that take far longer to solve than they should.

Securing the World's Biggest Machine: Critical Infrastructure, AI, and the Ethics of Innovation

What happens when decades of critical infrastructure experience meet today’s rapidly evolving AI landscape? In this episode, host Bob Slevin sits down with Ernie Hayden, award-winning author, former Navy nuclear officer, ethical hacker, and founder of 443 Consulting, for a deep dive into what it truly takes to secure modern, interconnected systems.

Two commands to Sentry: now on Stripe Projects

Two commands. That’s how little it takes to go from nothing to a fully configured Sentry project with error monitoring, performance tracing, and session replay. No signup form. No email verification dance. No dashboard tab-switching to copy-paste a DSN into your .env. Your account is created, your project is provisioned, and five environment variables land in your working directory, ready for your SDK to pick up. And if you’re using a coding agent?

Sentry's integration with Perforce is now generally available

If you work in game development, VFX, or any industry dealing with large binary assets, chances are your codebase lives in Perforce P4. It’s the version control system behind some of the biggest games and creative projects in the world — and until now, it’s been one of the last major SCMs without first-class Sentry support. Today, we’re changing that. The Sentry + Perforce P4 integration is now generally available for all Sentry organizations.

How Monitoring Tools Enhance Visibility Across Digital Platforms

There is growing confusion about how much monitoring a business needs to do. As businesses enter new digital platforms to reach customers, they also need to establish monitoring of those new platforms in order to be successful. Of course, there are new digital platforms every day, including cloud services, websites, social media hubs and other customer service channels. While many of these platforms are always on, always collecting data for a business to mine, there is little in organization or technology to suggest that one person could monitor all of these platforms manually.

5 Best SOC 2 Continuous Monitoring Tools for SaaS: Closing the 20% Manual Evidence Gap

Landing a big-logo customer feels great, until their security questionnaire hits your inbox. For most B2B SaaS teams, SOC 2 compliance is the roadblock. You connect a tool, dashboards turn green, and then things stall: about 20% of evidence still needs screenshots, sign-offs, or frantic Slack chases. That last-mile grind drags engineers back into spreadsheets just when the audit seems done.
Sponsored Post

"Proactive Insights for a Reactive World": What Makes Collective IQ Different for Business Leaders

From a business executive's perspective, the core question is not how many metrics a tool collects, but how clearly it connects technology to business productivity, cost, and risk. Dave Wagner summarizes this nicely: "if you're a business leader, what's really powerful about Collective IQ is it's not just technology metrics, it's productivity metrics."
Sponsored Post

Cost Control in SAP BTP: The Critical Need for Automation

The cloud is the cheapest processing you can buy... until you get the bill! Unfortunately, cloud service costs are notoriously opaque when it comes to transactional and operational costs. The result can be unexpected bills and even damage to the ROI of cloud programs. SAP BTP is no exception, but it doesn't have to be this way. Good FinOps discipline is readily available for BTP, and beyond avoiding "bill shock", such monitoring is just good operational hygiene, preserving budget and resources for productive investment.

Context-Driven AI You Can Trust: How Edwin AI Earns Confidence in Production

Most legacy AIOps investments underdeliver because the AI lacks context, not capability. LogicMonitor’s latest innovations expand Edwin AI’s contextual intelligence across every dimension, so recommendations are accurate, explainable, and trusted by the teams that need to act on them. Reduce incident resolution time with AI that understands your environment—not just your alerts.

LogicMonitor Advances Autonomous IT with No Blind Spots, Trusted AI, and Closed-Loop Action

LogicMonitor’s latest innovations span the entire platform to deliver the operational foundation enterprises need for Autonomous IT—complete visibility from infrastructure to end user, AI that reasons in full context, and closed-loop automation that moves from detection to resolution. Over 90% of organizations rely on at least two to three monitoring solutions—and many enterprises operate five or more.

Monitoring Sidekiq Job Performance with AppSignal

When my Sidekiq job starts failing or slowing down, I often feel frustrated, especially if I don’t know how to fix it. If you’re using Sidekiq to run your background jobs, you know what I’m talking about. It’s a vital element of your stack, handling everything from data exports to password reset requests. It runs silently in the background, and most of the time, you’re not even giving it a second thought.

Getting Started with Home Assistant Webhooks & Writing to InfluxDB

If you’re already running or are familiar with Home Assistant, you’ve likely worked with integrations, maybe a few automations, and possibly MQTT as a way to wire devices together. But webhooks add another layer of flexibility that lets you level up your smart home into a fully-customized, intelligent network. Instead of relying on built-in integrations and being confined to the same local network, you can let external devices and services push events directly into Home Assistant.

Service-Centric Observability as the Control Layer

If distributed architectures have altered how systems degrade, then the way organizations model operational risk must evolve accordingly. Threshold monitoring evaluates individual metrics. Correlation clusters related alerts. Neither, on its own, explains how instability in one component alters exposure across an interconnected service landscape. In conversations at Nexus Live 2025, ScienceLogic’s annual customer conference, leaders described this distinction with clarity.

Why Runtime Visualization Is the Missing Link in Teaching Real-Time Systems

Guest blog by Florent Goutailler, Associate Professor, Télécom Saint-Etienne, France. Teaching real-time embedded systems has always involved a fundamental challenge: the most critical behaviors – task scheduling, timing, and concurrency – are largely invisible at runtime. When students begin working with a real-time operating system such as FreeRTOS, they are introduced to concepts like scheduling, task prioritization, semaphores, and inter-task communication.

Secure performance testing at scale: Introducing secrets management for Grafana Cloud k6

To simulate real user behavior, performance tests often rely on API keys, tokens, or credentials to interact with real systems. But as your testing suite grows, this sensitive data can start to sprawl across scripts, configs, and environments, increasing the risk of exposure and making tests harder to manage and maintain. To address this challenge, we’re rolling out secrets management for Grafana Cloud k6, the fully managed performance testing platform powered by k6 OSS.

Get observability in the terminal, for you and your agents, with the gcx CLI tool

The way you write code is changing, which means the way you observe your systems and respond to issues needs to change, too. Engineers today spend much of their day working via command line, as agentic tools like Cursor and Claude Code have become highly effective at handling many day-to-day engineering tasks. This greatly accelerates code generation, but it doesn't solve for the context switching that comes when you have to jump into another tool that's not part of this new, faster workflow.

Icinga 2 Meets OpenTelemetry: Native Metrics Export in v2.16

The OTLPMetricsWriter is a new Icinga 2 feature available since v2.16 that exports check plugin performance data as OpenTelemetry-compliant metrics via the OTLP HTTP protocol. With a single configuration object, it connects Icinga 2 to any OTLP-compatible backend like Prometheus, Grafana Mimir, Datadog, Elasticsearch, VictoriaMetrics, and more.

Digitate is Positioned as a Leader in the IDC MarketScape: Worldwide AIOps 2026 Vendor Assessment

IT operations are in a new era – teams are expected to deliver always-on reliability, absorb constant change, manage runaway telemetry volumes, and still prove business impact. The IDC MarketScape: Worldwide AIOps 2026 Vendor Assessment (doc, March 2026) offers ITOps leaders a valuable lens on the AIOps landscape and the providers shaping what comes next.

State of Observability in Financial Services 2026: From implementation to business impact

The demands on financial services companies are intensifying rapidly. They must not only deliver seamless system performance but also control costs, secure sensitive data, and maximize the value of their observability investments. To navigate these converging pressures, leaders are evolving their approach to system monitoring and telemetry. The 2026 State of Observability in Financial Services research report reveals a fundamental shift in how organizations manage their digital infrastructure.

last9-genai: Closing the Conversation Gap in LLM Observability

OpenTelemetry's GenAI instrumentation gives you spans and token counts. It does not give you conversations, workflow cost rollups, or prompts visible in your dashboard. last9-genai is an OTel extension that fills those three gaps without replacing your existing observability stack.

How to Exclude Health Check Endpoints from Python OTel Traces

Health check endpoints generate thousands of identical, useless spans per day. Here are two production-ready approaches to filter them from your Python OTel traces, and the correctness trap most implementations miss.
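To make the filtering idea concrete, here is a minimal standard-library sketch of the decision such a filter has to make. It is not either of the post's two approaches and does not call the OpenTelemetry SDK; the path list and the `should_trace` name are hypothetical. It matches the normalized path exactly rather than by substring, so a route that merely starts with "/health" is still traced. Whether that is the trap the post has in mind is not stated in this teaser.

```python
from urllib.parse import urlsplit

# Hypothetical set of probe endpoints; adjust to the routes your app exposes.
HEALTH_PATHS = frozenset({"/health", "/healthz", "/ready"})


def should_trace(target: str) -> bool:
    """Return False for health-check requests, True for everything else.

    `target` may be a bare path ("/healthz?verbose=1") or a full URL.
    """
    # Drop the query string and any trailing slash before comparing.
    path = urlsplit(target).path.rstrip("/") or "/"
    # Exact match: "/healthcare/records" must not be dropped by accident.
    return path not in HEALTH_PATHS
```

A predicate like this could sit behind a custom sampler or a request hook, which is where the dropped spans would actually be decided.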

Apache ActiveMQ High Availability Architecture: The Complete 2026 Guide

The most common Apache ActiveMQ high availability mistake is not a configuration error; it is a false assumption. Teams deploy two broker instances, point clients at both with a comma-separated URL, and label the topology "HA." Then the primary crashes, the secondary does not have the message state, and clients start throwing exceptions while the ops team scrambles.

Demo - Selector Platform Actionable Correlation

See how Selector turns fragmented alerts into actionable insight through intelligent correlation. In this demo, watch how events from across the environment are automatically connected, reducing noise and revealing the true root cause behind incidents. Instead of chasing isolated alerts, teams get a single, clear view of what’s happening and what to do next - faster. Built for network and operations teams who need to cut through noise and resolve issues with confidence.

Demo - Selector Platform Dashboard Validation

See how Selector enables real-time validation and visibility through customizable dashboards. In this demo, watch how teams can quickly monitor network and system performance, validate changes, and track key metrics - all in one unified view. Instead of piecing together data across tools, Selector delivers clear, actionable insights that help teams stay aligned and make faster decisions. Built for network and operations teams who need instant visibility and confidence in their environment.

Demo - Selector Platform CoPilot Diagnosis

See how Selector’s AI Copilot accelerates issue diagnosis in real time. In this demo, watch how natural language queries and AI-driven insights help teams quickly analyze incidents, surface root cause, and understand impact - without digging through multiple tools. Instead of manual investigation, Selector guides operators to answers faster, reducing noise and speeding up resolution. Built for network and operations teams who need clarity, speed, and smarter troubleshooting.

Demo - Selector Platform NOC Operator Workflow

See how Selector transforms NOC operations in real time. This demo walks through a typical workflow - from ingesting massive volumes of network and system data to automatically detecting anomalies, correlating events, and pinpointing true root cause. Instead of chasing alerts across siloed tools, Selector delivers a single, intelligent view - reducing noise, highlighting impact, and accelerating resolution.

The New Kubernetes Monitoring Experience in Splunk Observability Cloud

In this video, I walk through the three main pieces of the new Kubernetes monitoring experience in Splunk Observability Cloud: the Kubernetes overview page for monitoring the status and top issues across your environment, the Kubernetes Entities page for troubleshooting individual instances with correlated metrics, logs, events, and configuration, and the Workload Optimization view for getting actionable recommendations on your CPU and memory resource allocation.

What "AI-Ready Data" actually means for observability teams

Many organizations deploying AI are learning similar lessons right now: the challenge isn’t this or that AI model, it’s the data. According to Gartner, 60% of AI projects will be abandoned by organizations because of failures to support these projects with AI-ready data. Also, 63% of organizations either lack or aren’t sure they have the right data management practices to get there.

Misconfigured Alert Detection: Find the Alerts That Need Tuning

Netdata ships with hundreds of stock alerts. They cover a wide range of infrastructure conditions and they’re designed with sensible defaults. But “sensible defaults” and “correct for your environment” are not the same thing. A CPU threshold that’s perfectly reasonable for a build server might generate constant noise on a machine running batch jobs.

Certificate Discovery, Monitoring and Reporting | WhatsUp Gold 2026.0

Discover how WhatsUp Gold helps you identify and monitor certificates to reduce security risks, stay compliant, and avoid outages caused by expired or improperly configured certificates, featuring the latest reporting enhancements available in WhatsUp Gold version 2026.0.

Introducing Seer Agent: The answer is already in Sentry. Now you can ask for it.

This is a story about an engineer’s night that could have been bad, but ended up… not so bad. A few weeks ago, on a Saturday, our AI debugger, Seer, started failing. Note the big scary spike on the right. The errors were generic failures from the LLM calls, nothing that pointed at a root cause. Most of the team wasn’t scheduled to be on this weekend, and it just so happened Indragie, our Head of AI, was online. He started paging engineers.

LogicMonitor Advances Autonomous IT with No Blind Spots, Trusted AI, and Closed-Loop Action

LogicMonitor is advancing Autonomous IT with one platform that brings together complete visibility, AI with context, and governed action across the digital environment. In this announcement video, Andrew Keating shares how LogicMonitor is helping enterprises reduce blind spots, trust AI more, and move from detection to action. Modern IT teams are managing more complexity, more tools, and more noise than ever. That’s why LogicMonitor is bringing infrastructure observability, Internet performance, digital experience, and AI-driven operations together in one platform.

What are operational maturity levels (OMLs) for MSPs?

Service Leadership, a leading company that measures IT and managed service provider (MSP) performance, defines five levels of operational maturity for solution providers. These operational maturity levels (OMLs) help MSPs measure how consistently, intentionally, and effectively they run their businesses.

Approaching the Parhelion

One early spring morning in 1535, the residents of Stockholm awoke to a most curious sight. Six suns lit up the sky, connected by bright halos, as immortalized in Vädersolstavlan, seen here. Today, we recognize these atmospheric effects as a parhelion (also referred to as ‘sun dogs’)—an illusion caused by light refracting off crystalline formations in the atmosphere.

Customize preconfigured views for AWS, Azure, and Google Cloud with Cloud Provider Observability in Grafana Cloud

Part of what makes Cloud Provider Observability in Grafana Cloud really useful is that it gives you prebuilt dashboards and drill-downs for AWS, Azure, and Google Cloud. Out of the box you get service overviews, instance-level views, and quick links to explore your data. However, you might already have dashboards you trust, want a view tailored to your team’s workflow, or need to change which panels show up when you drill into a single instance.

Argo Rollouts Canary Monitoring: Metrics, Gotchas, and Automated Gates with Last9

Argo Rollouts exposes Prometheus metrics on port 8090 — but the docs lie about which labels exist. Here's how to scrape them into Last9, build a canary dashboard, and use Last9 as an automated AnalysisTemplate gate, including the auth and base64 gotchas. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Live Runtime Investigation in Claude Code with Lightrun MCP

In this video, Lightrun’s Dan Putman demonstrates what happens when Lightrun MCP is integrated within Claude Code. See how, once activated, Claude can ask specific questions about what services it can see and instrument in order to perform a deep investigation in production to get to a validated root cause analysis without the friction of redeploying or switching contexts.

Debug Live Production Apps in Codex with Lightrun MCP

Lightrun’s Dan Putman demonstrates the power of the latest Lightrun MCP skill. Watch how your AI code agent can now debug live applications directly in production. By connecting OpenAI's Codex to real-time runtime data via the Lightrun MCP, engineers can now generate and validate hypotheses using live telemetry and snapshots, without breaking flow. Ready to bring runtime context to your AI agents?

Zero-config Go heap profiling

Coroot's node-agent already collects CPU profiles for any process on the node using eBPF, with zero integration from the application side. For Java, we dynamically inject async-profiler into the JVM to get memory and lock profiles. But Go processes were still a blind spot for non-CPU profiling unless the app exposed a pprof endpoint and the cluster-agent scraped it. We wanted the same zero-config experience for Go heap profiles. This post is about how we got there.

That's Not a Job for an LLM: The Right Way to Apply AI to Network Operations

LLMs have sucked all the oxygen out of the AI conversation — but AI is much more than just LLMs, and network engineers have been using AI techniques (machine learning, statistics, fuzzy logic, expert systems, neural networks) for decades. So what should LLMs be doing in network operations, what shouldn't they be doing, and how do agentic AI architectures fit in?

Not All Telemetry Requires Premium Pricing

Observability in software is often framed as a choice between self-hosted and SaaS: manage it yourself, or pay a vendor to handle your data. Both self-hosted and SaaS approaches have their merits, but assuming you must choose one exclusively over the other leads to poor trade-offs: either overcommitting to an all-in-one SaaS despite spiraling costs, or fully self-hosting when it’s unnecessary.

Azure Monitor Collector: Monitor Your Entire Azure Infrastructure From Netdata

If you’re running infrastructure on Azure, you’ve probably dealt with the split between your Azure-native monitoring and the rest of your stack. Your VMs, databases, and Kubernetes clusters generate platform metrics through Azure Monitor, but those metrics live in a separate world from the OS-level, application, and on-prem metrics you’re already watching in Netdata.

N+1 Queries in Rails: A Guide to Detection and Prevention

N+1 queries are the most common performance problem in Rails applications. ActiveRecord’s lazy loading means every belongs_to, has_many, and has_one association is a potential N+1 waiting to happen. The good news is that Rails gives you multiple ways to fix them, and tools like Scout can find them automatically. This guide covers everything a Rails developer needs to know about N+1 queries: what they are, how to fix them, how to prevent them in CI, and how to detect them in production.

Two years without cookies on the site, here's where we ended up

In January 2024, I wrote about removing all advertising cookies and user tracking from sentry.io. It was eight months into the decision at the time, and we were still figuring out what broke and what surprised us. That post struck a nerve: it became one of the most-read things we’ve ever published, probably because everyone building or running a product on the web was watching the same cookie deprecation timeline and wondering what would actually happen if someone just ripped the bandaid off.

Best Practices for a Smooth ERP System Implementation Experience

ERP system implementation requires precise coordination between planning, data handling, and system configuration. Each stage must follow a defined structure to prevent delays and maintain operational accuracy. Clear timelines, assigned responsibilities, and validated processes help ensure that deployment progresses without disruption.

What is AI SRE? The Complete Guide to AI-Assisted Site Reliability Engineering

It's 2:47 AM. PagerDuty fires. You open a Slack alert and see: p99 latency spike on checkout-service. You SSH into the host, check dashboards in four tabs, grep logs for the last 20 minutes, and eventually find a slow query introduced in a deploy six hours ago. It took 34 minutes. You resolved it, ...

Code Agents Need Observability

For those of us using tools like Claude Code, Codex, or Gemini, we already know they’re powerful. They can write code, refactor functions, open PRs, even run commands. For a lot of developers, they’re already part of the daily workflow. But once you zoom out beyond the individual developer, the biggest problem isn’t productivity. It’s control. AI coding tools are powerful, but they introduce a new, unpredictable cost layer that most teams don’t fully understand.

Capturing HTTP Request and Response Bodies in .NET Traces with PHI Redaction

Standard OTel.NET instrumentation captures headers, status codes, and timing, but not request or response bodies. Here's how to add body capture to your traces while keeping PHI out of your observability backend.

Cloud Security Best Practices Every Company Should Follow

Cloud adoption has accelerated dramatically over the past few years - and with it, so has the attack surface for cybercriminals. Whether you're a five-person startup or a 500-employee enterprise, moving your operations to the cloud without a solid security strategy is one of the most expensive mistakes you can make right now.

Introducing StatusGator's Accessibility Conformance Report (VPAT)

At StatusGator, accessibility is a core part of how we build and deliver our product. Today, we’re sharing our latest Accessibility Conformance Report (VPAT), which reflects our ongoing commitment to creating inclusive and usable experiences for everyone.

GitHub outage on April 23, 2026

On April 23, 2026, the first signs of trouble with GitHub did not come from its status page. They came from users. As reports began surfacing across developer communities, including discussions on Hacker News, engineers described failed workflows and unexplained server errors. At that point, GitHub had not yet acknowledged any issue. StatusGator, however, was already seeing the pattern and issued an Early Warning Signal at 14:33 UTC.

Fixing Broken Traces in GCP Cloud Run: A Custom OpenTelemetry Propagator

GCP's load balancer silently rewrites your traceparent header, orphaning spans in any OTLP backend. Here's the custom propagator that fixes it.

From Keyword Search to Ask AI: How We Upgraded AppSignal's Docs Experience

Documentation search is often the last thing devs think about, until someone posts publicly that they couldn't find a basic answer, or your support queue fills up with things that are genuinely in the docs. We decided to get ahead of that. This is the story of how we went from a minimal keyword-only search on our docs to a conversational Ask AI experience.

Sentry + Claude Agents: Automatic Bug Fixes from Root Cause to PR

Seer, Sentry's AI debugger, automatically analyzes your issues and finds the root cause. Now you can pass that analysis directly to a Claude agent - a managed agent session in the Claude Console at platform.claude.com. Once it's done, a link to the branch appears in Sentry so you can review and merge the PR. This video walks through how the integration works and how to set it up in under two minutes.

What Is Mean Time to Resolve (MTTR)? (And How to Improve It)

Every minute a network incident goes unresolved costs your company money: lost productivity, missed SLAs, degraded user experience, and, in some cases, direct revenue loss. For IT teams and network admins, the pressure to resolve incidents fast isn't just operational; it's existential.

Database Performance Monitoring: Query-Level Visibility Across 14+ Databases

Netdata has always collected database metrics: connections, throughput, replication lag, buffer cache hit ratios, and so on. These tell you that something is wrong, but they don’t tell you why. When your PostgreSQL response time spikes, the metric alone doesn’t tell you which query is responsible. For that, you’ve traditionally needed to SSH into the box, connect to the database, and run diagnostic queries manually. Or set up a separate database monitoring tool entirely.

How is Agentic AI fundamentally different from earlier automation?

Autonomous operations has been the goal for years. But most “automation” never got us there; it just helped teams keep up. Now that’s changing. Agentic AI introduces a fundamentally different model: purpose-built agents, not static workflows; real-time decisioning, not predefined rules; collaboration across agents, not isolated tasks. Instead of automating steps, agentic AI enables systems to reason, adapt, and act at a speed and scale humans simply can’t match. That’s what turns autonomous operations from a long-standing ambition into something actually achievable.

A Better Way to Run Network Operations: How Actionable Correlation Eliminates Alert Chaos

Anyone who has spent time in a NOC knows how quickly a routine issue can turn into a scramble. A user in a branch office reports that a critical application is unavailable. Slack starts lighting up, dashboards begin to fill with warnings, and before long several teams are trying to answer the same basic question at once: what exactly is broken, where is it broken, and who owns the next move?

13 Best Incident Management Software Compared in 2026

Every minute of downtime costs your organization money. Sometimes a lot of money. Gartner puts the average cost of IT downtime at roughly $5,600 per minute, and that number climbs fast when a major incident hits and your team is still scrambling to figure out who owns the problem. That’s where incident management software earns its keep. When something breaks at 2 a.m., you don’t want to be hunting through email threads figuring out who’s on call.

The Hidden Cost of DIY DevOps: Why Growing Companies Bring in the Experts

Companies are scaling faster than ever, but infrastructure rarely keeps up with the product. When developers take on operational work on top of everything else, it feels like a smart way to cut costs. In practice, it's one of the most expensive mistakes a growing software team can make. This article breaks down what DIY DevOps actually costs and how a structured approach changes the equation.

Top tips: When leaders leave, here's how to keep your IT systems stable

Top Tips is a weekly column where we look at what’s shaping the tech world and share practical ways teams can stay prepared for what’s next. This week, we’re focusing on a situation many teams underestimate—what happens to your IT systems when a key leader steps away, and how you can build stability that doesn’t rely on any one person. Some problems don’t show up when things are running smoothly. They show up when someone leaves.

When agents orchestrate agents, who's watching?

You used to monitor services. Then you started monitoring AI calls inside services. Now your AI agent is spinning up other AI agents to complete tasks. Your old monitoring instincts need to evolve. This isn't hypothetical. Agentic architectures are already in production. Coding agents are calling search agents; orchestrators are spawning specialized sub-agents for retrieval, planning, and execution. Teams are shipping these systems faster than they're figuring out how to watch them.

How Recurring Instability Turns into Clinical Trial Delays

In pharma, reliability becomes an operational priority because research and trial work depend on systems performing consistently across different teams, locations, and conditions. Much of that work sits inside scientific workflows, remote sessions, and compute-heavy environments where behaviour can shift with configuration or load. When that consistency starts to break down, teams keep moving, but time is lost in small increments across the day.

Why Your PromQL Availability Query Returns Nothing When Services Are Healthy

Your SLI query shows 100% availability as No Data. Here's why PromQL returns empty results instead of zero, and the label-preserving fix.

Take Control of Cloud Costs with Proactive Budget Alerts

Proactive budget alerts turn cloud cost optimization into an everyday operational practice. If you are responsible for managing cloud infrastructure, you already know the pattern. Costs creep up quietly, and by the time anyone notices, it is the end of the month and you are explaining instead of preventing overruns. According to Flexera’s 2026 State of the Cloud Report, 85% of their respondents say managing cloud costs is their number one priority for the year.

VictoriaMetrics at KubeCon Amsterdam: Community Highlights

KubeCon + CloudNativeCon Europe in Amsterdam brought together about 13,500 attendees this year, the largest turnout yet. The size of the event showed just how much the cloud-native space has grown, and how central observability, platform engineering, and cost control have become. For VictoriaMetrics, this year’s event was a mix of talks, booth conversations, and a lot of direct feedback from users.

Managing OpenTelemetry Semantic Convention Migrations With the Collector

Real production data tells the story better than I can. Juraci Paixão Kröhling, a friend and fellow observability practitioner at OllyGarden, recently shared an example from an anonymized production environment: 1,830 occurrences of http.url and 23,984 occurrences of url.full in the same dataset. Both attributes describe the same thing. Both are actively being written to the same backend at the same time.

Setting the Bar for Agentic NetOps

AI has quickly become part of the language of network observability. Many vendors across the observability landscape can describe, summarize, correlate, or explain some data or situation, leveraging basic LLM capabilities. At a distance, many of these offerings sound similar. They promise faster insight, efficient operations, and a more intelligent path through rising complexity. But the industry has reached a point where surface-level similarity is creating noise, not value.

Apache ActiveMQ vs Apache Artemis: The 2026 Definitive Guide

When engineers search for "Apache ActiveMQ vs Apache Artemis," most of what they find is either a shallow feature checklist or a confident recommendation to "just migrate to Apache Artemis." Neither helps a senior architect deciding whether to stay on a stable, battle-hardened Apache ActiveMQ deployment, or a platform team evaluating both options for a new system with clear eyes.

ActiveMQ Dead Letter Queue (DLQ) Management: The Complete Guide

If your Apache ActiveMQ deployment has a growing ActiveMQ.DLQ, you are not alone, and you are looking at the right problem. An unbounded, unmonitored dead letter queue is one of the most common root causes of "invisible" message loss in enterprise messaging environments. DLQ messages land without fanfare, nobody notices, and business-critical data quietly disappears from the processing pipeline.

What Is Wrong With PaaS Today?

In the early 2010s, PaaS felt like magic. You focused on the code, and the platform did the rest. You could ship a production app without knowing anything about networking or, heck, even what a load balancer is. Heroku in particular made deployment an afterthought, especially for early-stage companies. That era is somewhat over, not because platforms got worse overnight, but because the assumptions underneath them quietly stopped being true.

Announcing Icinga 2.16.0 and 2.15.3

We are happy to announce the release of two new versions of Icinga 2 today, 2.16.0 and 2.15.3. The first one includes some new features highlighted below, as well as a number of bug fixes and other improvements. The latter one is a small bug fix release that brings some of the other fixes included in 2.16.0 to the 2.15.x branch as well.

Test network paths with TCP, UDP, and ICMP in Datadog

When developers and SREs design application tests, they often prioritize user workflows and API availability. Extending that suite with network tests that match your app’s traffic protocols can reveal whether issues originate in the network or application layer. In this post, we’ll explore how you can design effective network tests using the Transmission Control Protocol (TCP), User Datagram Protocol (UDP), or Internet Control Message Protocol (ICMP).

The product signal latency gap slowing your growth

Organizations often call product managers the CEOs of the product. But PMs know that’s a myth. When a CEO wants a status report, they get one immediately. They don’t need to negotiate for engineering time, reconcile conflicting project priorities, or wait for a data scientist to find a gap in their schedule. For most PMs, simply understanding the state of the product is where growth can stall.

VictoriaMetrics Virtual Meetup Q1 2026 - VictoriaMetrics Updates

VictoriaMetrics continues to enhance usability and developer experience with new built-in capabilities. A lightweight UI now provides clear client setup instructions, simplifying onboarding, while an integrated inspector offers powerful debugging tools directly within the platform. Default tenant configuration further streamlines initial setup, reducing friction for new deployments. In addition, the MCP Server is now included by default in VictoriaMetrics Cloud deployments, eliminating the need for manual installation and making advanced monitoring workflows more accessible out of the box.

AI agents are only as smart as the data you feed it

AI is only as useful as the context you give it. An autonomous observability agent can unlock serious value from your telemetry, but only when the foundation is right: good telemetry, a strong data layer, and efficient access to the data. Annie Freeman and Lewis Isaac had a lot to say about this at AWS Summit London this week!

Beyond Uptime: Building a Self-Healing OpenClaw Observability Stack

The allure of OpenClaw is undeniable. You deploy a highly autonomous, self-hosted AI agent, give it access to your repositories and inboxes, and watch it reason through complex workflows while you sleep. It is the dream of the ultimate 10x developer tool realized. But as any veteran DevOps engineer will tell you: running an LLM-backed Node.js agent in production is vastly different from testing it on your local machine.

Observability Focus: Why It Became the Default Language of Modern IT Operations

Digital services run on fragile highways of microservices, containers, and event streams. Outages no longer hide inside a single server rack; they ripple across regions and ruin brand trust in minutes. Because uninterrupted insight now decides whether a launch soars or stalls, engineers treat observability as the vocabulary for every architectural choice, deployment ritual, and post-incident review. Similar discipline emerges in studios that refine professional end-to-end game dev workflows, where frame drops and lag spikes receive the same diagnostic rigor expected of banking APIs.

AWS Outage History: What Engineering Teams Should Learn

If you've been running production workloads on AWS for more than a year, you've felt it: the 3 am PagerDuty alert, the scramble to check the AWS console, the frantic Slack thread asking, "Is this us or is this AWS?" And then, minutes or hours later, the AWS Service Health Dashboard finally acknowledges what your users have been experiencing all along. It happens because AWS is the backbone of modern infrastructure.

Top 13 Prometheus Alternatives in 2026

Prometheus is a widely adopted open-source monitoring and alerting toolkit, popular among DevOps and SRE teams for its robust metrics collection and powerful query language (PromQL). It is fast, reliable, and purpose-built for modern, cloud-native environments. However, Prometheus may not suit all teams or projects. In 2025, several alternatives offer different strengths that might better match your specific monitoring needs.

Alloy, OpenTelemetry & Instrumentation Community Call LIVE from GrafanaCON 2026

Join us live from GrafanaCON 2026 for the Alloy, OpenTelemetry & Instrumentation Community Call! We’re kicking things off with a look at everything happening across Alloy and the OpenTelemetry ecosystem, alongside special guests Ted Young, Mischa Thompson, and Liudmila Molkova. In this session, we take a look back at Alloy’s rapid growth and adoption, explore the introduction of the new OpenTelemetry Engine, and dive into fleet management, instrumentation, and onboarding at scale.

Loki Community Call LIVE from GrafanaCON 2026

Join us live from GrafanaCON 2026 for the Loki Community Call! We’re kicking things off with a look at everything happening in the Loki ecosystem, alongside special guests Poyzan Taneli, Ben Clive, and Trevor Whitney. In this session, we take a look back over the last year in Loki, explore the brand new “Thor” architecture, and dive into what’s coming next for logging at scale. From a completely new columnar storage format and Kafka-based ingestion to a redesigned query engine and improved support for high-cardinality data, Loki is evolving to meet the demands of modern logging.

Pyroscope Community Call LIVE from GrafanaCON 2026

Join us live from GrafanaCON 2026 for the Pyroscope Community Call! We’re kicking things off with a look at everything happening in the Pyroscope ecosystem, alongside special guest Alberto Soto. In this session, we take a look back over the last year in Pyroscope, cover what’s new in continuous profiling, and preview what’s coming next. From multi-language source code integration and symbolization improvements to OpenTelemetry profiles and performance gains, Pyroscope has evolved rapidly over the past year.

Modernizing a legacy CMake build-system

CMake tends to have a bad reputation for being too complex and convoluted, but often that notion stems from very old versions of CMake. Sure, CMake is a Turing-complete scripting language, but that is really needed for an ecosystem as complex as that of C and C++. And as Greenspun’s tenth rule of programming goes: “Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” There are countless build-systems and build-system generators for the C/C++ ecosystem. Some of them tried to use a simple, declarative approach.

Nagios Plugins Collector: Run Your Existing Checks and Custom Scripts Inside Netdata

A lot of teams have a collection of Nagios plugins and custom monitoring scripts that have been running reliably for years. Some are standard community plugins for checking disk health or SSL certificate expiry. Others are homegrown Bash or Python scripts that check something very specific to the business: whether an API endpoint returns the right payload, whether a batch job completed on time, whether a queue depth is within bounds.

New: SSL Certificate Monitoring, Security Center, Domain & SSL Expiration Tracking - Plus Our Affiliate Program

DNS Spy now goes well beyond DNS record monitoring. We've shipped SSL certificate discovery and security auditing, expanded the Security Center to 40+ automated checks across six categories, and built expiration tracking for both domains and SSL certificates — with tiered alerts so nothing expires without warning.

Turn developer feedback into operational insight with Datadog Forms and Sheets

Engineering organizations rely heavily on developer feedback to improve internal platforms, tooling, and processes. However, that feedback is often scattered across disconnected systems such as external forms, spreadsheets, chat threads, and documentation tools. Because these systems are separate from operational data, teams struggle to correlate developer sentiment with measurable performance or reliability outcomes.

Why Enterprise AI Demands More Than Just Automation

Based on insights from The Intelligent Enterprise podcast, “The Evolution from Automation to Autonomy.” Every couple of weeks, The Intelligent Enterprise podcast steps away from the day-to-day noise of enterprise life to explore big ideas from a fresh perspective. In one recent episode, the focus turned to a question many organizations are still grappling with: What does it really take to build an AI-powered enterprise that works with people, not against them?

Why Alert Fatigue Is Killing Your MTTR

Every minute counts when production systems go down. Yet the average enterprise NOC team receives over 1,000 alerts per day, according to a 2025 study by OpsRamp. Of those, fewer than 5% require human intervention. The rest? They are noise — redundant, low-priority, or symptomatic signals that bury the genuine incidents demanding immediate attention.

How to Use Time Series Autoregression (With Examples)

Time series autoregression is a powerful statistical technique that uses past values of a variable to predict its future values. This approach is particularly valuable for forecasting applications where historical patterns can inform future trends. In this hands-on tutorial, you’ll learn how to implement autoregressive (AR) models using Python and see how InfluxDB can enhance your time series analysis workflow.

Episode 10 - How I Learned to Stop Worrying and Love AI

Are we still in the first chapter of AI, and mistaking it for the whole story? In this episode of The Intelligent Enterprise, host Tom Stoneman zooms out from the headlines to explore where we really are in the AI journey. He’s joined by journalist and independent analyst Joe McKendrick, who has spent decades documenting how emerging technologies reshape business and society. As co-chair of the AI Summit in New York and a senior contributor to Forbes and ZDNet, Joe brings the perspective of someone who understands how these stories unfold over time.

Join operator and Query Agent for smarter log analysis

Sumo Logic’s log analytics capabilities have always provided the greatest insights to help you secure, monitor and troubleshoot your environment. Now, with our Query Agent, as part of Dojo AI, creating optimized log searches with natural language is even easier. Query Agent works with a wide variety of operators, including the join operator, for parsing, aggregation, data transformation, filtering, advanced analysis and lookup.

The New Economics of Enterprise AI: Why Small Models Win Where It Matters

For years, progress in AI was equated with scale. Larger models, broader parameter counts, and increasingly complex cloud architectures were treated as signals of advancement. In enterprise operations, however, scale alone does not determine success. Economics does. As AI becomes embedded in operational workflows, organizations are discovering that model size is less important than cost stability under continuous load. AI-driven operations do not run in bursts. They run constantly.

Bridging IT and OT: Lessons from the Factory Floor with Steve Goudreau

Everyone’s rushing to AI, but few have the foundation to make it work. In this episode of Next Gen Network Heroes, Bob sits down with Steve Goudreau, Director of IT at Ice Industries, to explore what it really takes to lead in today’s evolving technology landscape. With over three decades of experience, spanning military service, financial services, and manufacturing, Steve brings a grounded, people-first perspective to an industry often obsessed with tools and trends.

DataPrime at Ingest: Fine-Grained TCO Routing with DPXL

The real economic decision for observability happens at ingest, before storage, billing, and retention choices are locked in. Until now, the logic governing that decision could only see three broad fields: application, subsystem, and severity. That just changed. TCO routing now matches on any field in the event payload, including nested keys, custom fields, and event body content, using DPXL, the DataPrime Expression Language.

What is Network Monitoring? Why Every IT Team Needs It (2026)

Learn what network monitoring is and why it’s critical for IT teams in 2026. Discover how it works, key metrics to track, and how to prevent downtime before users are impacted. Modern IT environments are complex; network monitoring helps you detect issues early, reduce downtime, and keep your infrastructure running smoothly. Watch now and monitor your network with confidence.
Sponsored Post

From Microsoft SCOM to Dashboards

System Center Operations Manager (SCOM) remains one of the most capable on-premises monitoring platforms for Microsoft environments. However, as IT operations evolve toward real-time observability and self-service insights, traditional SCOM reporting and consoles can feel restrictive. This whitepaper explores practical ways to extend and modernize your SCOM visualizations using today's leading dashboarding technologies - including SquaredUp, Grafana, Power BI, and Azure Workbooks.

Moving Beyond SolarWinds: Building a Modern Observability Strategy

For years, platforms like SolarWinds have been a standard in IT environments. They helped teams answer a fundamental question: are systems up or down? That approach worked well when environments were more contained and predictable. The challenge is that most environments no longer operate that way. Hybrid infrastructure, cloud services, and tightly interconnected applications have changed what “visibility” needs to mean.

New: More control with Recovery Notices

We’ve added a new notification option to give you more control over how and when you get alerted: Recovery Notices. Until now, notifications were primarily focused on incidents – letting you know when something goes wrong. But we heard from many of you that not all alerts are equally useful. While some teams want full visibility across the entire lifecycle of an incident, others are mainly concerned with when a service goes down, not when it comes back up.

Forget user experience, the age of user extraction is here

Does it ever feel like the days of simple, user- and pocket-friendly digital services are now a bygone era? Is everything just a reminder of how things used to be better? Dramatic language and rose-tinted glasses aside, you would be naive not to notice that service providers are becoming increasingly predatory, especially when it comes to monetization. Ads are everywhere, privacy policies are questionable at best, and costs keep rising.

Instrumenting WordPress with OpenTelemetry: PHP Tracing, Browser RUM, and Error Capture in Production

WordPress powers 40% of the web but has no native observability story. Here's how to instrument it end-to-end with OpenTelemetry - PHP, browser RUM, and errors. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

No more monkey-patching: Better observability with tracing channels

Almost every production application uses a number of different tools and libraries, whether that’s a library to communicate with a database, a cache, or frameworks like Nest.js or Nitro. To be able to observe what’s going on in production, application developers reach for Application Performance Monitoring (APM) tools like Sentry. But there’s an inherent problem: the performance data that APM tools need most often does not come natively from the libraries themselves.

GrafanaCON 2026 announcements: A guide to all the latest news from Grafana Labs

GrafanaCON 2026 kicked off in Barcelona, which is a fitting city to reveal the latest updates in Grafana 13. In 2013, Grafana Labs Co-founder Torkel Ödegaard made the first commit for what would become Grafana while he was on vacation in the Catalan city. "I was traveling here for the Christmas holiday and I got a cold and spent most of the day in bed coding and working on Grafana," said Torkel during the opening keynote of GrafanaCON, our biggest community event of the year.

AI Observability in Grafana Cloud: A complete solution for monitoring your agentic workloads

The observability industry has developed great tools for using metrics, logs, traces, and profiles to monitor the cloud native applications that have dominated the last decade of software development. But when it comes to understanding what an AI system is actually doing, we’re often left reading raw conversations, guessing at quality, and reacting too late. And that’s a problem.

Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Evaluating agents is hard. Verifying observability tasks is harder. Yes, AI agents have gotten dramatically and quantifiably better at coding and tool use, but observability presents a different kind of challenge. In a real incident, the hard part is rarely just writing a query. It's deciding which signal matters, figuring out whether a spike is noise or symptom, correlating metrics with logs and traces, and sometimes making a change in Grafana without breaking the dashboard another engineer depends on.

Grafana 13 release: get value from your data faster, manage operations at scale, and more!

Who says 13 is unlucky? With the release of Grafana 13, we're giving the community the most streamlined, flexible, and intuitive Grafana experience yet. Unveiled during the opening keynote of GrafanaCON 2026, the latest major release is all about helping you get value from your data faster, whether you’re spinning up dashboards, operating Grafana at scale, or extending the platform as your requirements change. Download Grafana 13.

Why Threshold Monitoring Fails in Distributed Systems

For years, infrastructure stability could be approximated through static limits. If CPU utilization exceeded a defined percentage or response time crossed a fixed boundary, risk was assumed to increase in a predictable way. Monitoring systems were designed around that assumption, and for contained environments, it largely held true.

Identify and fix code issues faster with Datadog's Azure DevOps Source Code integration

Developers and SREs who rely on Microsoft Azure DevOps often face fragmented workflows when investigating issues or reviewing code quality. Troubleshooting an error can require jumping between observability tools and source code repositories as you manually connect traces, stack frames, and commits. At the same time, security vulnerabilities, misconfigurations, and flaky tests may go undetected until later stages of the software delivery life cycle (SDLC), where they are more costly to fix.

Bringing observability data hosting to the UK on AWS

UK organizations are increasingly required to design systems that account for data residency requirements, ensuring that operational data remains within national boundaries. Many teams already run their applications on AWS infrastructure in the UK, but telemetry data can still be processed outside the region, creating gaps in visibility. Datadog’s upcoming UK availability zone solves this by keeping telemetry data in the same region as the workloads that generate it.

Introducing the ChangeTower Website Monitoring Chrome Extension

Setting up website monitoring has always meant a small but annoying detour. You spot a page worth watching, copy the URL, switch tabs, log into your monitoring tool, paste, configure, save. By the time you’re done, you’ve lost whatever train of thought sent you there in the first place. We’re fixing that. Today we’re excited to announce the ChangeTower Chrome Extension — now open for waitlist signups.

Monitoring CPU and Memory on Your VPS with AppSignal

Most of us run multiple virtual private servers (VPS) at a time. That’s why it’s important to keep an eye on CPU and memory usage. However, since this step often slips our minds, there is room for automated monitoring. Open-source tools tend to be the default choice, and for good reason. The problem is that they don't provide everything you need for monitoring in a single place. As a result, you may find yourself writing custom shell scripts for automation.

Git Sync: Observability as code built for scale | Demo | Grafana Labs

In this video, Fabrizia Rossano and Roberto Jiménez demonstrate Git Sync, a feature that provides you with the power of Git version control right in your Grafana instance. Git Sync enables you to submit changes in your dashboards as pull requests and get them reviewed by your team directly from Grafana or from Git.

Grafana 13 TL;DR - What's New (and Worth Your Time)

Grafana 13 is here! In this video, we walk through the biggest updates and improvements, from faster ways to build dashboards to new features that make Grafana easier to manage at scale. We cover things like: If you’ve ever struggled with broken dashboards, messy layouts, or just getting started from scratch, this release focuses on making those workflows a lot smoother. This is a TL;DR, so we’re just scratching the surface—but it should give you a solid sense of what’s new and what’s worth checking out.

The Modern Messaging Primer: Navigating the Shift from Legacy Middleware to Open Source Innovation

The shift from legacy middleware to open-source innovation promises agility and cost savings, but introduces the 'Modernization Tax'—operational complexity that requires new approaches to observability, governance, and management across hybrid messaging environments.

What's New in VictoriaMetrics Cloud Q1 2026? Logs, MCP Server, Better Alerting, and... a Secret Project

Q1 2026 has been one of our most eventful quarters yet for VictoriaMetrics Cloud. We shipped something we have been building towards for a long time, crossed a few infrastructure milestones, and started clearing the path for what is coming next to the most performant observability stack.

Release v2.10: Secrets Management, Nagios Plugin Collector, Azure Monitor, and more

What’s new in Netdata v2.10: this release brings powerful new capabilities to help you monitor, troubleshoot, and understand your infrastructure faster without complexity. In this video, we walk through the key updates: Secrets Management (securely manage sensitive configuration data), Nagios Plugins Collector (extend monitoring using existing Nagios plugins), and Azure Monitor (bring Azure metrics into Netdata for unified visibility).

What is Application Performance Monitoring (APM)?

A modern web application is not a single thing. A single user request may touch a web server, a database, a cache layer, and several third-party APIs before a response comes back. And as AI tools generate more and more application traffic (API calls, background jobs, automated workflows), the volume and unpredictability of that traffic is growing. When something goes wrong, it could be any of it. When something is slow, it could be all of it at once.

Grafana Assistant everywhere: Customize and connect to the AI agent to fit your specific needs

The ways you and your teams build and observe your systems are changing. It’s no longer just engineers looking at dashboards, or writing queries or config files. More often, it’s an agent interacting with the data, too, helping write code, run applications, investigate incidents, rightsize deployments, and more.

Introducing Pyroscope 2.0: faster, more cost-effective continuous profiling at scale

Continuous profiling is becoming a standard part of the observability stack, and for good reason. It's the only signal that tells you why your code is slow or expensive, not just that it is. Metrics tell you CPU usage is high. Logs tell you a request was slow. Traces tell you which service is the bottleneck. But only a profile tells you which function, on which line, is burning the cycles. As systems grow more complex, that level of visibility becomes essential.
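
As a small illustration of what a profile adds over metrics, logs, and traces, the sketch below uses Python's standard-library `cProfile` to show which function accounts for the time in a call. This is only an analogy and makes several assumptions: `cProfile` is a deterministic, function-level profiler run once, not a continuous sampling profiler like the one described, and `slow_sum` and `handle_request` are hypothetical names.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately CPU-heavy loop standing in for a hot code path."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def handle_request():
    """Hypothetical request handler that calls the hot path."""
    return slow_sum(200_000)


# Profile a single call and collect per-function timing statistics.
profiler = cProfile.Profile()
result = profiler.runcall(handle_request)

# Render the five most expensive entries by cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

The printed report lists `slow_sum` among the top entries, which is the function-level answer that a CPU usage metric alone would not give.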

Building the AI Stack for Modern Network Operations - Surya Nimmagadda

AI is rapidly transforming network operations — but what does it actually take to build an AI stack that works in production? In this session from AI for Network Leaders – Powered by Selector, Surya Nimmagadda breaks down how modern AI systems for network operations are designed, deployed, and used today. He covers: This session is designed for network engineers, architects, and operators looking to move beyond theory and understand how AI is being applied in real production environments.

Frontline Truths: 100+ Network War Stories on the Path to Autonomous Operations - Eric Chou

The path to intelligent network operations isn’t a straight line. In this session from AI for Network Leaders – Powered by Selector, Eric Chou shares hard-earned lessons from over 100 conversations with network engineers and operators navigating automation, complexity, and the shift toward AI-driven operations. He covers: This session is a practical field guide for teams looking to move from reactive firefighting to building an AI-ready network foundation.

You Don't Have an AIOps Problem-You Have a Data Opportunity - Michael Wynston

AI can’t fix bad data. In this session from AI for Network Leaders – Powered by Selector, Michael Wynston breaks down a critical truth: the success of AIOps depends on the quality, consistency, and trustworthiness of your network data. Using real-world lessons from Fiserv’s large-scale network transformation, he explores how teams can build a strong data foundation that enables AI to deliver meaningful, low-noise outcomes.

Inside the AI Agents Transforming Network Operations - Joby Rudolph & James Schnebly | Selector

AI agents are becoming a core part of modern network operations — but what does it actually take to build and deploy them effectively? In this session from AI for Network Leaders – Powered by Selector, Joby Rudolph and James Schnebly break down how AI agents are designed, implemented, and applied in real-world network environments. They cover: This session provides a practical look at how AI agents are moving from concept to production — and what it takes to make them work at scale.

Automate Network Discovery and Mapping with SolarWinds Network Topology Mapper

SolarWinds Network Topology Mapper (NTM) helps you automate network discovery and mapping, saving man hours. With a variety of discovery methods like SNMP, CDP, ICMP and WMI, NTM helps you have an up-to-date map of all your routers, switches, firewalls, servers, desktops, and workstations. SolarWinds Network Topology Mapper enables you to export the maps to a variety of formats including Visio, PNG, Network Atlas and PDF for easier documentation. With NTM, you can have up-to-date network diagrams to comply with PCI, HIPAA and other regulatory requirements.

Automate, Create and Export Network Maps to Visio with SolarWinds Network Topology Mapper

With a variety of discovery methods like SNMP, CDP, ICMP and WMI, SolarWinds Network Topology Mapper helps you maintain an up-to-date map of all your routers, switches, firewalls, servers, desktops, and workstations. Featuring industry-standard symbology for network nodes, NTM supports multiple options for map alignment, improved EtherChannel support and representation, the ability to create multiple maps from a single scan, and more.

Fast AI Feedback Loops with Honeycomb and OpenTelemetry

Are you writing agentic applications, but aren’t sure what the agents are doing? Finding out too late that you've blown the budget with super expensive models? Not sure where the agents are failing, and feeling a loss of control? Could they do better? Observability is the visibility you need to get the job done. Sending telemetry to Honeycomb explains what your agents are actually doing.

From Edge to Enterprise: How Litmus and InfluxDB Are Modernizing the Industrial Data Stack

Today at Hannover Messe, InfluxData is announcing a strategic partnership with Litmus to address one of the most persistent challenges in industrial data: getting reliable, contextualized telemetry from the shop floor into production systems. Litmus bridges the gap between OT systems and modern IT infrastructure, while InfluxDB serves as the industrial data hub, giving organizations both real-time operational visibility and enterprise-scale historical analysis in a unified architecture.

AppSignal x Hatchbox: Affordable Hosting, Full Visibility

Affordable hosting has always been a puzzle. Heroku made deploying Rails apps simple, but with Salesforce at the helm, active development has stalled. Many developers are left wondering what comes next, locked into a platform that is no longer moving forward. Chris, the founder of GoRails, felt that same frustration. That is why he built Hatchbox. Hatchbox handles your deployments, runs on servers you own, and keeps costs predictable. No dyno management, no add-on sprawl.

Secrets Management: Get Credentials Out of Your Netdata Configuration Files

If you’re running Netdata collectors that connect to databases, APIs, or other authenticated services, there’s a good chance you have passwords sitting in plain-text configuration files right now. It works, but it’s the kind of thing that makes security teams nervous and makes credential rotation painful. Every password change means editing config files and restarting collectors.

Progress Flowmon Roadmap 2026 and Beyond

Explore what’s ahead for Progress Flowmon in this roadmap session presented by Head of Product Nick Vlasov. Learn about upcoming innovations in AI‑driven analytics, automated investigation playbooks, detection tuning improvements and long‑term platform direction. Perfect for network engineers, security analysts and IT leaders looking to strengthen visibility, performance and security. Watch now to see what’s coming next.

How to solve key site reliability engineering challenges

Modern site reliability engineering challenges stem from the difficult requirement of confirming why complex systems fail in ways staging cannot replicate. While observability tools signal failures, and AI SREs reason over data, they leave observability gaps regarding the actual state of running code. By utilizing runtime context, teams capture live execution data to accelerate production debugging, resolving incidents in minutes without requiring manual redeploy cycles.

Monitor Databricks with Grafana Cloud for instant visibility into your workloads

If you're running Databricks workloads, you've probably asked yourself these types of questions: How much is this costing me? Why did that job fail last night? Why are my dashboard queries suddenly slow? We've been there, too. Databricks is fantastic for data engineering, ML, and analytics. But once you start running jobs, pipelines, and SQL queries at scale, you need a way to keep tabs on what's happening. That's why we built the Databricks integration for Grafana Cloud.

How Observability Powers Autonomous IT in Hybrid Environments

Autonomous IT only works when observability gives it the context to act with confidence. On any given day, a mid-size enterprise generates tens of thousands of alerts across on-prem infrastructure, multiple clouds, SaaS tools, Internet dependencies, and AI workloads. Most of them don’t need a human. A few of them do. Telling the difference, fast enough to matter, is exactly where IT teams are losing ground.

Uptrace MCP Server: Auto-Generate Dashboards with AI in Minutes

Tired of clicking through menus to build observability dashboards? In this video I walk through how to configure the Uptrace MCP (Model Context Protocol) server and connect it to an AI assistant so your dashboards get created automatically from natural-language prompts. You'll learn how to: By the end you'll have a working setup where describing what you want to monitor is enough to get a real, shareable dashboard in Uptrace.

Observability is a design problem: Live Laugh Logs ep. 1 - KubeCon Amsterdam 2026

What happens when 20,000 engineers descend on Amsterdam to talk about Kubernetes and AI? Welcome to Episode 1 of Live Laugh Logs, the podcast from Annie, Lewis and Andre from the Coralogix Developer Relations team where we will get together and recap everything going on in our worlds! We had an amazing time at KubeCon in Amsterdam and had loads of insights from the talks we went to around designing observability systems, all the AI tools being created and how to observe them, and using agent-generated code.

Building Audit-Ready Observability for Digital Banking

Most observability platforms are built to answer one question: what’s broken right now. Regulators are asking a different one: what happened, exactly, and can you prove it? Digital banking operates under constant regulatory scrutiny, where frameworks like DORA, PCI-DSS, and GDPR require every incident to be fully reconstructed across systems, timelines, and access. Systems can recover quickly, but the ability to explain what happened often remains fragmented across tools and teams.

From Tools to Teammates: A Practical Framework for AI Agents in Network Operations - Du'An Lightfoot

AI agents are quickly moving from experimentation to real-world deployment in network operations — but how do you adopt them without introducing unnecessary risk? In this session from AI for Network Leaders – Powered by Selector, Du’An Lightfoot shares a practical framework for building and deploying AI agents in production network environments. He covers: This session cuts through the hype and provides a clear, actionable model for teams looking to move from AI as a tool to AI as a teammate.

Where is your business wasting time & money?

Whether you have a new startup or an established company, it is very likely that your business is losing time and money. Worse still, it's likely happening in multiple places. Thankfully, if you are prepared to identify and address those issues, you can significantly improve the venture. Here are five focal points that should lead you to greatness.

Why Commercial Roofs Are Quietly Becoming Smart Infrastructure

Here's something most building owners don't think about until it's too late: the roof over your head is no longer just a passive layer of protection. It's becoming one of the most strategically important assets in your entire portfolio.

The Strategic Advantage of App Intelligence: How Data-Driven Insights Fuel Mobile Growth

In today's hyper-competitive mobile ecosystem, launching an app is no longer the hardest part; scaling it is. With millions of apps competing for attention across major app stores, success depends on more than just a great idea or clean design. Developers, marketers, and analysts must rely on data to understand user behavior, monitor trends, and outmaneuver competitors. This is where mobile app intelligence platforms have become essential.

AI Meeting Bots Were Just the Beginning. Meet the AI Collaborator

Why the next era of enterprise AI isn’t about note-taking — it’s about digital workers who actually show up and do the work. There’s a moment every IT operations leader knows well. A critical incident hits at 2 PM on a Tuesday. Within minutes, a war room meeting spins up — a Google Meet or Teams call crowded with network engineers, SRE leads, cloud architects, and storage admins, all staring at dashboards and talking over each other. Someone is manually pulling syslog data.

Debug frontend issues with AI: Real user monitoring meets the Coralogix MCP server

It is 2 AM. Someone on-call gets paged. Conversion rates on the checkout page dropped 30 percent in the last hour. The immediate questions are familiar. Is this a JavaScript error? A slow API call? A broken third-party script? A performance regression that never throws an exception but quietly drives users away? In most teams, answering those questions is not hard because the data is missing. It is hard because the investigation is split across too many places.

Bitbucket outage on April 16, 2026: StatusGator detected issues 77 minutes earlier

On April 16, 2026, Bitbucket experienced a widespread outage that disrupted pipelines and core functionality for users around the world. StatusGator detected the issue 77 minutes before the provider officially acknowledged it, using its Early Warning Signals. This early detection gave teams critical time to respond, even while the official status page still showed everything as operational.

How to define your monitoring requirements (before you talk to a vendor)

This is a guest post from Laura Copeland. Key insights from a fireside chat with Chris Yates. Part 1. Choosing the right database monitoring vendor isn’t just a technical decision, it’s a strategic one that affects your teams, your estate, your growth plans, and the culture of your organisation. It’s also a personal one if you’re a DBA. Something as critical as your monitoring system will shape your day‑to‑day work, and, in many cases, how well you sleep at night.

Centralize observability management with Datadog Governance Console

As organizations grow, they face increasing difficulty in managing their observability efforts. More teams mean more dashboards, monitors, API keys, pipelines, and custom configurations. Without a centralized view, administrators spend hours chasing down untagged resources, investigating surprise bills, and revoking dormant credentials. Governance becomes a reactive effort to reduce waste and address issues, falling short of its potential to proactively create standards and optimize observability.

Honeybadger Insights Parameterized Queries

Make your Honeybadger Insights dashboards and queries dynamic with parameterized queries. In this short walkthrough, we'll take a static system dashboard — showing load average, memory, and disk usage across a fleet of hosts — and turn it into an interactive view you can filter to a single host with one click. What you'll see: Parameterized queries are a simple way to build one dashboard that serves many views — no duplication, no extra widgets, just a shareable URL.

Healthchecks.io Now Uses Self-hosted Object Storage

Healthchecks.io ping endpoints accept HTTP HEAD, GET, and POST request methods. When using HTTP POST, clients can include an arbitrary payload in the request body. Healthchecks.io stores the first 100kB of the request body. If the request body is tiny, Healthchecks.io stores it in the PostgreSQL database. Otherwise, it stores it in S3-compatible object storage. We recently migrated from managed to self-hosted object storage.
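
The storage rule described above can be sketched as a small routing function. This is a hedged illustration, not the Healthchecks.io implementation: the cutoff for a "tiny" body is not stated in the text, so `INLINE_THRESHOLD` is a made-up value, reading "100kB" as 100,000 bytes is also an assumption, and `plan_body_storage` is a hypothetical name.

```python
MAX_STORED_BYTES = 100_000  # "first 100kB" per the text; exact byte count is assumed
INLINE_THRESHOLD = 100      # hypothetical cutoff for a "tiny" body; not from the source


def plan_body_storage(body: bytes):
    """Return (backend, stored_bytes) for a ping request body.

    Only the first MAX_STORED_BYTES are kept. Tiny bodies go to the
    database; everything else goes to object storage.
    """
    stored = body[:MAX_STORED_BYTES]
    if len(stored) <= INLINE_THRESHOLD:
        return ("database", stored)
    return ("object_storage", stored)


small_backend, small_stored = plan_body_storage(b"ok")
large_backend, large_stored = plan_body_storage(b"x" * 250_000)
print(small_backend, len(small_stored))
print(large_backend, len(large_stored))
```

Keeping small payloads inline saves an object storage round trip for the common case, while truncation bounds how much any single ping can store.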

Setting Up an MQTT Data Pipeline with InfluxDB

In this blog, we’re going to take a look at how you can set up a fully-functioning, robust data pipeline to centralize your data into an InfluxDB instance by collecting and sending messages with the MQTT protocol. We’ll start with a brief overview of the technologies and protocols used in the pipeline, then dive into how you can connect, configure, and test them to ensure your data pipeline is fully functional. It’s going to be a long post, so let’s jump right in.

Every team should be A/B testing

Technical teams want to know the newest, most cutting-edge tools they can implement to give themselves a competitive advantage, whether it’s the latest developer framework or modern CI/CD practices that boost velocity. But there’s one tool from all the way back in the 1920s that can improve any organization, no matter its scale: the randomized, controlled trial—or simply put, experiments.

Network Instability: What It Is, What Causes It, and How to Fix It

Network outages are easy. Something goes down, alarms fire, you fix it, life moves on. Everyone understands a full outage. It's clean, binary, and at least somewhat predictable. Network instability is the opposite of all that. Nothing fully breaks. Nothing fully works. The ping responds. The connection shows active. And yet users are complaining about choppy calls, sluggish apps, and sessions dropping for no apparent reason. You run a speed test, and it's fine.

From Edge to Cloud: How Litmus Edge and InfluxDB Unlock Industrial Intelligence at Hannover Messe

If you’ve spent time in industrial environments, you know the problem isn’t a lack of data. It’s collecting it reliably, contextualizing it, and storing it at scale. Most stacks weren’t built to fight all three battles.

You Don't Need Three Pillars, You Need Single Threads

Last week was a great reminder for me about the challenges of the traditional model of observability defined by the “three pillars” of metrics, logs, and traces. One of the customers I’m currently working with is a large financial institution that has a robust three-pillar implementation. Every critical application ships its telemetry to its cloud-native tool, a central tool, or both.

Route OTel data from AI apps to ClickHouse and Datadog using Observability Pipelines

As organizations continue to heavily invest in AI and build more agentic workflows, their telemetry data volumes can surge quickly, and the associated costs can become unpredictable. To regain control of their data, many AI-forward teams are turning to high-throughput, low-latency pipelines to collect and route data to tools such as OpenTelemetry (OTel) and ClickHouse. But these self-hosted solutions come with drawbacks.

Manage service tracing across hosts with Single Step Instrumentation rules

Single Step Instrumentation (SSI) simplifies Datadog Application Performance Monitoring (APM) by automatically discovering and instrumenting services across a host. For many teams, SSI is the ideal starting point because it helps them achieve full visibility with minimal setup. However, as environments grow, teams often want more control over which services get traced. Auxiliary workloads such as batch jobs and cron tasks might not require distributed tracing.

Modern IT and the Burden of Accountability

The leaders responsible for modern IT environments rarely talk about features first. They talk about responsibility. In conversations at Nexus Live 2025, ScienceLogic’s annual customer conference, executives and architects across healthcare, federal systems, managed services, telecom, and enterprise IT described modernization not as a tooling upgrade, but as an escalation of accountability.

Unified Enterprise Monitoring that Scales

Modernize your monitoring stack with the Progress WhatsUp Gold network monitoring solution in this fast, 30‑minute session. Learn how to replace legacy, multi‑module tools with one unified platform that simplifies operations, boosts visibility and delivers predictable TCO. Discover how NetOps and ITOps teams can reduce complexity and get actionable insights faster by utilizing the WhatsUp Gold capabilities to unify network traffic analysis, logs, configuration and high availability.

Sentry Built AI Dashboards: Monitor Your AI Agents End-to-End

Building AI applications? There's a lot more to monitor beyond errors. With tracing enabled, Sentry's built-in AI Dashboards give you deep visibility into how your agents are actually performing. This video walks through three key dashboard views: You'll also see how to drill from a dashboard widget straight into the trace explorer to pinpoint the root cause of errors, how to duplicate and customize dashboards to fit your needs, and how to set up monitors with alert thresholds - like getting notified if your LLM calls exceed 20 seconds.

Building a Unified Enterprise Observability Strategy Webinar

Join Graham Davies, Technical Product Manager at SquaredUp as he provides a practical guide to breaking down data silos between IT, operations and the business. In this session, Graham digs into why dashboard and tool sprawl is making decisions harder, not easier, and shows you a practical framework for building a single source of truth your whole organisation can rely on.

The Edwin AI Agent Orchestrator: Coordinated Incident Investigation Across the Tools You Already Use

Edwin AI’s Agent Orchestrator keeps incident investigation, context, and response aligned as work moves across tools, eliminating the manual handoffs that slow resolution. Every major incident has two timelines running in parallel. The first is the incident itself—services degrading, users affected, business impact accumulating. The second is quieter and just as costly: engineers switching tabs, re-explaining context to new responders, moving notes from one tool to another by hand.

Smarter Alert Management: Test on Historical Data, Review Transitions, and Preview Silencing Schedules

Alert fatigue usually isn’t caused by one thing. It’s the accumulation of thresholds that are slightly too sensitive, alerts that fire during known maintenance windows, and historical patterns that nobody has the tools to review easily. Fixing it requires better visibility into how alerts actually behave over time, and a way to test changes before they hit production. We’ve shipped three improvements to alerting in Netdata that address different parts of this problem.
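
To make the idea of testing an alert against history concrete, here is a minimal sketch in plain Python. It is not Netdata's implementation or API: the `backtest` function, the sample CPU values, and the `in_maintenance` window are all hypothetical, chosen only to show how replaying past data reveals how often a threshold would have raised and how a silencing window changes that.

```python
from datetime import datetime, timedelta


def backtest(samples, threshold, silenced=lambda ts: False):
    """Replay (timestamp, value) samples and return the timestamps at which
    the alert would have transitioned from clear to raised."""
    transitions = []
    raised = False
    for ts, value in samples:
        firing = value > threshold and not silenced(ts)
        if firing and not raised:
            transitions.append(ts)
        raised = firing
    return transitions


# Hypothetical history: one CPU usage sample per minute.
start = datetime(2026, 1, 1, 0, 0)
values = [50, 82, 60, 85, 95, 96, 70, 83, 55, 97]
samples = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]


def in_maintenance(ts):
    """Hypothetical maintenance window covering minutes 4 and 5."""
    return start + timedelta(minutes=4) <= ts < start + timedelta(minutes=6)


sensitive = backtest(samples, threshold=80)                  # 4 transitions
relaxed = backtest(samples, threshold=90)                    # 2 transitions
relaxed_silenced = backtest(samples, 90, in_maintenance)     # 1 transition

print(len(sensitive), len(relaxed), len(relaxed_silenced))
```

Comparing the three counts shows the trade-off before anything changes in production: raising the threshold halves the number of times the alert raises on this data, and the silencing window removes the one that falls inside known maintenance.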

VictoriaMetrics at KubeCon: Optimizing Tail Sampling in OpenTelemetry with Retroactive Sampling

Last month, the VictoriaMetrics team gave a talk on retroactive sampling at KubeCon Europe 2026. This blog post, a transcript of the session, explains how retroactive sampling significantly reduces outbound traffic, CPU, and memory usage in the data collection pipeline compared to tail sampling in OpenTelemetry.

The End of Manual Instrumentation: Scaling Observability with OTel OBI & Coralogix

Traditionally, achieving deep visibility into distributed systems required significant trade-offs in engineering time. Collecting meaningful application metrics and traces required teams to embed language-specific agents, modify source code, or manage complex library dependencies across every service.

Debugging multi-agent AI: When the failure is in the space between agents

I've been building a multi-agent research system. The idea is simple: give it a controversial technical topic like "Should we rewrite our Python backend in Rust?", and three agents work on it. An Advocate argues for it, a Skeptic argues against, and a Synthesizer reads both briefs blind and produces a balanced analysis. Each agent has its own model, its own tools, its own system prompt. It worked great in testing. Then I noticed the Synthesizer kept producing analyses that leaned heavily toward one side.
Sponsored Post

How to Set Up Raygun's Remote MCP Server in Cursor and Codex

After introducing Raygun's original MCP server and our new remote-first version, the most common question we hear is: "How do I actually set this up and start using it?" This guide covers exactly that, two short videos walking through setup and a real error being solved in both Cursor and Codex.

Infrastructure Cost Visibility: The Missing Link in Modern IT Decision-Making

The expectations placed on infrastructure leaders have shifted in a way that is subtle on the surface but significant in practice, and much of that shift comes down to infrastructure cost visibility. Reliability and performance still matter, but they are no longer the differentiators they once were. Most enterprise environments are stable by design, and uptime is assumed. What has changed is the level of scrutiny around cost and decision-making.

Cloud cost visibility for different teams: Getting it right with custom dashboards

Most cloud cost dashboards are built for one audience. The finance team wants to see totals by department. The engineering team wants to see costs by service. The DevOps team wants to see environment-level breakdowns. When everyone looks at the same dashboard, nobody gets what they actually need. This is where tailored cloud cost visibility starts to matter. When a team can see its own costs clearly, it moves faster, takes ownership, and starts treating cost data like it actually matters.

Best Server Monitoring Tools in 2026 (8 Picks by Use Case)

The best server monitoring tools depend on what you actually need to watch. If you want unified metrics, logs, and traces in one SaaS, Datadog wins. For AI-driven root-cause analysis at enterprise scale, Dynatrace is the pick. If you want monitoring, status pages, and on-call scheduling at a flat monthly rate without per-host or per-seat surprises, Hyperping is the best value. For Windows-heavy networks, PRTG. For hybrid IT with deep plugin coverage, Checkmk. For open-source flexibility, Zabbix.

Icinga as Open-Source MSP Monitoring Software: Multi-Tenant Monitoring for IT Service Providers

If you run a managed service provider, your RMM software is the backbone of daily operations. Remote management, patch cycles, ticketing workflows – it handles the essentials. But if you’re monitoring more than a few dozen client environments, you’ve likely noticed that monitoring and management are not the same thing. And that difference matters more the larger you grow. This post is not about replacing your RMM.

Top 5 Zabbix Dashboarding Tools Compared

Zabbix collects a huge amount of operational data: metrics, alerts, host status, and performance trends. But turning that data into dashboards people actually use is a different challenge. Most teams start with the built-in dashboards. Then the requests start coming, and at that point basic dashboards aren’t enough. Teams start looking for ways to augment Zabbix visualization with tools that improve usability, sharing, and flexibility.

Best Digital Experience Monitoring Solutions: 2026 Buyer's Guide

A website that loads slowly or an application that freezes mid-transaction tells users something about an organization, whether intended or not. Digital experience monitoring exists to catch these moments before they accumulate into lost customers and frustrated employees. We’ll show you how DEM works, the leading platforms available, and how to select the right solution for specific organizational needs.

What Are DNS Records? DNS explained in simple terms | A complete guide

Learn how DNS (Domain Name System) works and why it's called the internet's phone book. This video breaks down the entire DNS resolution process, from cache checks to root servers, and covers every essential DNS record type, including A, AAAA, CNAME, MX, NS, SOA, TXT, PTR, SRV, and CAA records.

Site24x7 MSP: The all-in-one platform for managed service providers

Managing dozens of client environments you don't own, behind firewalls you can't see through, while keeping SLAs intact is the essential MSP predicament. Site24x7 MSP is a cloud-native platform built to solve it. From a single multi-tenant console, monitor servers, networks, applications, and cloud workloads across AWS, Azure, and GCP with agent-based telemetry that catches issues before they escalate. True data isolation and RBAC keep client accounts secure. White-labeled portals, domains, and agents make it look like your platform. AI-powered self-healing workflows resolve incidents automatically.

What Is an AI SRE? And Why Do They Need Live Runtime Evidence?

AI SREs are autonomous systems that handle incident triage, root cause analysis, and remediation by correlating logs, metrics, traces, and code signals. However, as they rely on pre-configured telemetry, the critical execution details of a specific failure, such as variable state and code paths, can often be missed. As a result, they either force users into manual redeploy loops or make inferences from partial data, diagnosing issues using probability rather than proof.

Grave improvements: Native crash postmortems via Android tombstones

Native crashes on Android have always been harder to debug than they should be. The platform has its own crash reporter (debuggerd) that captures the crashing thread, every other running thread, register state, and memory maps into a file called a tombstone. Tombstones have been a part of Android for a long time; in fact, they’ve been there in one form or another since Android's first commit.

N+1 Detection in AppSignal's OpenTelemetry Trace Timeline

N+1 query problems are one of the most common, and quietly damaging, performance issues in production applications. One extra query per record feels harmless in development. At scale, it becomes the reason your response times degrade and your database buckles under load. Today, AppSignal adds N+1 detection to its OpenTelemetry support. When we identify the pattern in a trace, we collapse the repetitive spans directly in the timeline, making the problem immediately visible in the trace itself.
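
As a rough illustration of the pattern itself (not of AppSignal's detection logic), the sketch below uses Python's built-in `sqlite3` module with a hypothetical authors/books schema. The `run` helper records every statement it executes, so the repeated per-record query stands out in the same way repeated spans would stand out in a trace.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO books VALUES (1, 1, 'A'), (2, 2, 'B'), (3, 3, 'C');
""")

executed = []  # every SQL statement issued through run()


def run(sql, params=()):
    """Execute a query and record its SQL text, a stand-in for a trace span."""
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


def titles_n_plus_one():
    """One query for the authors, then one extra query per author."""
    result = {}
    for author_id, name in run("SELECT id, name FROM authors ORDER BY id"):
        rows = run("SELECT title FROM books WHERE author_id = ?", (author_id,))
        result[name] = [title for (title,) in rows]
    return result


def titles_single_join():
    """The same data fetched with a single JOIN."""
    result = {}
    rows = run(
        "SELECT a.name, b.title FROM authors a "
        "JOIN books b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result


executed.clear()
slow = titles_n_plus_one()
n_plus_one_count = len(executed)  # 1 + 3 authors = 4 queries
repeated = {sql: n for sql, n in Counter(executed).items() if n > 1}

executed.clear()
fast = titles_single_join()
join_count = len(executed)  # 1 query

print(n_plus_one_count, join_count, repeated)
```

With three authors the difference is four queries against one; with thousands of records the per-record query is what dominates response time.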

Ephemeral Leaks and Automated BGP Route Leak Detection

Many BGP route leaks reported by automated detection systems are actually brief, low-impact artifacts of normal BGP convergence. Doug Madory examines examples from Cloudflare Radar, Routeviews, and Jared Mauch’s long-running leak detector to show how these “ephemeral leaks” arise, why they usually don’t disrupt traffic, and why they still matter for routing security.

What's New in InfluxDB 3 Explorer 1.7: Table Management, Data Import, Transforms, and More

InfluxDB 3 Explorer 1.7 is a step forward for anyone who wants to manage their time series data without constantly switching between the UI and a terminal. This release adds table-level schema management, the ability to import data from other InfluxDB instances, and a new Transform Data section to reshape your data, all within the Explorer UI.

The Shift Toward Autonomous Enterprises

In our previous post, Navigating the Complexities of Scaling AI in Enterprise Operations, we explored the “cost–human conundrum”, balancing the promise of automation and the realities of economics, skills, and governance. That discussion highlighted a critical inflection point: scaling AI is not just a technical challenge, but an organizational one.

Building Agent-Friendly CLIs - What we learned at Checkly

Building Agent-Friendly CLIs: Why Your AI Agent Already Loves the Checkly CLI. Stefan explains why products, docs, and CLIs must be AI-ready as coding agents rapidly become primary users of the Checkly CLI, and outlines key CLI features for agent workflows. He demos how an agent initializes a project-tailored Checkly setup from scratch without any human intervention, and shows how agents can fully automate the incident life cycle, from resolution to status page communication.

Storytelling as Strategy: DEX Strategy 1:1 with Laura Reeves

In today's episode, Tom is joined by Senior Client Director Laura Reeves for a wide-ranging conversation on storytelling as the defining skill in digital employee experience. From her “squiggly line” career journey across marketing and client leadership to the evolution of DEX itself, Laura explores how the role of IT has shifted from fixing issues to shaping strategic narratives. They discuss the impact of the pandemic, the rise of experience-led organisations, and why the most successful professionals are those who can connect data to meaning.

What's New in WhatsUp Gold 2026.0

Watch this video to learn about the features included in version 2026.0 of WhatsUp Gold. Find more information in the 2026.0 release notes. For all your Community news, technical content, and access to all things WhatsUp Gold, check out our Community Hub. You'll also find our Forum for questions about our platform and sharing with other Community users.

When AWS us-east-1 Fails, Much of the Internet Fails With It

There are cloud outages, and then there are us-east-1 outages. That distinction matters because failures in AWS’s Northern Virginia region rarely feel like ordinary regional incidents. They tend instead to expose something larger and more uncomfortable: too much of the modern internet still behaves as though one place is an acceptable concentration point for infrastructure, control, recovery, and communication. When us-east-1 goes wrong, the problem is not only that workloads fail.

Why IncidentHub's Alerting is Better than Other Status Page Aggregators'

IncidentHub tracked 48,000 SaaS and Cloud outages in 2025. The average organization depends on 100+ SaaS apps, making third-party vendor monitoring a crucial aspect of risk management and business continuity for almost all modern organizations. Better SaaS outage alerting is about monitoring the right parts of your third-party services, and routing alerts to the right people at the right time.

AppSignal MCP Now Supports OAuth - and GitHub Copilot

When we launched AppSignal MCP in beta, OAuth was on the roadmap but not yet shipped. We were issuing static bearer tokens — enough to connect Claude Desktop, Cursor, and Windsurf, but not the one-click install path in the MCP Registry, and not GitHub Copilot's recommended setup. That's fixed.

The 9 Application Performance Metrics You Need to Measure and Why

The tension between shipping speed and application performance has not changed much since this post was first published in 2020. What has changed is how quickly a team can detect, diagnose, and fix a problem. That difference is significant enough to warrant a revisit. The scenario from the original still plays out every week. Sales brings a priority feature that might degrade performance for some customers. The developer ships it and watches what happens.

Smart Home Care: How to Prevent Structural Damage Before It Costs You Everything

Your home is quietly working against you, sometimes for years, before the damage becomes impossible to ignore. Water finds its way behind drywall. Mold colonies establish themselves in crawlspaces you never visit. Foundations shift incrementally until one day, they don't shift back. For homeowners who genuinely care about smart home structural damage prevention, early action isn't a luxury; it's the foundation of everything else.

5 Best Website Monitoring Tools in 2026

The five best website monitoring tools in 2026 are Hyperping (all-in-one monitoring with on-call and status pages), Better Stack (monitoring plus logs and traces), UptimeRobot (budget-friendly with a generous free tier), Uptime.com (enterprise SLA reporting and synthetic monitoring), and Datadog (large-scale infrastructure monitoring). I tested 15 tools over three weeks, measuring check speed, alert accuracy, integration quality, and real-world pricing at different scales.

The Trust Layer: Why Enterprise AI Needs a Gateway Before It Needs More Models

Enterprise AI does not have a model problem. It has a trust problem. Before organizations invest in larger models or additional agents, they need a control layer that governs how those agents operate inside production systems. Without that layer, autonomy does not scale. If you talk to any enterprise leader right now, you’ll hear the same question.

Tracing a Slow Request Through Your Django App

Slow endpoints are difficult to detect because they don’t fail. They simply get slower and slower. Average latency may look fine, but that can be misleading. That’s why we need to look at other values, like p90 and p95, which often reflect what’s really going on. For example, p90 is the latency that 90% of requests stay at or below, so it marks where the slowest 10% begin, and p95 does the same for the slowest 5%. When these values increase, users start experiencing delays.
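
A small worked example shows why the average can mislead. The sketch below uses only the Python standard library; the latency values are hypothetical, and the `percentile` helper uses the nearest-rank definition, which is one of several common ways to compute percentiles (monitoring tools may interpolate differently).

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical request latencies in milliseconds: 17 fast requests, 3 slow ones.
latencies_ms = [100] * 17 + [1500, 2000, 3000]

average = mean(latencies_ms)          # 410
p50 = percentile(latencies_ms, 50)    # 100
p90 = percentile(latencies_ms, 90)    # 1500
p95 = percentile(latencies_ms, 95)    # 2000

print(f"avg={average} p50={p50} p90={p90} p95={p95}")
```

Here the average of 410 ms describes no request that actually happened: most requests take 100 ms, while p90 and p95 expose the 1.5 to 2 second delays that a minority of users experience.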

The AI Zero-Day Wave Is Here. Is Your Logging Infrastructure Ready?

Last week, the cybersecurity industry received a signal it cannot afford to ignore. Anthropic announced Claude Mythos Preview: a general-purpose frontier AI model that, without any explicit training for the task, autonomously discovered and fully exploited zero-day vulnerabilities across every major operating system and web browser. Not theoretical capabilities.

User Feedback to Pull Request in Minutes with Cursor + Sentry

Cursor Automations + Sentry Triggers: go from user feedback to a pull request automatically. See how to set up an end-to-end workflow that turns feedback into code changes, posts the PR to Slack, and keeps your team in the loop. In this video, we walk through a real-world example using Sentry Docs. A user submits feedback through a widget on the docs site, it lands in Sentry as an issue, and when assigned, a Cursor Automation kicks off. The automation reads the feedback, validates it, generates a PR against the repo, and posts the link in the relevant Slack thread. No manual work required.

Fewer Tools, Faster Fixes: A Practical Guide to Observability Consolidation

Most observability stacks aren’t designed, they accumulate. A logging tool here, a tracing platform there, and before you know it you’re managing rising costs and a setup that ultimately slows down your team. And you’ve moved further away from actually solving problems for your users.

Next.js Overview Dashboard: Monitor Performance Beyond Errors

Building with Next.js and using Sentry? Our team put together a dedicated Next.js Overview Dashboard that gives you a full picture of your application's health, not just errors. Out of the box, the dashboard covers page loads, API latency, issue counts, performance scores, rage and dead clicks, and slow SSR. Since Next.js runs on both client and server, you get a breakdown of client transactions, server transactions, and your SSR file tree all in one place.

Offline evaluation for AI agents: Best practices

If you’re building LLM-powered applications and agents, you’ve probably asked yourself: “How do I know if my changes actually made things better?” You can tweak prompts, adjust temperature settings, or try different models, but it’s not always easy to validate whether version B’s response is better than version A’s. Most teams fly blind in preproduction and rely on user feedback to see how well their application works in the real world.

TV Mode: Put Your Dashboards on the Big Screen

One of the most common requests we’ve gotten since launching custom dashboards is deceptively simple: “How do I put this on a TV?” Teams want their dashboards on wall-mounted screens in NOCs, war rooms, and open office spaces. The dashboard is already built. The data is already there. They just need a way to display it on a screen that nobody is logged into, without exposing the full Netdata Cloud interface. TV mode does exactly this.

Grafana Alerting: Respond faster and get situational awareness with alert enrichment in Grafana Cloud

Alerts are meant to help teams respond quickly to problems, but too often they arrive without enough context to be immediately useful. An alert that says “CPU usage is high” still leaves the on-call engineer asking critical follow-up questions: Which service? Which environment? Where do I look next? Validating the alert and triaging the situation is the first step for every engineer. It's a manual step that takes time, extending every potential incident.

ICYMI: Is This Code Worth Running? Here's How to Know

Over the last three months, we’ve been exploring what about software development and observability changes with AI, and what doesn’t. Our conclusion: these five principles will still remain true, even when 90% of the code is AI-driven. The agentic AI space is moving fast. Models are improving, context windows are expanding, and the ways people build and operate agents are changing so fast that any thoughts we share could feel dated by the time you read this.

Top 5 ServiceNow Dashboarding Tools Compared

ServiceNow holds a wealth of operational data, but turning that data into dashboards people actually use is a different challenge altogether. Most teams start with what’s available out of the box. Then come the requests, and at that point dashboarding stops being simple. It then has to be “augmented” with easy shareability, ease of use, contextualization and hierarchy.

Stop Wrestling With Complex Website Monitoring Dashboards

In the race to provide full-stack visibility, many modern SaaS platforms have inadvertently created a new problem: information overload. High-end enterprise solutions are designed for companies with dedicated Site Reliability Engineering (SRE) teams that spend their entire day inside a dashboard. But for many businesses, this level of granularity is a distraction. The real question isn’t whether a tool is powerful; it’s whether it fits the everyday needs of your team.

JSON Jiu Jitsu: Has JSON Parsing Got You in a Chokehold?

From malformed fields to endlessly nested objects, JSON logs can feel like they’re trying to submit your SIEM. In this technical session, we’ll demonstrate how to turn that chokehold into a clean takedown using Graylog’s parsing, normalization, and enrichment capabilities. Whether you’re a SOC analyst tired of regex wrestling or an admin looking to streamline onboarding, you’ll leave with practical techniques to make messy JSON your sparring partner, not your opponent.

How to Monitor a Shopify Store with Playwright and Checkly

This is a guest post by Vince Graics, Staff QA Engineer at World of Books. If you're running a Shopify storefront and want reliable synthetic monitoring, you'll hit a wall. Shopify's bot detection doesn't care that your headless browser is friendly; it sees datacenter IPs and acts accordingly. Cart API calls get hit with 429 rate limits, Cloudflare challenge pages pop up mid-check, and you're left wondering whether the bug is in your code or in the platform fighting you.

From Stack Trace to Probable Cause: AI Root Cause Analysis Is Here

You know the drill. An error fires, you get the stack trace, and then you spend the next 45 minutes tracing it backward through four services, two config files, and a deploy that happened three hours ago. You eventually find the root cause, but the path to get there was manual, slow, and entirely dependent on how well you already knew the codebase. We built AI-powered root cause analysis (RCA) for that kind of slog.

A faster way to pinpoint performance bottlenecks: Using Profiles Drilldown with Grafana Cloud Knowledge Graph

When you identify CPU or memory spikes in your services, it’s critical to understand why they’re happening. But switching between tools or crafting complex queries can slow you down when trying to pinpoint a root cause. This is why we’re excited to share that Profiles Drilldown, an application that lets you easily explore profiling data through an intuitive, point-and-click interface (no queries required), is now integrated with Grafana Cloud Knowledge Graph.

Kubernetes Monitoring Helm chart v4: Biggest update ever!

The Kubernetes Monitoring Helm chart is the easiest way to send metrics, logs, traces, and profiles from your Kubernetes clusters to Grafana Cloud (or a self-hosted Grafana stack). And version 4.0 is the biggest update the chart has ever received. Representing nearly six months of planning and development, it's designed to solve real pain points that users have hit as their monitoring setups have grown.

How to manage synthetic monitoring checks as code with Terraform and Grafana Cloud

As teams scale, managing synthetic monitoring checks manually in the UI becomes difficult and error-prone. When you're dealing with dozens of checks across multiple environments, teams experience inconsistent configurations, lack of version control, and difficulty tracking changes.

Putting FinOps theory into practice with SquaredUp

The public cloud has revolutionized IT by making infrastructure on-demand, scalable, and self-service. However, this convenience comes at a price. In the cloud, engineers can instantly spin up resources and spend company money with the click of a button or a line of code, bypassing traditional procurement and finance approval processes.

Optimizing the OpenTelemetry Python SDK for LLM Workloads

Agentic workloads thrive with precision tooling. Just like developers, they need the rich context, high cardinality, and fast feedback loops that allow them to ask exploratory open-ended questions of their code. But instrumentation is costly, and from the dawn of software, developers have tried to do the most possible with the least amount of resources.

Top 6 AI SRE Tools and Why Runtime-Grounded Reliability Is the New Standard

AI SRE tools accelerate incident detection, root cause analysis, and remediation across distributed production systems. They ingest telemetry signals, including logs, metrics, traces, alerts, and deployment history, to correlate anomalies, narrow fault domains, and reduce manual triage. This guide breaks down the top AI SRE tools in 2026 and helps you choose the right one based on your team’s biggest bottleneck, whether that is faster triage, deeper root cause analysis, or runtime-level validation.

OpenTelemetry Project Updates from KubeCon EU '26 in 10 Minutes | The Road to Graduation

Catch up on the latest OpenTelemetry project updates from Observability Day Europe. This session covers recent stability milestones, new tooling, and what's in progress across the OTel ecosystem.

New Custom Dashboards: Metrics, Logs, Live Commands, and More in a Single View

Custom dashboards in Netdata have always let you pull charts together on-the-fly into a single view. That’s useful, but it’s also limited. In practice, when you’re running an incident or reviewing a service, you don’t just want charts. You want to see the output of top alongside your CPU metrics. You want slow query logs next to your database latency charts.
Sponsored Post

HIMSS 2026: The Future of Healthcare IT Operations Is Increasingly Autonomous

HIMSS 2026 made something clear: healthcare is no longer discussing digital transformation as a future-state goal. It is now dealing with the operational reality of having already become deeply digital. Conversations around HIMSS 2026 consistently pointed back to the same pressure points: AI adoption, cyber resilience, interoperability, and infrastructure modernization. Together, they reflect a healthcare environment managing more systems, more dependencies, and more risk than ever before.

Claude outage April 2026: what happened and how it was detected early

On April 9, 2026, Claude experienced a widespread but inconsistent outage that left many users unable to access or interact with the service. StatusGator detected the issue early and sent an Early Warning Signal 59 minutes before the provider officially acknowledged the outage. This incident highlights how early detection can provide critical lead time when official status pages lag behind real user impact.

In the Age of AI, Operational Memory Matters Most During Incidents

Artificial intelligence is making software easier to produce. That much is already obvious. Code that once took hours to scaffold can now be drafted in minutes. Boilerplate, integration logic, tests, refactors and small internal tools can be generated with startling speed. In some cases, even substantial pieces of implementation can be assembled quickly enough to make older assumptions about software effort look dated. It is tempting, then, to conclude that the hard part of software is receding.

Four Open-Source Developer Tools for Hyperping, Built by Develeap

Develeap, a DevOps consultancy, has been using Hyperping to manage monitoring across 57 tenants. That real production usage led them to build a set of open-source tools that extend Hyperping into the infrastructure-as-code, Python, and observability ecosystems. The result is four interconnected projects, each driven by a concrete operational need.

Manage Hyperping with Terraform: Community Provider by Develeap

If you manage more than a handful of monitors, you have probably wanted to define them in code rather than clicking through a dashboard. Terraform is the standard tool for that in the infrastructure world, and now there is a Terraform provider for Hyperping. Develeap, a DevOps consultancy, built this provider while managing monitoring for 57 tenants at scale. They needed infrastructure as code for monitors, status pages, and incidents, so they built it, tested it in production, and open-sourced it.

Beyond the Dashboard: Selector's Patented Approach to Conversational Observability

For years, IT operations teams have been trapped in a frustrating paradox: the data they need to solve critical issues is right at their fingertips, yet entirely out of reach. Accessing it requires engineers to master complex, platform-specific query languages, dig through endless layers of dashboards, and hunt for the exact visualization that holds the answer. Under the intense pressures of modern speed, scale, and complexity, this rigid model is breaking down.

The Real Path to AI Automation Starts With Less Fragmentation

Fragmentation limits AI automation because context is split across systems, forcing humans to bridge the gap. Most IT environments are fragmented by design. Observability data lives in one set of systems, investigation happens in another, and execution sits behind separate tools with their own ownership and controls. During an incident, context does not move with the work.

Network Monitoring Tools in 2026: How to Choose the Right Platform

Effective network monitoring requires path validation, not only device polling. Traditional Network Monitoring System (NMS) tools were built for static networks, not today’s hybrid reality. You poll devices, check interface counters, and still struggle to explain why users complain about latency. Traffic moves across SD-WAN architectures, cloud routing layers, and public internet paths that device metrics never capture.

The History of AI in IT Operations: How We Got to Autonomous IT

Autonomous IT is the result of a long operational evolution, from static monitoring and rule-based automation to AIOps and now to systems that can increasingly diagnose, prioritize, and act within defined guardrails. Autonomous IT gets talked about like it appeared out of nowhere. As if someone flipped a switch and suddenly systems started managing themselves. The reality is far less dramatic and far more instructive. What we’re seeing today is the result of decades of incremental progress.

The Runbook Problem: How AURA Documents What Teams Don't Have Time to Write

Runbooks are rarely missing because teams don't value them. They're usually missing because incident response, follow-up, and platform work compete for the same limited time. By the time an issue is resolved, the knowledge is fresh, but the window to document it is already closing. That gap creates familiar failure modes: over-reliance on senior engineers, slower handoffs, and less confidence for whoever is on call next.

Tech Talk | AI Agents in O11y Cloud

Transform reactive incident response with Splunk’s troubleshooting agents, designed to drastically reduce mean time to identify and resolve issues. This session demonstrates how a multi-agent approach empowers teams of all skill levels to pinpoint root causes, prioritize issues by business impact, and prevent future outages. Tech Talk sessions offer insightful and valuable deep-dives for any technical practitioner.

Alert Acknowledgement: Mark It as Seen, Keep Working

If you’ve ever opened the alerts tab during a busy period, you know the problem. There are alerts you’ve already looked at, alerts someone on your team is handling, and alerts that fired on a known issue that’s being worked on. They all sit together in the same list alongside the new ones you haven’t seen yet.

The Best SKILL.md Is the One You Never Update - Meet Checkly's CLI

Most agent skills are static — frozen documentation snapshots that go stale the moment APIs change or flags get deprecated. Checkly does it differently. Our SKILL.md is just 100 lines of CLI pointers. No baked-in docs. Your coding agent learns what it needs, when it needs it, straight from the Checkly CLI.

How We Do Support at Scout

Today, we are taking a break from your regularly scheduled technical programming to talk about support. Here at Scout, we consider support one of our differentiators, and even as we adopt AI as a human multiplier behind the scenes, we are committed to keeping it real on the human-interaction side. It will be a long time, if ever, before you reach out to us and get a response from an AI agent. Would it be cheaper? Sure, but it isn’t up to our standards, and we won’t compromise on that.

Nine Smart Ways To Fix Revenue Leakage Fast

Revenue leakage is the unseen loss of revenue due to process mistakes, inefficiencies, or missed opportunities. Any organization can experience it, whether small or large, in any field. Plugging such leaks quickly improves the bottom line and frees up resources for continued growth. The following nine methods will help you quickly correct a loss of income and restore your finances to where they once were.

Top tips: Not all your thoughts are yours; here's what to do about it

Top tips is a weekly column where we highlight what’s trending in the tech world and share ways to stay ahead. This week, let's look at a few ways you can make your thoughts your own in this era of information overload. Have you noticed how you think about life decisions, current affairs, and spending patterns? Why do you think a certain way? Is it your upbringing, the media, or the internet?

Heroku vs AWS

Heroku vs AWS: these cloud platforms represent fundamentally different approaches to application cloud hosting. The decision between them often determines whether your team ships features in hours or spends days configuring infrastructure. Both platforms represent different philosophies in cloud computing, with Heroku prioritizing developer experience while AWS maximizes infrastructure control.

Spending More, Seeing Less: How Indexing Limits Capital Markets Visibility

Capital markets systems don’t scale linearly. A macro event, an earnings release, a sudden liquidity shift, and telemetry volume doubles in seconds. In most observability platforms today, that spike means one thing: every byte gets written to a high-cost index before a single query can touch it. There’s no middle ground. You pay full indexing cost for the compliance log that no one queries for six months, the same way you pay for the execution trace you need right now.

Sample AI traces at 100% without sampling everything

A little while ago, when agents were telling me “You’re absolutely right!”, I was building webvitals.com. You put in a URL, it kicks off an API request to a Next.js API route that invokes an agent with a few tools to scan it and provide AI generated suggestions to improve your… you guessed it… Web Vitals. Do we even care about these anymore?

The Path to AI-Ready Operations Begins with Truth

Enterprises expect AI to improve how they operate, yet many underestimate the level of clarity required for intelligent systems to perform reliably. AI-assisted operations demand input signals that are accurate, consistent, and interpretable. They require a unified understanding of how services behave, how disruptions originate, and how decisions influence downstream outcomes. This level of coherence is impossible without operational truth.

Uncertainty and Change Are Everywhere in Software Development

If you’re like everyone else who works in software development, it’s a good bet that almost every single thing that you thought you knew about your business and engineering has changed as a result of the advent of modern LLMs. How should you respond to these changes? How should you change how you and your team develop software?

Setting Up AppSignal for a Node.js App Running on Kubernetes

Monitoring in Kubernetes can seem like opening an airplane's black box. Everything happens silently, behind the scenes, hidden away. This can be a lot of trouble, as you don’t really want to dig through a bunch of logs at 3 a.m. after a call letting you know that a certain feature is broken. You want something direct, concise, and helpful.

Introducing OrionIQ: The End of Manual Observability

OrionIQ is Logz.io’s new agentic observability platform designed to move teams from detecting issues to resolving them automatically. As AI accelerates software development, operations remain manual: engineers still wake up at 2 a.m. to investigate alerts and rebuild context. OrionIQ uses AI agents to analyze real-time telemetry, investigate incidents, identify root causes, and take action across systems.

From Insights to Dashboards: Customize Your Sentry Experience

You fixed all the errors. But the job's not done. If you're using tracing, logs, metrics, or other Sentry products, there's a wealth of performance data scattered across your application just waiting to be surfaced. In this video, we walk through the move from Insights to Dashboards: giving you full control over how you view, filter, and customize your monitoring setup. Check out Dashboards in your Sentry organization and let us know what you think!

Nothing But [Inter]net 2026 Highlights

We put the internet’s loudest developers in one room at Chase Center. On purpose. Tune in for highlights from the event from: Wes Bos and Scott Tolinski: hosts of your favorite developer podcast, Syntax. Taught half of you how to actually use React. Teej and ThePrimeagen: sell coffee through the terminal, have over a million YouTube subscribers and even more opinions on memes.

Instrument and monitor Boomi integration flows with OpenTelemetry and Datadog

Boomi is an Integration Platform as a Service (iPaaS) used by thousands of organizations to connect applications, data, and workflows across cloud and on-premises environments. Business-critical processes, from order fulfillment pipelines to customer data synchronization, depend on Boomi Atoms and Molecules running reliably.

Not all index scans are equal: How we cut query latency by over 99%

When engineers investigate SQL queries, they normally think of index scans as a fast and efficient step in the query’s execution plan. When executed correctly, they fetch only the relevant rows from your table as opposed to sequential scans that read the entire table, reducing latency and query costs. However, just because an execution plan uses an index scan doesn’t mean that the scan is fast or performant.

Platform engineering metrics: What to measure and what to ignore

Platform engineering teams have access to hundreds of metrics, yet over 40% of platform initiatives cannot demonstrate measurable value within the first year. Teams that cannot quantify their impact fail to obtain executive sponsorship, risk being defunded, and ultimately, face deprecation. To accurately calculate a platform’s ROI, platform engineering teams need to differentiate between signals that measure platform effectiveness and those that should be used solely for investigative purposes.

Integrate Recorded Future threat intelligence with Datadog Cloud SIEM

Recorded Future provides real-time threat intelligence about indicators of compromise (IOCs), including malicious IP addresses, domains, and vulnerabilities. It also adds context on threat actors and campaigns to help security teams understand which signals represent real risk and prioritize their responses accordingly.

OpenTelemetry Collector + Uptrace: From Zero to Your First Traces

Learn how to set up the OpenTelemetry Collector and connect it to Uptrace for distributed tracing, metrics, and logs. This step-by-step guide walks you through installation, configuration, and sending your first telemetry data — perfect for beginners and anyone looking to level up their observability stack.

VirtualMetric DataStream - Turn Chaos Into Clarity

Security teams lose time and detection quality to the same root cause: inconsistent, noisy, poorly structured data. VirtualMetric DataStream is a security data pipeline platform that fixes the data layer, so your SIEM, data lake, and analytics tools get clean, normalized, actionable telemetry. The result: reliable security telemetry, faster threat correlation, and stronger detections across your entire stack.

VirtualMetric DataStream: Full setup from scratch in 14 minutes (v1.8.0)

From free trial signup to live security telemetry flowing into Microsoft Sentinel — this demo covers the full DataStream setup end to end, in under 14 minutes. No pre-built environment, no shortcuts. Watch the step-by-step tutorials.

How In-Vehicle Technology Is Making Driving Safer and Simpler

Modern vehicles are no longer just modes of transportation. They have evolved into intelligent systems designed to make driving safer, more efficient, and far less stressful. With rapid advancements in in-vehicle technology, drivers now benefit from features that actively prevent accidents, simplify navigation, and enhance overall control behind the wheel.

Episode 9 - AI, Enterprises, and the Law

In this episode of The Intelligent Enterprise, host Tom Stoneman takes us inside the different ways that AI is being utilized in the practice of law. Tom is joined by Vintee Mishra, an attorney who’s currently part of the Commercial Contracting Organization at Navy Federal Credit Union, and has previously occupied supporting roles at Tata Consultancy Services, Cisco, First Technology Credit Union, and Moody’s Analytics.

Where Most Operational Waste Comes From and How AI Automation Cuts It

Most operational waste comes from fragmented workflows rather than individual performance constraints. An incident begins long before any fix is applied. Alerts trigger, tickets open, and engineers start reconstructing context across systems that were never designed to operate as one. Logs, metrics, past incidents, and runbooks sit in separate tools, each requiring manual lookup, interpretation, and validation before any decision can be made.

Four Modern PHP Features That Show How Far the Language Has Come

PHP has evolved over the years into a more reliable, faster, and more refined language. With the release of PHP 8, which brought many features (named arguments, union types, attributes, constructor property promotion, match expressions, the nullsafe operator (?->), etc.) and optimizations (the JIT compiler), PHP became faster and cleaner. Later versions of PHP 8 add many more improvements and interesting features. Here are the four features I now rely on and wish PHP had introduced much earlier.

2026 Product Roadmap

Over the past 11 years, we have focused on one problem: ensuring complex conversion journeys work reliably in the real world. Across ecommerce platforms, travel services and large consumer websites, these journeys are where revenue is generated and where reliability matters most. In 2026, our focus sharpens further. The theme for the year is simple: Higher signal trust. Deeper intelligence. Stronger operational resilience.

HTTP Monitoring: What Is It and How to Do It

When users complain that an app or website is slow, the first question is always the same: Is it the network or the application? HTTP monitoring gives you the answer. Network metrics like latency and packet loss tell you what's happening on the wire. But they don't tell you whether users are actually feeling the impact. HTTP monitoring closes that gap.

Closing the Mobile Visibility Gap: Extending DEX to Mobile

In 2026, I think it’s safe to say that most mobile devices in enterprise organizations aren’t purchased just for their ability to make calls. And for millions of employees, especially frontline workers, their primary device isn’t even a laptop anymore - it’s a smartphone or tablet. Yet, mobile device insights have largely remained a blind spot for IT.

The Art of Scaling: How to Determine the Right Number of Apache Kafka Partitions

Apache Kafka partition count isn't just a number—it defines parallelism, ordering, and operational complexity. Learn the formula to balance throughput requirements with maintenance costs, avoid common anti-patterns, and find your 'Goldilocks' number for production-ready performance.

Progress WhatsUp Gold 2026.0: Proactive Visibility. Trusted Security.

Modern networks are more complex and more exposed than ever. From hybrid infrastructure and distributed devices to expiring certificates and tightened security requirements, network and IT teams are under constant pressure to keep everything running smoothly while reducing risk. Progress WhatsUp Gold 2026.0 is built for that reality.

Introducing CertKit: SSL Certificate Automation for the Rest of Us

We’ve been quietly solving a problem that most teams haven’t hit yet, but they’re about to. SSL certificate lifetimes are dropping to 47 days. If you’re managing certificates manually today, you have a very short window before that becomes a real operational problem. We know, because it happened to us first.

Business metrics in Grafana Cloud: Get an AI assist to help securely analyze your data

For today's modern businesses, the data landscape demands security and flexibility. You need to connect your observability platform to rich, proprietary datasets that often reside in private networks without compromising security or managing complex network infrastructure. You may also face an extra layer of complexity in order to effectively query and visualize that data. Luckily, modern artificial intelligence tools have made these previously complicated processes much simpler.

Telegraf Overview - InfluxData's Metric Collection Agent

Telegraf is InfluxData’s open source agent for collecting metrics, and it’s used everywhere. In this quick overview, Product Manager Scott Anderson shares what makes it stand out, from more than 5 billion downloads to a huge plugin ecosystem with 400+ integrations. It’s also built by a strong community, with over 1,300 contributors and thousands of GitHub stars. That momentum is a big part of why Telegraf keeps growing.

Overview of Cloud Status Check

In this video, we walk you through Uptime.com's Cloud Status check feature, designed to monitor the status of common cloud services within your technology stack. Learn how to configure a Cloud Status check, select third-party services, choose which components to monitor, and understand how the Down state works when multiple components are affected. We also cover how to opt out of maintenance notifications, view incident history, and organize checks with tags.

Expanded Chart View: Investigate Without Leaving the Chart

Charts in Netdata have always been interactive. You can zoom, pan, select time ranges, and see per-second granularity across thousands of metrics. But when you spotted something interesting, the next steps usually meant leaving the chart: opening another tab to check a related metric, navigating to the correlation tool, or pulling up a different time range for comparison. The investigation workflow lived outside the chart, even though the chart was where the investigation started.

7 Best Network Monitoring Software in 2026 and Beyond

More data leads to more complex networks, and the way to optimize a complex network is comprehensive network monitoring software. Many organizations suffer performance lapses because they don’t know what is wrong with their data network. They find themselves in a loop of missed opportunities caused by a suboptimal network monitoring solution. This leads to the question: are we praising network software now?

14 Best Service Desk Software Tools Ranked by IT Pros (2026 Guide)

Choosing the best service desk software is one of your most important investments as a service-focused organization. We have seen how the right tool can transform IT support operations—and how the wrong one can create more problems than it solves. While comparing IT service desk tools, I discovered something surprising: pricing differences are staggering. Zendesk starts at $55 per agent/month, but alternatives like Desk365 begin at just $12.

Top 12 IT Asset Management (ITAM) Tools & Software for 2026

“Guys, where’s the invoice for that firewall upgrade last quarter?” asked Jason, the IT Operations Lead, during a surprise internal audit. Stella from procurement replied, “I think it’s on one of the shared drives… or maybe with Finance?” Meanwhile, Roman, the System Admin, had no idea who was using half the software licenses in the network. This is classic IT asset chaos: too many tools, scattered records, and no clear visibility.
Sponsored Post

How to Monitor AWS Status: Don't Wait for the Health Dashboard

The AWS Health Dashboard is slow, sometimes broken during major outages, and only tells you what AWS admits is broken. Real SREs layer three monitoring sources: AWS-native tools (CloudWatch, EventBridge), third-party aggregators (IsDown), and internal synthetic checks. Skip the vendor status page as your primary alert source.

The future of SaaS is hazy and no one really knows what comes next

There was a time when SaaS felt predictable. You built something useful, scaled it, and charged a subscription. If the software did well enough, growth followed. It wasn’t easy, but it was clear. There was a sense of direction, a playbook that most companies seemed to follow, tweak, and succeed with. Ironically enough, the same playbook gave birth to numerous tech giants as we know them today. Now, that clarity feels different. Not entirely gone, but blurred. If you work in SaaS, you can feel it.

Traditional Automation vs. AIOps vs. Self-Healing Ops vs. Autonomous IT Explained

Autonomous IT becomes real when teams move from insight to governed action. Most IT teams still operate on an alert-first, human-coordinated model. When something breaks, alerts fire across multiple tools, engineers get pulled in, and the first part of the response goes to figuring out who owns the problem, which signals matter, and how far the impact has spread. Containment comes after that. That sequence made sense in slower, more isolated environments.

Query fair usage in Grafana Cloud: What it is and how it affects your logs observability practice

In Grafana Cloud we use a simple yet generous formula that lets you query up to 100x your monthly ingested log volume in gigabytes for free. This works for the vast majority of our customers, but if you aren’t careful and strategic with your usage, you could find yourself with an overage bill.

How to Set Up Your Monitoring System Alerts

You could have the most detailed metrics displayed on your dashboard, but if no one gets notified when things break, you’re just collecting data. Alerts help turn this passive monitoring into an active response. It’s like they tell you, “Hey, your error rate just spiked!” or “Your memory usage is through the roof,” even before your users start filing support tickets, or worse, give up on your tool entirely.

AI agent observability: The developer's guide to agent monitoring

Most "agent observability best practices" content reads like a compliance checklist from 2019 with "AI" pasted over "microservices." Implement comprehensive logging. Establish evaluation metrics. Create governance frameworks. Not a single line of code. No mention of what happens when your agent silently picks the wrong tool on turn 3 and you need to figure out why.

Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA

This guest blog post is by Tohn Furutani, SRE Engineer at NTT DATA. Over the past year, the conversation around generative AI has shifted from single-shot use cases—such as summarization, Q&A, and chat interfaces—to agentic AI systems that can make decisions based on context, plan multistep actions, invoke tools, and adapt as conditions change.

New Plugins, Faster Writes, and Easier Configuration: What's New with the InfluxDB 3 Processing Engine

The Processing Engine is one of the most powerful features in InfluxDB 3. It lets you run Python code at the database—transforming data on ingest, running scheduled jobs, or serving HTTP requests—without spinning up external services or building middleware. You define the logic, attach it to a trigger, and the database handles the rest. Since launching the Processing Engine, we’ve been building out both the engine itself and the ecosystem of plugins that run on it.

The Next Phase of Agentic AI

The Enterprise AI Survey conducted by Digitate in collaboration with Sapio Research states that the journey of enterprise automation and AI adoption has evolved significantly. The initial waves focused primarily on improving accuracy, efficiency, and reducing costs. Now, the next phase, Agentic AI, is transforming this shift from mere automation to dynamic collaboration.

The Cost of Operating Without Truth

Enterprises have reached a point where the pace of modernization no longer depends on the number of tools they deploy or the volume of telemetry they collect. Progress depends on whether teams can form a consistent and verifiable understanding of what is happening inside the environment. Many organizations do not realize that the single greatest barrier to modernization is the absence of operational truth.

Practical AI-Enabled Observability for Agents and LLMs

You’re told to “go build agents” without clear guidance on what that actually means, how to do it well, or how to know if it is working. You are not a data scientist. You are a software engineer. In this talk, Datadog AI product leader Shri Subramanian breaks down what changes when you move from building applications to building AI agents, and why familiar approaches like traditional testing and linear delivery fall short. We will explore how agent development shifts the focus from code alone to data, prompts, and evaluation, and why functional reliability matters just as much as operational reliability.

LLM Cost Monitoring with OpenTelemetry

Teams running LLM applications in production face a cost problem that traditional APM tools were never designed to solve. CPU and memory costs are relatively predictable — a web service processing 1,000 requests per second costs roughly the same week over week. LLM API costs are not. A single user session can cost $0.01 or $5 depending on prompt length, model choice, conversation history, and how many retries happen inside your chain.

Top 5 Continuous Monitoring Tools and Why Runtime Context Is the Layer They Are Missing

Continuous monitoring tools track system health, performance, and behavior in real time across production environments. For a deeper understanding of how this fits into modern DevOps practices, see this guide on continuous monitoring and its impact on DevOps. They collect logs, metrics, and distributed traces across the infrastructure and application layers, giving engineering teams visibility into how their systems are running, where anomalies occur, and when something needs immediate attention.

Ep 37: Robbing banks is now a work from home job

In this episode of Masters of Data, we explore how banks and fintech companies have traded friendly neighborhood tellers for data-driven, always-on digital fortresses. We unpack everything from sophisticated phishing schemes and viral TikTok check fraud trends to the AI-powered tools that now handle the fraud detection Shirley the bank teller used to manage through sheer familiarity. We make the case that financial institutions today face more pressure than ever to be trustworthy, secure, and seamless all at once, whether their customers are logging into a sleek app or calling a landline to pay two bills a month.

Why AI Spells the DEATH of Workplace "Coasting": Jacob Morgan returns

Jacob Morgan returns to The DEX Show for another provocative conversation on the future of work, AI, and why 2026 is the year of accountability. Jacob argues that AI is exposing “performative work,” forcing organizations to rethink culture, leadership, and what real value creation looks like. We explore why company culture became too vague, why human judgment matters more than ever, and how leaders can avoid over-relying on AI at the expense of discernment, responsibility, and individuality. It’s a wide-ranging discussion on work, ambition, and the high-stakes reset now unfolding inside modern organizations.

How AI Is Powering the Next Era of IT Operations

AI is redefining the future of IT. In this Nexus Live 2025 keynote, ScienceLogic CEO and Founder Dave Link shares the vision behind Skylar AI, why the industry is shifting toward autonomous operations, and how organizations can move faster, smarter, and more proactively than ever before.

Stop Starting Your Day in a Stack Trace

Most teams triage errors the same way. Check the error tracker in the morning, skim the stack traces, pick the ones that look urgent, start investigating. The rest pile up. By the time anyone gets to the long tail of production errors, the context is stale and the motivation is gone. What if that first pass happened automatically? We’ve been experimenting with a workflow that connects Scout’s error data to AI assistants through our MCP server.

March 2026: IsDown Users Saved 10.5 Hours with Early Outage Detection

In March 2026, IsDown users collectively saved 10.5 hours by receiving outage alerts before vendors officially acknowledged problems. The most significant early detection gave users a 2.3-hour head start when the Federal Reserve's FedACH system experienced issues. This data reveals the persistent gap between when users experience problems and when vendors update their status pages.

New Features: Team Members and Additional Email Recipients

DNS Check now supports two features for Enterprise accounts that make it easier to work as a team: Team Members and Additional Email Recipients. Team Members lets multiple people log in and work with your DNS records using their own credentials. Additional Email Recipients sends notification emails to people who need to stay informed but don't need to log in.

AI Working for You: MCP, Canvas, and Agentic Workflows - Part 2

In our previous post in our series on observability for the agent era, we looked at how Honeycomb provides unique visibility into LLMs operating in your production environment. Now, let’s flip it around and explore how Honeycomb provides observability insights uniquely suited to helping your AI agents rapidly diagnose and fix production issues, and build production feedback into the next round of development.

The Fundamentals: Fast, Deep, and Ready for What Comes Next - Part 3

The previous two posts in this series have looked at some of the use cases Honeycomb customers are implementing to observe LLMs in production and power agentic observability workflows. In this third and final post, we’ll take it back to basics and look at how the fundamental capabilities and infrastructure of Honeycomb provide the comprehensive data and fast performance that makes these use cases work at production scale. AI capabilities built on a weak observability foundation fall apart fast.

End to End Reliability for all your Workloads

Delivering great products to your customers requires a mix of evolution and consistency. To really land with users your product has to be ready to adapt and scale, prioritizing across a mix of customer and business needs. Join experts in reliability, systems engineering, and DevOps as they share real-world examples, true stories of pitfalls, and astounding impact from the experiments they have run. Learn how experienced practitioners handle failure, adapt to scale, and bridge gaps between teams to improve software performance and customer outcomes.

We Know Before it Breaks: Observability-Driven Development

When stakeholders push for faster growth (new markets, new features, newly modernized stack) your engineering model has to change too. At FitnessPassport, the shift from offshore waterfall delivery to an in-house team meant rebuilding not just services, but confidence: legacy systems with weak logging and little visibility made it hard to know whether changes were working and impossible to spot issues before users did. In this talk, Director of Engineering Rob Mitchell will share how FitnessPassport adopted Datadog and used structured logs, metrics, and traces to tighten feedback loops.

From Manual Requests to Self-Serve: Building an Access-Controlled App that Adapts Automatically

Platform teams often end up as the bottleneck for “small” operational asks: add a new button, wire up a workflow, expose one more cloud capability—each change requiring engineering time, reviews, and releases. In this technical deep dive, engineers from the Department of Government Services (Victoria) share the architecture and open source CDK library behind their “Infrastructure Control Panel”: a modular operational enablement app that lets non-technical users interact safely with cloud resources through strong access controls.

Capture and analyze custom heatmaps in Session Replay

Datadog Session Replay heatmaps track where users click, scroll, and engage across your web pages. Each heatmap is overlaid on a screenshot of the page, and that background determines what you can actually analyze. But getting the right screenshot can be tricky. Many UI states are dynamic, rare, or simply impossible to capture from replays, so heatmaps can end up showing the wrong view.

Beyond Maintenance: Why Modernizing Your Messaging Infrastructure is the Ultimate Competitive Edge

Modernizing messaging infrastructure delivers 188% ROI and payback in under 6 months, according to a Forrester TEI study. Move beyond maintenance cycles to unified visibility, AI-driven efficiency, and secure self-service that transforms middleware from bottleneck to competitive advantage.

Top 10 Website Monitoring Tools of 2026

Most website monitoring tools look similar until the first real incident. That is when alert speed, false positives, check coverage, and day-to-day usability matter more than a long feature page. UptimeRobot often comes up early for a reason: it is easy to start with, clear to manage, and focused on the checks many teams need first. Still, it is not the only option worth looking at.

How to check if an item is back in stock?

Are you one of those desperately trying to get your hands on a new RTX 3080, 3070, 3060 Ti, or 3090 in 2021? Or maybe you prefer the new PlayStation 5 or Xbox Series X console, or really any item that’s on pre-sale or hard to get (including that uniquely designed piece of clothing for your girlfriend). If your favorite online store doesn’t have a “watchdog”, we have the best solution for you. How would you know when the item is back in stock? There’s an easy way!

Employee Monitoring Software for the Modern Workplace in 2026

Most managers don't want to spy on their employees. But when your team is spread across three time zones and half of them work from home, knowing what's actually getting done isn't spying. It's just good management. Employee monitoring software has changed a lot in the past few years. It's no longer just about clocking in and out or taking screenshots every 10 minutes. The best tools today help teams work better, not just track whether they're working at all.

VictoriaMetrics March 2026 Ecosystem Updates

Welcome to the March release roundup of VictoriaMetrics Stack, covering key enhancements in VictoriaMetrics and VictoriaLogs. These updates deliver improved UI scalability, enhanced authentication flexibility, improved query performance, and logging tools that streamline observability workflows in production environments.

From alerts to action: Where reliability is actually won

Observability has evolved dramatically in the past decade. The industry has moved from basic uptime checks to full-stack observability (FSO), including metrics, logs, traces, and real user monitoring. Observability tools like ManageEngine FSO can detect anomalies in little time. And yet, outages still last longer than they should. Observability has matured. Response hasn’t. Most IT teams today have the tools to know when something breaks. But knowing is not the same as resolving.
Sponsored Post

How to Centralize Incident Notifications in Slack

Even a brief outage in a critical service can disrupt projects. Customers get frustrated and flood the support team with tickets. What's the solution? Centralizing incident notifications and real-time status alerts in Slack. Many teams already collaborate there anyway. So let's take a look at how teams can streamline service monitoring, alerting, and incident workflows in Slack using integrations, automation, and tools like StatusGator.

The single pane of glass approach to cloud monitoring

Dozens of SaaS services you depend on, from Google Workspace and Slack to Shopify, may experience downtime, partial outages, or degraded performance. Most have their own status pages, APIs, or RSS feeds. Juggling all these sources is exhausting, and many teams suffer from alert fatigue, missed early warnings, and fragmented visibility.

Paris | Observability Unleashed - Boost Your IT, DevOps & SRE Operations

IT environments keep growing more complex. Real-time visibility is no longer optional. On April 14, 2026, Stéphane Estevez, EMEA Observability Market Advisor at Splunk, invites you to Cisco in Paris for an event dedicated to observability, with the Splunk and Cisco teams. On the agenda: AI-assisted observability; integrated data strategies; OpenTelemetry simplified; from data to action, with concrete use cases and live demos; observability for AI and by AI.

KubeCon + CloudNativeCon EU 2026: What We Learned About AI, Observability, and Fast Feedback Loops

Honeycomb was excited to attend KubeCon + CloudNativeCon Europe, where one theme stood out across sessions: as AI reshapes how software is built and run, teams are being pushed to rethink how they understand their systems. Without strong observability and feedback loops, AI can accelerate confusion, misalignment, and operational risk.

The Business Case for AI-Driven Observability in Network Operations

Modern network operations generate an extraordinary amount of telemetry. Metrics, logs, events, topology data, cloud signals, and service context all contribute to a richer picture of system behavior. As environments expand across cloud, data center, edge, and SaaS, the opportunity for operations teams is clear: when that telemetry is unified and understood in context, it becomes a powerful source of resilience, efficiency, and business insight.

Streaming Video Monitoring: How to Detect Playback Issues Before Viewers Leave

Video is the single largest driver of internet traffic worldwide. According to the Sandvine Global Internet Phenomena Report, video accounts for 65% of all internet traffic, with on-demand streaming alone consuming over half of all downstream bandwidth on fixed networks. In the United States, households spend nearly five hours per day streaming content, and 94.6% of internet users worldwide watch online video monthly.

When we say "Observability AI Reckoning," what are we actually talking about?

We’ve spent the last decade collecting more telemetry. Now AI is analyzing it. Here’s the catch: AI needs the full dependency chain to reason correctly. If it sees spans but not storage contention… Services but not Kubernetes scheduling… Frontend metrics but not downstream providers… It will confidently optimize the wrong thing. AI doesn’t lower the need for observability. It raises the standard.

Profiling Java apps: breaking things to prove it works

Coroot already does eBPF-based CPU profiling for Java. It catches CPU hotspots well, but that's all it can do. Every time we looked at a GC pressure issue or a latency spike caused by lock contention, we could see something was wrong but not what. We wanted memory allocation and lock contention profiling. So we decided to add async-profiler support to coroot-node-agent. The goal: memory allocation and lock contention profiles for any HotSpot JVM, with zero code changes. Here's how we got there.

AI Didn't Kill the SDLC. It Made It Harder to See

Whilst AI has compressed the visible stages of software delivery, requirements, validation, review and release discipline have not disappeared. They have been pushed into automation, runtime and governance. The real risk is not that the lifecycle is dead, but that organisations start acting as if accountability died with it.

Send your existing OpenTelemetry traces to Sentry

You spent months instrumenting your app with OpenTelemetry. The idea of ripping it out to adopt a new observability backend is not an option. Sentry's OTLP endpoint means you don't have to. In fact, two environment variables are all you need and your existing traces start showing up in Sentry's trace explorer. Sentry's OTLP support is currently in open beta. This means you can start using it today, but there are some known limitations we'll cover later.

Operational Truth: The KPI Every C-Suite Will Rely On Next

C-suite leaders are redefining how they measure digital performance. Reliability, customer experience, resilience, and cost efficiency still matter, yet these indicators only hold value when they reflect what is actually unfolding inside the environment. Digital ecosystems have reached a level of complexity where small deviations influence outcomes, and leaders increasingly recognize that traditional metrics cannot be trusted without contextual grounding.

BIND 9 CVE-2026-1519: The NSEC3 DoS Vulnerability Putting DNS Resolvers at Risk

On March 25, 2026, the Internet Systems Consortium (ISC) released patches for three vulnerabilities in BIND 9, the most widely deployed DNS server software in the world. The headline flaw — CVE-2026-1519 — carries a CVSS score of 7.5 and is remotely exploitable with no authentication required. An attacker who controls a maliciously crafted DNS zone can trigger the vulnerability by forcing a BIND resolver to process excessive NSEC3 iterations during DNSSEC validation of an insecure delegation.

On-Call Scheduling for Small Teams: Skip the Enterprise Complexity

Updated April 02, 2026. Most on-call guides are written for companies with 50+ engineers, dedicated SRE teams, and budgets for tools that cost $21 per user per month before you even add a second escalation tier. If you have 5 people and a product that needs to stay up, that advice doesn't apply to you. I'm Leo, founder of Hyperping.

Status Page Subscriber Management: Notification Groups, Components, and Templates

Your status page is only useful if the right people get the right notifications at the right time. A page that blasts every incident to every subscriber will train people to ignore your emails, or worse, unsubscribe entirely. A page that notifies too slowly will leave customers finding out about your outages from Twitter before they hear from you. I'm Leo, founder of Hyperping.

KubeCon Europe 2026: OpenTelemetry Recap from Amsterdam

The reason I like writing recap articles is that AIs don’t have enough context to write them for us. You have to be there, in person, listen to sessions, interact in the hallways with the community, and absorb as much new knowledge as possible. That’s what I did last week in Amsterdam at KubeCon + CloudNativeCon Europe ’26. Well, at least I tried to. Let me break down what I consider the most interesting topics from last week.

What's New in InfluxDB 3.9: More Operational Control and a New Performance Preview

We’ve spent the last few months listening to how teams are running InfluxDB 3 in the wild. The feedback was clear: as you scale, you need less “guesswork” and more control. Today’s release of InfluxDB 3.9 is our answer to that. As more teams move InfluxDB 3 into production, our focus has shifted toward the operational experience: how you manage the database at scale, how you ensure it remains secure, and how you provide a seamless experience for users.

Monitor ClickHouse query performance with Datadog Database Monitoring

ClickHouse is widely used for large-scale analytics, but once it is running in production, it can be difficult to understand how query activity translates into resource usage. Engineers investigating performance issues often struggle to determine which queries consume the most memory, run most frequently, or cause spikes in load. In practice, engineers are left querying system.query_log, tailing server logs, and piecing together information after an incident.

How we designed empathetic alert sounds for on-call engineers

Being on call is an essential part of operating reliable distributed systems, but it comes with real human costs such as alert fatigue, sudden wakeups in the middle of the night, and the ongoing anxiety of what the next notification might bring. Many engineers know the feeling: Your phone lights up, a sound cuts through the silence, and your heart rate spikes before you’re even fully awake.

Search and act across Datadog to resolve issues faster with Bits Assistant

Finding the right information across dashboards, monitors, and telemetry sources takes time, even for experienced engineers. When something breaks, it often means figuring out where to start, rebuilding queries, and jumping between metrics, logs, and traces before you can take action. The challenge isn’t a lack of data but the effort required to surface the right information at the right moment.

Understand session replays faster with AI summaries and smart chapters

Datadog Session Replay gives teams a video-like view of what real users experienced in their applications. Engineers rely on replays to connect errors and slowdowns to actual user behavior, while product managers use them to understand friction and improve critical flows. But finding the right replay and the right moment often means manually scanning long sessions without knowing whether they contain relevant signals.

Conversations: Ask Netdata About Anything You're Looking At

Netdata AI can already troubleshoot your alerts and generate Insights reports. What it couldn’t do, until now, was have a back-and-forth conversation. You could get a one-shot analysis, but you couldn’t ask follow-up questions, pull in additional context, or go from a quick question to a full investigation without starting over. We’ve added a conversational layer to Netdata AI.

Distributed Tracing | Debugging your Next.js applications with Sentry

Sometimes a simple stack trace won’t provide enough information for you to debug the issue at hand. There are types of issues that require you to know what happened leading up to the exception. In those cases, reach for tracing. Distributed tracing gives you an overview of every operation that happened during the execution of a certain functionality across your whole stack. Aside from being an awesome debugging tool, it also lets you identify any performance bottlenecks in your application. In this video you’ll learn how to view traces in Sentry and implement them in your Next.js application.

The Hidden Cost of Separate Monitoring and On-Call Tools

Most engineering teams I talk to run at least two or three separate tools for monitoring, on-call, and status pages. UptimeRobot or Pingdom watches the services. PagerDuty pages the on-call engineer. Statuspage.io tells customers what is happening. The dollar cost of this stack is easy to calculate. The hidden costs are harder to see, and they add up faster than the subscription fees.

From Reactive to Proactive: AI-Driven Automation for Shopify Infrastructure Monitoring

Operations teams manage Shopify infrastructure with their eyes half-open most days. You're monitoring system health across multiple layers, responding to alerts when they fire, and hoping you catch problems before customers notice. The whole setup is reactive by design. Something breaks. You get paged. You investigate. You fix it. But here's what most ops leaders don't realize: your Shopify operation generates enough signals to predict problems hours (sometimes days) before they actually occur. The data's there. You're just not analyzing it at the right scale or speed.

The Agent Runtime Needs an Enterprise Brain: Why Fabrix.ai Completes the NemoClaw / DefenseClaw Stack

The agentic AI security stack is taking shape, fast. At GTC 2026, NVIDIA unveiled NemoClaw, an open-source stack that wraps OpenClaw with enterprise-grade privacy controls, local inference via Nemotron models, and the OpenShell sandboxed runtime. Days later at RSAC 2026, Cisco launched DefenseClaw, an open-source governance framework that scans every agent skill, MCP server, and plugin before admission, and enforces block/allow policies at runtime with sub-two-second enforcement.

Five Ways Avantra Makes SAP More Secure

Enterprises use SAP for far more than back-office accounting. Today’s SAP systems are highly integrated and used by thousands of people daily across dozens of departments, and that’s just for a single large enterprise! Because SAP is a central part of business operations, getting SAP security right, and keeping operations durable along with it, has become an essential responsibility for IT teams.

March 2026 Early Warning Signals

March 2026 saw a steady wave of service disruptions across SaaS platforms, developer tools, and infrastructure providers. What stood out wasn’t just the volume of incidents, but how early many of them surfaced. Using StatusGator’s Early Warning Signals, outages were often detected well before providers acknowledged them, sometimes by minutes, and in several cases by more than an hour.

Mirroring Icinga Packages in Air-Gapped and Restricted Environments

When hosting in a secure or corporate environment, Internet access is often restricted or blocked completely. While this makes sense from a security point of view, this introduces some challenges. For one, getting software packages. There are usually two approaches to the package problem in such an environment: Either allow a certain package mirror in the firewall, or run your own mirror within the restricted environment with access to another package server to mirror packages from.

Reality Bytes Is BACK: ft. Marc Petter on the Future of IT Jobs

Reality Bytes is back—and this time, we’re diving straight into the future of IT jobs. Tom, Oriana, and Dina are joined by Marc Petter (Senior Product Manager, Nexthink) to explore how AI is reshaping roles, workflows, and career paths. From automating repetitive tasks to the rise of AI agents handling entire processes, the conversation tackles what’s changing, what still requires a human touch, and how IT professionals can stay ahead. They unpack the difference between what can vs. should be automated, and what the new IT career ladder might look like in an AI-driven world.

From Honeycomb Customer to Bee: An Observability Champion's Journey

One of the most important and meaningful cornerstones that has defined and powered my career so far has been how I try to use my skills and talents to make the people around me stronger and achieve positive outcomes. My roles in tech have predominantly been in the ops engineering domain. I consider myself an ops engineer, a title I wear with pride.

Measure the business impact of every product change with Datadog Experiments

Modern product teams ship features constantly. Every change—whether it’s a new onboarding flow, pricing tweak, or UI adjustment—raises the same question: Did this improve the product? AI has changed the stakes entirely: As release cycles accelerate and code generation scales across every team, the volume of changes has outpaced most teams’ ability to measure their true value.

What Metrics to Monitor in Your Vibe Coded App

These days, using a tool such as Cursor, GitHub Copilot, Zed, or Claude makes it easier than ever to develop and deploy applications. You express your requirements, receive the completed project back as output, and there you have it! You now have an application that is in production and functioning. However, the surprise comes after the app has been deployed. When your app breaks or behaves abnormally, it may not be immediately obvious what is wrong or how to fix it.

Checkly Playwright Reporter: A Cloud Dashboard for Your Playwright Tests

The Checkly Playwright Reporter is an npm package that sends the results of npx playwright test to Checkly as a cloud test session, including traces, screenshots, videos, and full debugging context. Run your Playwright suite in CI or locally, and every result gets a persistent, shareable home in Checkly with AI-powered analysis, richer trace-derived views, and a direct path to production monitoring. It does not replace Playwright. It makes the output of Playwright much easier to work with.

Playwright Myths Busted: Speed, Flakiness, Production Monitoring & AI Test Generation

Playwright is too hard, too slow, and too flaky — right? In this webinar, Stefan busts six common end-to-end testing myths and shows how to reuse your Playwright tests as production monitors with Checkly. He covers codegen, trace viewer, UI mode, flakiness root causes (and fixes), and a quick look at Playwright MCP for AI-assisted test generation.

Agno Monitoring & Observability with OpenTelemetry and SigNoz

Learn how to implement end-to-end monitoring and observability for Agno-based AI systems using OpenTelemetry and SigNoz. In this video, we walk through instrumenting your Agno workflows, collecting traces, metrics, and logs, and visualizing everything in SigNoz to gain real-time visibility into performance, failures, and bottlenecks. You'll see how to move from basic logging to production-grade observability—so you can debug faster, optimize latency, and confidently run AI systems at scale.

Unified Logging for a Single Source of Truth

In Star Trek, the Borg are a cybernetic alien organism that forcibly assimilates other beings and technologies into its hivemind called “The Collective.” Each assimilated being or technology becomes part of the unified consciousness, with the villainous Borg Queen as the leader. As the only independent thinker, the Borg Queen leads this rapidly adapting Collective.

Node Groups: Organize Your Infrastructure Into Reusable Views

When you’re managing a handful of nodes, the flat list in the nodes tab works fine. When you’re managing hundreds or thousands, it becomes a wall of hostnames. You end up applying the same filters repeatedly: all the production database servers, all the nodes in eu-west, all the Kubernetes workers in the staging cluster. The filters work, but they don’t persist, and there’s no way to share them with the rest of your team. Node groups solve this.

Telemetry Talks ep 3: OpenTelemetry with VictoriaMetrics observability signals

In this episode of Telemetry Talks, we explore the OpenTelemetry observability signals (metrics, logs, and traces) and how VictoriaMetrics handles each of them with high performance, cost efficiency, and seamless integration. We briefly explain what each signal is, discuss common misconceptions, and share guidance on which signal to start with if you're new to observability. Together with our guests, both engineers at VictoriaMetrics, we walk through integrating VictoriaMetrics with the OpenTelemetry demo, showcase Grafana dashboards, and check the playgrounds for all three signals to see them in action.