Operations | Monitoring | ITSM | DevOps | Cloud

Sponsored Post

Fabrix.ai at Cisco Live 2026 Amsterdam

This post highlights the biggest Cisco AI Summit takeaways that came up again and again in Cisco Live conversations, and what they mean for teams operating AI in production. If you are following the broader AgentOps movement and the rise of agentic workflows, Fabrix.ai’s point of view is grounded in a core idea: AI agents create value only when they can be operated safely and consistently. A good starting point is here: Fabrix.ai’s approach to agentic.
Sponsored Post

What is a Real-Time Data Lake?

A data lake is a centralized data repository where structured, semi-structured, and unstructured data from a variety of sources can be stored in its raw format. Data lakes help eliminate data silos by acting as a single landing zone for data from multiple sources. But what's the difference between a traditional data lake and a real-time data lake? Some traditional data lakes use batch processing, which involves processing and analyzing a collection of data that has been stored over a specific timeframe. For example, payroll and billing systems that are handled on a weekly or monthly basis might use batch processing.

Behind the magic of auto-instrumentation (Grafana OpenTelemetry Community Call)

You add the OpenTelemetry Java agent, restart your app - and like magic, observability appears. But is it really magic? What’s actually enabled by default? What telemetry should you expect to see? What’s missing? And what might you want to tweak, tune, or even turn off?

How Fabrix.ai Agents Ensure Data Privacy & Security

As Agentic AI moves into enterprise environments, IT and security leaders face a critical challenge: how to leverage advanced LLMs without exposing sensitive data, intellectual property, or proprietary configurations to the cloud. You cannot build a self-driving, autonomous IT infrastructure if your security team blocks the deployment, and that’s exactly why the Fabrix.ai platform features an Enterprise-Grade LLM Integration architecture anchored by our built-in Data Security layer.

IT Cost Optimization Strategy: Eliminating Guesswork with Observability

IT organizations are being asked to reduce costs, manage risk, and maintain performance at the same time. Meanwhile, infrastructure complexity continues to grow, and vendor pricing changes are reshaping budget assumptions. Too often, an IT cost optimization strategy is shaped by incomplete data around sizing, licensing, refresh timing, and platform decisions. That uncertainty leads to overprovisioning, budget surprises, and reactive operations. Observability changes that equation.

Shopify outage on February 15, 2026

On February 15, 2026, Shopify experienced a widespread service disruption that impacted merchants and shoppers around the world. While the provider did not acknowledge the issue until 15:36 UTC, StatusGator’s Early Warning Signals detected unusual activity and alerted customers at 15:00 UTC, just minutes after the first outage reports began coming in. This incident highlights the importance of independent, real-time monitoring.

SendGrid Status Monitoring: How to Track Email Delivery Outages

When SendGrid goes down, your transactional emails stop reaching customers. Password resets fail. Order confirmations vanish. Support tickets never arrive. By the time you notice, customers are already complaining. For DevOps and SRE teams, checking SendGrid status shouldn't be a manual process. It shouldn't wait until customers report it either. For a team sending 10,000 transactional emails per day, a 15-minute outage means roughly 100 emails that never arrived.
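The "roughly 100 emails" figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes emails are sent at a uniform rate across the day, which real traffic rarely is; the function name `emails_missed` is hypothetical and used only for illustration.

```python
def emails_missed(emails_per_day: int, outage_minutes: float) -> float:
    """Estimate emails affected by an outage, assuming a uniform send rate."""
    minutes_per_day = 24 * 60  # 1,440 minutes in a day
    emails_per_minute = emails_per_day / minutes_per_day
    return emails_per_minute * outage_minutes


# 10,000 emails per day and a 15-minute outage, as in the example above
estimate = emails_missed(10_000, 15)
print(round(estimate))  # 104
```

If sending is concentrated in business hours, an outage during a peak period would affect more emails than this uniform estimate suggests.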

This Month in Datadog - February 2026

On the first episode of This Month in Datadog in 2026, Jeremy covers how you can protect agentic AI applications with AI Guard, stay up to date and collaborate during incidents with five Incident Management releases, and ship software with confidence using Feature Flags. Later in the episode, Kevin spotlights Datadog Data Observability, which enables you to detect data quality and pipeline issues early.

Grafana Campfire - Back to Basics - (Grafana Community Call - Feb 2026)

Grafana Campfire Community Calls are back. We are starting with *Back to the Basics.* Even if you have heard them many times, some of you may be new to, or not very experienced with, terms such as monitoring, observability, metrics, tracing, and profiling. The good news is that you're not alone, and we've got this! This will be a perfect opportunity to build your understanding during the live call and to ask questions if anything is unclear.

The rise of agentic AI in production: Can observability systems run themselves?

Sometimes the biggest shifts in technology aren’t about collecting more data — they’re about who (or what) gets to act on it. In this episode of “Grafana’s Big Tent” podcast, host Tom Wilkie, Grafana Labs CTO, is joined by Spiros Xanthos, Founder & CEO of Resolve AI, Manoj Acharya, VP of Engineering for Observability at Grafana Labs, and Cyril Tovena, Principal Engineer on the Grafana Assistant team, to discuss agentic AI in observability.

From RCA to Autonomous Ops: The Future of AI in Observability | Big Tent S3E7

SREs are famously skeptical of AI — so how do you convince them to trust agents in production? In this episode of Grafana’s Big Tent, Tom Wilkie talks with Spiros Xanthos (Resolve AI), Manoj Acharya (Grafana Labs), and Cyril Tovena (Grafana Assistant team) about agent-first observability. They unpack knowledge graphs, LLM reasoning, autonomous debugging, pricing models, and the “Claude Code moment” for observability. Is autonomous production ops closer than we think?

Your Data is Whispering and Needs a Human to Listen

If you have ever owned, operated, or supported a piece of technology, you have probably built a dashboard. Maybe it started as a quick chart to answer a simple question, then quietly grew into something more important. Dashboards are often created by the people who know the systems best, the ones who can wire together data sources and click all the right buttons. But those same builders are rarely trained in how humans actually interpret data.

Data Observability, AI Guard, Feature Flags, Ambassador program, and more | This Month in Datadog

See how you can ensure trust across the data life cycle in February’s episode of This Month in Datadog. Join us for a spotlight of Datadog Data Observability, which enables you to detect data quality and pipeline issues early, as well as remediate those issues with end-to-end lineage. Plus, we cover: protecting agentic AI applications from real-time threats with Datadog AI Guard; staying up to date and reducing steps to collaborate with five new Incident Management releases; and releasing software with confidence using Datadog Feature Flags.

Why Evidence-Backed RCA in Edwin AI Starts With Logs

A step-by-step look at how Edwin AI uses native LogicMonitor logs, topology, and context to turn root cause analysis from alert-driven inference into evidence-backed investigation. Most root cause analysis today starts with alerts and ends with explanations that sound reasonable but can’t be verified. An alert is fed into a language model, and the output looks like an answer. It often isn’t.

8 Years of Building Obkio: From Network Monitoring to Observability & Network Diagnostics

In 2016, Obkio was just an idea, but it was an idea born from a real problem. Before writing a single line of code, we conducted a market audit to understand why Network Performance Monitoring solutions weren't more mature. We interviewed banks, manufacturing companies, and service providers, and the answer was unanimous: the NPM tools on the market were too complex, and most businesses simply didn't have the internal resources to dedicate full-time to managing them.

Block Builder: a new Mimir Component (Mimir Community Call February 2026)

At today’s community call, we will hear from David Grant, one of the engineers who has brought a new component, the Block Builder, into Mimir. Using the Ingest Storage architecture in Mimir 3.0, the Block Builder takes over the block-building responsibility from the Ingester. This feature is experimental in Mimir today, but is rolling out to production inside of Grafana Labs now. This is a great time to introduce the component, discuss the motivation, and show where it fits in the larger architecture.

AI Agents in IT Operations: From Concept to Practical Value

Artificial intelligence has been a defining theme in IT operations for nearly a decade. Early AIOps initiatives focused on predictive analytics and anomaly detection, promising to reduce operational overhead and improve system reliability. While these capabilities delivered incremental value, they often fell short of transforming how operations actually functioned.

The Grafana Labs operating system: Introducing our Guiding Principles

Matt Toback is the VP of Culture at Grafana Labs. We published our original company values back in December 2020. We were a young company, growing fast, and fully remote. Our values at the time were aspirational, and painted a picture of the kind of company we wanted to be. Those values did real work and they mattered. You could hear them used in everyday conversations, and they helped get us to where we are today. But growth has a way of revealing gaps.

The Definitive AWS Outage Report 2025: Reliability Analytics and Cascade Impact

Amazon Web Services remains one of the most popular cloud providers, with 200+ services in 39 regions across the world. Like all providers, it has its share of outages. In 2025, IncidentHub detected 38 AWS outages, of which the one on October 20th had the most widespread impact, affecting hundreds of SaaS providers simultaneously. Payments were disrupted, students lost access to classrooms, developer tooling degraded, and some IT teams experienced alerting gaps.

Talk to Your Logs: LLM-Powered Chat UI in DSDL 5.2.3

We are excited to announce the release of the Splunk App for Data Science and Deep Learning (DSDL) version 5.2.3. Since 2018, DSDL has served as an innovation hub for custom AI integrations within Splunk. In 2025, the release of DSDL 5.2.0 introduced customizable Large Language Model (LLM) integrations, bringing Retrieval Augmented Generation (RAG) and Agentic AI workflows to Splunk users.

Top tips: Think it's a recommendation? It might be an ad

Top tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we'll be looking at ways we can spot ads disguised as recommendations in today's influencer era. These days, it's getting harder for me to distinguish between an ad and a recommendation.

Powering Security Innovation: Executive Q&A on Splunk Joining AWS Security Hub Extended

To succeed in the AI era, customers need fast, easy access to security solutions that can harness the power of agentic AI and deliver business outcomes. They need seamless access to their data for faster threat detection, simpler incident response, and reduced risk. They need technology vendors to work together and not in silos.

Millions of Metrics. Zero Clarity.

Millions of metrics. Zero clarity. That’s the reality many IT teams are facing today. As environments grow more complex, telemetry explodes. Millions of records generated every hour. Dozens of specialized tools for network, storage, Kubernetes, cloud, AI workloads. Each tool is good at its domain. But none of them answers the real question: Where should I focus right now? Fragmented visibility creates predictable failure modes.

How to Debug Code You Didn't Write (your AI did)

I was looking at a customer’s error report last week. A TypeError buried three callbacks deep in a checkout flow that made no sense. The code around it was clean, well-structured, and completely wrong about how the Stripe API actually works. Turns out it was vibe-coded. Someone prompted their way through the integration, it passed code review because it looked reasonable, and it worked fine right up until a customer’s card got declined for the first time. That’s the new normal.

12 Best SSL Certificate Monitoring Tools in 2026

An expired or misconfigured SSL/TLS certificate doesn’t fail quietly. Users get blocked by browser warnings, conversions drop, and teams scramble to diagnose whether the problem is expiration, a missing intermediate, an SNI/hostname mismatch, or a CDN edge serving an old chain. That’s why SSL certificate monitoring in 2026 is less about “check the expiry date” and more about continuous validation + fast alerting + enough context to fix the issue quickly.

Enable end-to-end visibility into your Java apps with a single command

Achieving end-to-end observability for applications is a top priority for organizations today, but instrumenting for both frontend and backend monitoring can be a significant hurdle. What complicates matters is that the SREs and DevOps teams responsible for deploying monitoring tools typically don’t own frontend code or have the context needed to safely modify it.

The Command Center Shift: Why the Future of Middleware is Unified, Predictive, and Transaction-Centric

Middleware is evolving beyond invisible plumbing into a strategic Command Center. The future demands unified management, predictive intelligence, and transaction-centric operations to move from reactive firefighting to operational mastery in 2026.

Build a Unified Operational Ecosystem with ServiceNow and Coralogix

During high-priority incidents, SRE teams frequently lose critical time switching between monitoring platforms and ticketing systems. Context switching like this forces engineers to manually update incident states by copying and pasting data. The inevitable result is increased risk of information gaps and slower Mean Time to Recovery (MTTR).

AI can do what now?! What an ethical hacker says about deepfakes and AI

Real-time camera deepfakes are no longer science fiction. High-fidelity, AI-generated impersonation may be advancing quickly — but that's not the only AI risk financial services companies should be thinking about. In this episode of AI Can Do What Now?!, Lisa Jones-Huff, director of security solutions architecture at Elastic, sits down with ethical hacker Freakyclown (FC) to explore what is technically possible today with AI, where reality still falls short of the hype, and what security teams should be worried about.

AI can do what now?! The real risks of AI in social engineering

What is the most immediate risk financial services companies face today? AI-enabled social engineering is already accelerating real-world attacks. Scale, personalization, speed, and automation are lowering the barrier for attackers while making fraud detection more complex for defenders. In this episode of AI Can Do What Now?!, Lisa Jones-Huff, director of security solutions architecture at Elastic, is joined by ethical hacker Freakyclown (FC) and principal solutions architect Joe Murin to explore what is actually happening right now, beyond the hype.

Observability Self-Hosted 2026.1 - Routing Insights

SolarWinds Evangelist Chrystal Taylor introduces the new routing insights feature in Observability Self-Hosted 2026.1. This first phase enhancement enriches routing table information with detailed context, including forwarding interface names, VRF data, next hop IPs, and timestamps. The update unifies BGP, OSPF, and EIGRP neighbors in a single dashboard, providing visibility into peer identity, flap counts, health status, and admin states.

9 Best Network Monitoring Tools for 2026

A Rapidly Evolving Network Landscape Demands the Right Monitoring Strategy. Choosing the right network monitoring solution has become a mission‑critical decision for IT teams. In recent years networks have become increasingly hybrid, cloud‑distributed and reliant on remote connectivity. For enterprises with complex infrastructures, network monitoring is an essential tool.

Building Web API integrations that scale (5 key lessons)

I've used the Web API plugin with a wide range of APIs, and each one taught me something new. But before diving into building, I learned to pause and ask: What am I actually trying to display? Not what data the API can give me, but what would be useful on a dashboard? That shift in thinking — from ‘fetch everything’ to ‘fetch what matters’ — shapes how I approach every integration.

The Accountability Era: Decision Paths That Stand Up to Review

Modern IT environments depend on decisions that can withstand scrutiny. As systems grow more interconnected and outages carry greater cost, organizations must understand not just what actions teams take, but how those actions were formed. Operators need guidance anchored in evidence and aligned with business impact. Operational accountability now extends beyond correctness. Teams must show the information that shaped the decision, the options considered, and the reasoning behind the chosen path.

Colsubsidio transforms business process monitoring with Elastic Observability

Colsubsidio is one of the largest and most representative family compensation funds in Colombia. The organization manages and delivers essential social services to millions of users through a broad network spanning health, education, subsidies, recreation, tourism, credit, housing, pharmacies, retail supply, culture, and labor welfare.

The 5 best Jira reporting tools for 2026

Jira is the backbone of project management for thousands of agile teams worldwide. But while Jira excels at tracking issues and sprints, its native reporting can leave teams wanting more — especially when it comes to sharing insights, visualizing trends, and integrating data from across the business. That’s where dedicated Jira reporting tools come in. In this guide, we rank the 5 best Jira reporting tools on the market today.

The "Now" Problem: Why BESS Operations Demand Last Value Caching

Battery Energy Storage Systems (BESS) represent one of the most unforgiving environments for real-time data. Unlike a passive asset, a battery is a complex electrochemical system where safety and revenue are determined by split-second decisions. In this context, “average” latency can become a serious problem. Performance depends entirely on one key question.

What is Site24x7 Event Correlation? Causal AI and autonomous IT operations explained

When your distributed system goes down, your team spends days sorting through noise. That is revenue walking out the door. In this video, Jasper Paul breaks down the event correlation engine built to eliminate alert fatigue and accelerate root cause analysis. Most monitoring tools still rely on basic time-window alert grouping: clustering alerts that fire at the same time and calling it correlation. But in a distributed system, outages are never isolated events. And grouping symptoms doesn't find root causes.
Sponsored Post

SAP Application Performance Monitoring (APM): Beyond Generic Metrics

Your enterprise APM tool shows SAP is using 90% CPU. The dashboard turns red. An alert fires. Now what? You open Dynatrace. You see the Java Virtual Machine metrics for your NetWeaver stack. You see HTTP response times for your Fiori apps. You see a spike in database calls. None of this tells you why VA01 takes 45 seconds to create a sales order. None of this tells you which custom ABAP report is consuming memory. None of this explains the short dump that crashed your pricing routine. This is the gap between generic APM and true SAP application performance monitoring. Your enterprise tools see the symptoms.

Smarter Alerts, Upgraded Solution Packs, and an Expanded Ecosystem for Hyperconnectivity

At Fabrix.ai, we are constantly pushing the boundaries of what Agentic AI and AIOps can achieve. We are happy to announce the release of Fabrix.ai platform version 8.2, packed with capabilities that make managing your IT environment more intuitive, secure, and perfect.

Grafana 12.4 TL;DR - The Final 12.x Release

As the final minor release in the Grafana 12 series, 12.4 builds on our shift toward scalable, as-code workflows and a dramatically improved user experience. From bi-directional Git workflows to smarter dashboard layouts and stronger governance controls, this release is all about helping teams move faster with less friction.

Fixing a production error with the Flare CLI and AI, from discovery to deploy

Using the Flare CLI and its agent skill to find, fix, and resolve a production error without leaving the terminal. The AI agent looks up the latest error on freek.dev via the Flare CLI, analyzes the stack trace against the local source code, generates a fix, deploys it using bash mode, and marks the error as resolved in Flare.

Observability Self-Hosted 2026.1 - Server Configuration Comparisons

In this video, SolarWinds Evangelist Chrystal Taylor introduces server configuration comparisons, a new feature in Observability Self-Hosted 2026.1 and Server Configuration Monitor 2026.1. The key highlight is the ability to compare server configurations side by side, enabling users to identify differences in configuration files between nodes or against a defined ideal state. This new functionality aims to help users monitor configuration drift.

Incident Report: Exercises, Cleanups, and Evacuations

Every year, Honeycomb runs disaster recovery scenarios in multiple environments, including in production. Although each of our instances runs in a single region, on at least three Availability Zones (AZs), we have multiple plans for partial regional failures, and particularly, zonal failures. One of these tests was run on December 5th, and after its successful completion came its cleanup steps.

Alerting Is a Socio-Technical System

In the previous posts, we’ve looked at how alert noise emerges from design decisions, why notification lists fail to create accountability, and why alerts only work when they’re designed around a clear outcome. Taken together, these ideas point to a broader conclusion. That alerting is not just a technical system, it’s a socio-technical one. Alerting systems encode assumptions about how people behave, how responsibility is distributed, and how decisions are made under pressure.

AI performance reviews for your app with the Flare CLI

The Flare CLI connects to your Flare performance monitoring data and uses AI to turn it into actionable insights, right from your terminal. In this video, you'll see how a single command pulls your real performance data from Flare, then generates a full review: identifying slow endpoints, spotting error trends, and suggesting concrete fixes.

Best Website Monitoring Tools for Compliance and Security in 2026

Compliance audits used to be annual fire drills. Teams would scramble for weeks gathering screenshots, pulling logs, and hoping nothing slipped through the cracks. That approach no longer works when regulations like GDPR and HIPAA require continuous documentation and real-time evidence of security controls. Website monitoring tools designed for compliance have evolved to address this reality, automating evidence collection and flagging issues before auditors ever arrive.

Claude Code + OpenTelemetry: Per-Session Cost and Token Tracking

I was looking at our Claude Code spend in the Anthropic console the other day. Aggregate cost, aggregate tokens — no breakdown by developer, no breakdown by session. I knew my Hackathon team had been using it heavily on building out new features for the OpenTelemetry Distro Builder. But heavily how? I had no idea. Turns out Claude Code has been emitting OpenTelemetry signals the whole time. Per-session cost, token counts, every tool call it makes on your codebase.

Digital Employee Experience Is Now Core to IT - Recognized by Analysts, Reinforced by Customers

Over the past few years, Digital Employee Experience (DEX) has moved from emerging concept to essential capability for modern IT organizations. The conversation has changed. IT is no longer measured only by system uptime or ticket resolution. Today, success is defined by how technology actually performs for employees — and how consistently organizations can deliver productive, friction-free digital work.

Catch Every Moment in Kubernetes: Splunk's Observability Advantage

Discover why real-time, unsampled observability is critical for Kubernetes environments with Stephane Estevez from Splunk at KubeCon Europe 2026. Learn how Splunk’s unique approach helps you catch every important moment—even when containers vanish in milliseconds. Watch now for expert insights on cloud-native monitoring, observability, and Kubernetes best practices!

VictoriaMetrics February 2026 Ecosystem Updates

This month, we’re thrilled to see OpenAI using the VictoriaMetrics Stack internally — including VictoriaMetrics, VictoriaLogs, and VictoriaTraces — in their Harness engineering experiment, as shown in their architecture diagram. It’s a great way of combining observability and AI agents.

Grafana 12.4 release: faster and easier data visualization, observability as code updates, and more

As we gear up for Grafana 13, the next major release of the open source data visualization platform that we’ll announce at GrafanaCON this April, our engineering team is still shipping some powerful new features along the way. Case in point: Grafana 12.4 is officially here, and there’s a lot to be excited about. The latest minor release includes a ton of updates that help you build and design dashboards faster than ever, as well as manage and scale those dashboards seamlessly over time.

Cut Costs, Not Visibility. Use S3 for Low-Cost Log Retention and Faster Response.

Why pay for continuous ingestion of data you rarely use? Learn how to maintain a lean data strategy by keeping long-term logs in cheap S3 storage, while retaining the power to "promote" specific slices into Splunk whenever an audit or investigation arises. See how Promote for Amazon S3 gives you the speed of local indexing without sacrificing speed in investigations.

AlphaFold, Office Politics, and Mustafa Suleyman's Two Futures (w/Benedict Lelijveld)

In this episode, Benedict Lelijveld joins us to unpack what it feels like to start a career in an era shaped by COVID disruption, hybrid work, and accelerating AI. We dig into his writing on Mustafa Suleyman and the idea of “pessimism aversion”: holding genuine hope for breakthroughs (from personal AI to advances in biology) while staying clear-eyed about risks like misuse, weak regulation, and who really benefits. Benedict also reflects on what early-career professionals lose when work becomes too remote—and why protecting your voice, curiosity, and craft matters more than ever as automation spreads.

Case Study - Troubleshooting Storage Failures in a VMware ESXi Infrastructure

IT problems happen even in the best-architected infrastructure due to configuration changes, failures, upgrades, and the like. How quickly and effectively you can detect and resolve such problems dictates how efficient your IT operation is. Today, I’ll cover how eG Enterprise helped us troubleshoot a hardware failure (a storage battery failure) that caused a cascade of failures in a VMware ESXi infrastructure.

Microsoft SCOM Tips & Tricks

This one is for all the Microsoft SCOM geeks out there — 99 practical tips & tricks to make managing SCOM way easier. The tips compiled here draw from community experts, SCOM-focused blogs, Microsoft’s official documentation, and the hands-on experience at NiCE. You may already know some of them, but having them all organized in one place makes it easy to reference and put them into practice.

Notes from the Field: XenServer falling back to file-based licensing when using LAS

Citrix has been transitioning products toward License Access Service (LAS) as the modern licensing method. Unlike traditional file-based licensing, LAS introduces service-based communication between products and the Citrix License Server. As of 15 April 2026, LAS becomes the mandatory licensing method for supported products. Environments still relying on file-based licensing will need to transition before that date.

The Evolution of Digital Employee Experience (DEX) | How IT Is Transforming the Workplace

Digital Employee Experience (DEX) is transforming how IT teams support employees, improve productivity, and drive business outcomes. In this video, we explore the evolution of DEX—from traditional reactive IT support to proactive, experience-driven operations that empower both employees and organizations.

The Grafana Cloud identity blueprint: balancing security and scale

If you've ever rolled out Grafana Cloud to a growing engineering organization, this pattern may sound familiar: Everything feels simple at first. You invite a few teammates, give them access, and dashboards start appearing. Then the team grows. Then the number of stacks grows. Over time, a model that once felt fast and empowering starts to feel risky, difficult to understand, and even harder to undo. This post is about avoiding that moment.

Measure and improve mobile app startup performance with Datadog RUM

Mobile app users form opinions quickly. A slow or inconsistent startup experience can frustrate them before they reach the first screen, increasing the likelihood that they abandon the app or fail to complete key actions such as signing up or making a purchase. However, app teams often lack reliable signals that explain why startup performance varies, making it difficult to improve the user experience.

From Alerts to Answers: Introducing Coralogix Cases

Modern incident response doesn’t fail due to a lack of alerts firing. It fails because teams are overwhelmed by the sheer volume and the lack of context around them. Today, most observability and monitoring platforms generate a flood of alerts. Each one is triggered independently, even when they are symptoms of the same issue. Engineers are left trying to reconstruct the full picture while jumping between dashboards, Slack messages, and tickets.

Reducing Risk When It Matters Most: How Verifiable Guidance Protects Critical Operations

When a major incident strikes, every second becomes a decision point. Service degradations accelerate. Customers feel the impact. Revenue and reputation hang in the balance. In these moments, IT teams do not need abstractions or probabilistic guesses. They need guidance they can validate and decision paths they can explain with confidence long after the incident is resolved. Hybrid environments are too complex for intuition, and the repercussions of an incorrect action are significant.

Observability Self-Hosted 2026.1 - Additional Cloud Support

SolarWinds Evangelist Chrystal Taylor demonstrates the new cloud entity support features in Observability Self-Hosted version 2026.1. The update adds monitoring capabilities for MySQL and PostgreSQL databases on Google Cloud Platform, GCP load balancers, Azure functions, AWS Elastic Kubernetes Service, and AWS Lambda functions. She provides a guided walkthrough of the dashboard interface, showing how users can monitor various metrics including database performance, network traffic, latency, function execution counts, system usage, and costs across different cloud platforms.

Monitoring and Optimizing a Hybrid Cloud Environment | WhatsUp Gold

This webinar focuses on Monitoring and Optimizing a Hybrid Cloud Environment. Downtime is an expensive inconvenience. Yet many IT teams still face monitoring blackouts due to rigid licensing models and outdated failover strategies. In this session, we’ll introduce a smarter approach: High Availability by Design. Whether you're scaling operations or modernizing infrastructure, this session will equip you with the tools and insights to build a resilient, future-ready monitoring strategy.

Bindplane + VictoriaMetrics: Unified Telemetry for Metrics, Traces, and Logs at Scale

We’re excited to announce new native Bindplane destinations for the VictoriaMetrics ecosystem. It’s now easier to collect, process, and route OpenTelemetry metrics, traces, and logs at scale. You can directly connect VictoriaMetrics’ high-performance storage engines to Bindplane’s vendor-neutral, OpenTelemetry-native telemetry pipeline.

Reinventing the Incident Responder's Day: Empowering Tier 2 SOC Analysts with Splunk's Agentic SOC Platform

The Tier 2 SOC Analyst or the Incident Responder (often hailed as the "Sherlock Holmes of the network") faces an increasingly complex and relentless digital landscape. In a world where analysts are being overwhelmed by alerts, held back by fragmented, manual tooling and inefficient workflows, incident responders are charged with the critical task of identifying, analyzing, and mitigating security threats.

I let Claude investigate a production incident with Honeybadger's MCP server

In this demo, Kevin shows how you can use Honeybadger's MCP server with Claude to investigate a production incident — going from a natural language prompt to a complete incident dashboard in minutes. Honeybadger is an application health monitoring platform that helps developers catch errors, track performance, and stay on top of incidents. The MCP server lets AI assistants like Claude query your Honeybadger data directly, so you can investigate issues conversationally without digging through dashboards manually.

Why Site Performance Metrics Are the Missing Piece in Your Local SEO Strategy

Most conversations about local SEO start and end with Google Business Profiles, reviews, and citations. And sure, those things matter. But there's a whole layer of the ranking equation that gets ignored by marketing teams because it lives on the ops side of the house. Site performance, server response times, uptime consistency, and how your infrastructure handles traffic spikes during peak local search hours. These aren't just IT concerns anymore. They have a direct line to whether your business shows up when someone searches "plumber near me" at 9 PM on a Tuesday.

Freshping is retiring: ensure your monitoring remains uninterrupted

Freshping has announced that it will retire its service on March 6, prompting many organizations to reassess how they maintain uptime visibility. When monitoring stops, it doesn't mean your issues stop too; it’s a period of forced blindness. This sunsetting period exposes a core vulnerability: Digital visibility is only as strong as the platform supporting it.
Sponsored Post

What to Say When Things Break: Outage Notification Templates for Ops Teams

This practical guide explains what to say when systems break, offering ready-to-use outage notification templates and best practices to help ops teams communicate clearly during incidents. Learn how effective outage communication can reduce confusion, manage user expectations, and maintain trust during service disruptions.

DNS blocklist monitoring now available to all Oh Dear users

Your domain is on a spam blocklist. Password reset emails aren't arriving, order confirmations land in spam, and customers are complaining that "your site doesn't work." By the time you hear about it, the damage has been building for days. We've shipped DNS blocklist monitoring to catch this early. Oh Dear now checks your domain against 11 major blocklists and notifies you the moment you're listed, with direct links to get removed.

The limits of MCP and how Olly surpasses them

Model Context Protocol (MCP) servers act as adapter layers between clients and AI-based workloads. Installing MCP into an IDE, such as Cursor, brings a wealth of information directly into the developer's primary tool, minimizing context switching and, especially in the world of observability, bringing telemetry closer to the code. MCP is not without its limits. These limits initially seem trivial, but in time, some of the inherent limitations of a basic MCP implementation become apparent.

The Benefits of Distributed Network Monitoring for Multi-Site Businesses: Why Hybrid Work Changed Everything

Most companies rewired how their people work, not once but twice. First for remote, then for RTO (Return to Office). Their network monitoring never caught up. So, what happened? IT teams are managing a network that spans headquarters, branch offices, home setups, and cloud apps with tools that still assume everyone's connecting back to one place. When something breaks (and it will), nobody can pinpoint where. IT takes the blame. Users lose productivity. Leadership loses patience.

Using Core Web Vitals in Honeycomb Frontend Telemetry

Google's Core Web Vitals (CWVs) measurements have been used by web administrators and SREs to review frontend application performance metrics, and have been factored into Google's page rankings since 2021. They are also used in Google Analytics, which crawls websites and evaluates performance metrics over a period of multiple days, and with various frontends (desktop web, mobile web, etc.) to establish how well a website performs in production.

Evaluating our AI Guard application to improve quality and control cost

This article is part of our series on how Datadog’s engineering teams use LLM Observability to build, monitor, and improve AI-powered systems. Organizations are building AI agents that help users automate work, analyze data, and interact with complex systems through natural language. As these agents become more capable, they also become more complex and exposed to risks such as prompt injection, data leaks, and unsafe code execution.

AI Assistant vs Skylar Advisor

What happens when AI understands your entire environment? With Skylar Advisor, you move beyond prompts and responses and get prioritized guidance based on real operational impact. Skylar Advisor identifies what matters most, explains why it matters, and provides clear next steps so even junior IT professionals can operate with confidence.

A 4-Month Bug Fixed in <10 Minutes with Olly

In today’s highly interconnected systems, the subtle relationships between services are rarely obvious. Modern, complex architectures generate telemetry that functions less as “flashing signs” and more as faint “breadcrumbs” to be followed across a vast network of signals. In 2025, about two-thirds of outages involved third-party systems like cloud platforms and APIs.

How Coralogix's Data Pipeline Turns Obscure Data into Clear Business Value

Observability data arrives as a flood of signals, full of potential, but rarely consistent. Error messages and debug logs can reveal what businesses care about: reliability, customer experience, and revenue. The challenge is turning raw technical events into information the whole organization can act on. Many observability systems store data first and structure it later, forcing teams to rebuild context in dashboards and queries, often duplicating logic across services.

Heartbeat behind the metrics | Hemachand on what visibility really means

What happens when observability grows faster than infrastructure? In this episode of Heartbeat Behind the Metrics, Hemachand Munagapati, Product Manager at Site24x7, reflects on over 15 years with the product and how the idea of a single pane of monitoring has shaped everything that followed.

Icinga Notifications: Improving Alerting and Incident Workflows Webinar

Modern monitoring is not just about alerting, it’s about reducing noise, protecting on-call engineers from burnout, and improving incident MTTR through context-aware workflows. Icinga Notifications helps teams achieve just that with configurable, extensible alert processing built for scale. This webinar was held on February 17, 2026. We dive into the brand-new Icinga Notifications capabilities, a modern approach to alerting and incident workflows tailored for complex, dynamic infrastructures.

Why Nexthink Intelligence Is a Game-Changer for IT Teams

Nexthink Intelligence transforms digital employee experience (DEX) for modern enterprises. Learn how IT teams can leverage real-time analytics, proactive insights, and automation to improve user productivity, troubleshoot issues fast, and deliver better workplace tech experiences. Learn more at nexthink.com.
Sponsored Post

Cisco Live'26 - Amsterdam: Aligning with the AI-Driven Future

The energy at Cisco Live EMEA in Amsterdam (February 9-13, 2026) was primarily driven by groundbreaking AI announcements, and the event gave Fabrix.ai an opportunity to strengthen our strategic position alongside the Cisco and Splunk ecosystems. The event’s focus on AI, highlighted by the recent Cisco AI Summit, emphasizes a clear market direction in which Fabrix.ai is perfectly poised to accelerate innovation.

Database Partitioning: Types, Strategies, and When to Use Each

How database partitioning works in PostgreSQL and MySQL. Range, list, and hash partitioning with SQL examples and guidance on when to partition vs shard. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Database Sharding: How It Works and When You Actually Need It

How database sharding works, common strategies (hash, range, directory), shard key selection, and the operational cost of running a sharded database in production. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Trello outage on February 19, 2026

On February 19, 2026, Trello users around the world began experiencing issues loading boards and accessing their workspaces. StatusGator received the first outage reports at 14:24 UTC and triggered an Early Warning Signal at 14:28 UTC. Trello did not officially acknowledge the incident until 15:08 UTC, after user reports had already subsided. This incident highlights how real-time user reports and Early Warning Signals can identify widespread service degradation before providers confirm a problem.

Release v2.9: OTEL Logs, Database Functions, SNMP Functions and more.

In this video, we walk through the biggest updates in Netdata v2.9, including: Top Tab Database Functions to analyze slow queries and performance bottlenecks without logging into your database; SNMP Network Interfaces Function for real-time visibility into network interfaces; Microsoft SQL Server Collector with richer MSSQL metrics; and OpenTelemetry Logs Ingestion to correlate logs and metrics in one place.

Cost Optimization for AI Workloads: From Visibility to Control

ITOps teams can achieve cost management of AI workloads with an observability platform that connects AI usage and performance with cloud spend for clear visibility and predictability. Behind the buzz around artificial intelligence, or AI, many companies are discovering the hidden and compounding costs of AI adoption.

How LogicMonitor Delivers AI Cost Optimization

LogicMonitor delivers AI cost optimization by unifying infrastructure telemetry, AI-specific signals, and cloud financial data into a single workflow, so teams can move from visibility to continuous, operationalized cost control. In Cost Optimization for AI Workloads: From Visibility to Control, we explored why AI workloads introduce new layers of cost complexity—from GPU-heavy compute and token-based pricing to distributed infrastructure that obscures true spend.

What feels different about enterprise IT operations today compared to even 3-5 years ago?

Speed isn’t the problem. Speed without shared visibility is. AI compressed release cycles, multiplied dependencies, and pushed accountability to teams who no longer own the full stack. The result? Faster change. Slower resolution. Higher risk. This is why MTTR is moving the wrong way... and why observability has to evolve. (Amit Rathi)

The Hidden Operational Risk Financial Institutions Can No Longer Ignore

Why digital experience is now a regulatory priority: in regulated industries like financial services, even minor technology friction can quickly become a regulatory risk. Gaps in visibility, slow systems, and inconsistent performance can trigger audit findings, SLA breaches, and increased compliance scrutiny.

Identify untested code across every level of your codebase

As organizations scale their services and adopt AI-assisted coding, code changes are landing faster and in greater volume than ever before. While this powerful new practice is accelerating the pace of development, it is also increasing the likelihood that untested code may slip into repositories without detection. What makes this problem even worse is that most teams have no reliable way to know which code is covered by tests.

Nexthink Workspace - Where DEX Work Happens

Workspace is the new space for managing DEX inside the Infinity platform. It brings signals, analysis, guided actions, personalized answers, and chat history into one clean and intuitive full-screen experience. Workspace turns everyday questions into insight and action so teams can investigate faster and make better decisions without complexity or technical query languages. Its enhanced reasoning engine is fully NQL certified, delivering accurate explanations and deeper context across every investigation.

Is Your File Integrity Monitoring Outdated? Kubernetes Needs Runtime FIM

If your file integrity monitoring (FIM) still relies on scheduled scans… it was built for static servers — not Kubernetes. In cloud-native environments, traditional FIM creates detection delays, wasted CPU, excessive I/O, and alert noise. And if a malicious process modifies a file and exits before the next scan? You might miss it entirely. In this video, we break down: Modern runtime FIM works differently. Instead of scanning everything on a schedule, it.

Event Intelligence is Replacing Monitoring - Here's Why That Matters

For more than two decades, monitoring has been the foundation of IT operations. Organizations invested heavily in tools designed to collect metrics, visualize performance, and trigger alerts when thresholds were breached. This model was effective in an era when infrastructure was largely static, workloads were predictable, and system dependencies were relatively easy to trace. That environment no longer exists.
Sponsored Post

Forwarding Microsoft SCOM Alerts to the Service Desk

Modern IT operations rely heavily on monitoring solutions like System Center Operations Manager (SCOM) to detect issues across servers, applications, and services. While SCOM excels at generating alerts, organizations often struggle to ensure these alerts translate into actionable incidents in their IT Service Management (ITSM) platforms. Without proper integration, critical alerts may be missed, tickets may be created manually, and incident resolution can be delayed.

The Next Era of Observability: Founders' Reflections - Additional Q&A

What happens when the people who helped define observability take a hard look at AI? That’s what Honeycomb co-founders Christine Yen (CEO) and Charity Majors (CTO) dug into during this webinar, starting with the early days of observability (back when it wasn’t even a category yet).

The Current State of Content Negotiation for AI Agents (Feb 2026)

The web was built for humans, but now the agents are taking over. Humans look at a web page and see content rendered by their browser. AI agents see 180,000 tokens of nav bars, footers, and div soup — burning through their context window on junk that makes them slower and stupider. The web needs to evolve, and we as developers are driving the shift. AI agents like Claude Code, Cursor, Codex, and Gemini are how we interact with documentation, CLIs, and products today.

SSL/TLS Certificate Lifetimes to Reduce to 47 Days

Last year it was widely reported that the CA/Browser Forum had voted to significantly reduce the lifespan of SSL/TLS certificates over the next 4 years, with a final lifespan of just 47 days starting in 2029. The first reduction takes effect in a few weeks, on March 15, 2026, accelerating the need for organizations to automate their monitoring and renewal processes around certificate expiry.
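As a rough illustration of what automated expiry monitoring involves, the sketch below computes the days left on a certificate and flags it for renewal. The function names and the 15-day renewal threshold are assumptions made for this example, not values taken from the article or from any CA/Browser Forum rule.

```python
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: datetime, now: datetime) -> int:
    """Whole days remaining before the certificate's notAfter timestamp."""
    return (not_after - now).days


def needs_renewal(not_after: datetime, now: datetime, renew_before_days: int = 15) -> bool:
    """True once the remaining lifetime drops to the (assumed) renewal threshold."""
    return days_until_expiry(not_after, now) <= renew_before_days


# Hypothetical certificate issued with a 47-day lifetime.
issued = datetime(2029, 4, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=47)

print(days_until_expiry(expires, issued))                   # remaining days on day 0
print(needs_renewal(expires, issued + timedelta(days=40)))  # late in the lifetime
```

With lifetimes this short, a check like this would have to run on a schedule rather than by hand, which is the point the excerpt makes about automation.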

Make use of guardrail metrics and stop babysitting your releases

Modern CI/CD pipelines have automated the hard work of building, testing, and deploying our code. But for many teams, that’s where the automation stops. The most critical part of a release, turning a new feature on for real users, is still a stressful, manual process. An engineer cautiously ramps up traffic to 5%, then 10%. The whole team stares at dashboards, trying to see if anything breaks. If something does, they scramble to manually roll back.
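The manual ramp described above can be sketched as a small loop that checks a guardrail metric at each traffic step and rolls back automatically on a breach. This is a minimal sketch under stated assumptions: `ramp_release`, the step percentages, and the error-rate callback are hypothetical names for illustration, not any vendor's API.

```python
from typing import Callable, Sequence, Tuple


def ramp_release(
    steps: Sequence[int],
    read_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[int, bool]:
    """Walk through traffic percentages; roll back to 0% if the guardrail is breached.

    Returns (final_percent, rolled_back).
    """
    current = 0
    for percent in steps:
        current = percent
        # The guardrail metric is read after traffic has been moved to this step.
        if read_error_rate(percent) > max_error_rate:
            return 0, True  # automatic rollback instead of a manual scramble
    return current, False


# Hypothetical metric source: errors spike once half the traffic is on the new code.
def spiky(percent: int) -> float:
    return 0.05 if percent >= 50 else 0.01


print(ramp_release([5, 10, 50, 100], lambda p: 0.01, max_error_rate=0.02))
print(ramp_release([5, 10, 50, 100], spiky, max_error_rate=0.02))
```

In a real pipeline the callback would query a monitoring system and wait for enough samples at each step; here it is a plain function so the control flow stays visible.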

Reliability Has Outgrown the Systems Supporting It

Service reliability has outgrown uptime checks and component-level tools, creating friction that slows response, increases toil, and wears teams down. Uptime checks can pass, high availability can be in place, and users still can’t complete basic actions. Pages load slowly, latency spikes, and requests stall — all without a single system flagged as down. Availability measures whether a service is running.

Signal-Driven Error Monitoring: Detecting and Debugging Reactive Failures in Angular

Angular's Signal-based reactivity model represents one of the biggest paradigm shifts the framework has seen since Ivy. By replacing the asynchronous push-pull model of RxJS with synchronous, localized updates, Signals make state management both simpler and faster. But this new simplicity hides a subtle danger: when something breaks inside your reactive graph, it often does so silently. A computed value might stop updating. An effect might fire indefinitely.

How to write annotations in Kubernetes with JSON for Datadog Autodiscovery | Datadog Tips & Tricks

Pod annotations in Kubernetes with invalid JSON syntax can prevent Datadog Autodiscovery from detecting integrations, resulting in missing metrics and gaps in monitoring. Watch this video for a step-by-step process to write annotations. Note: This video focuses on Datadog Autodiscovery v2 syntax.
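One way to catch this class of problem before deploying is to parse each annotation value as JSON locally. The sketch below does only that syntax check. The sample annotation values are hypothetical, and the check does not validate anything Datadog-specific.

```python
import json
from typing import Tuple


def check_annotation_json(value: str) -> Tuple[bool, str]:
    """Return (is_valid, error_message) for a JSON-valued pod annotation string."""
    try:
        json.loads(value)
    except json.JSONDecodeError as exc:
        # Report where parsing failed so the typo is easy to find.
        return False, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return True, ""


# Hypothetical annotation values, for illustration only.
good = '{"redisdb": {"init_config": {}, "instances": [{"host": "%%host%%"}]}}'
bad = '{"redisdb": {"init_config": {}, "instances": [{"host": "%%host%%"},]}}'

print(check_annotation_json(good))
print(check_annotation_json(bad))
```

A trailing comma like the one in `bad` is a common slip when annotations are edited by hand inside YAML.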

React Native SDK 8.0.0 is here

We just released React Native SDK 8.0.0, here's what's new, and what's changed. It's been a while since the last major version. The last major release, 7.0.0, shipped on September 2, 2025. After 13 minor and 2 patch releases, it's finally time for a new major version to land: 8.0.0. This version is a maintenance and capability major. This means we: It should be straightforward to upgrade, but check the migration guide for your setup.

Bindplane Blueprints for Elasticsearch: Production-Ready NGINX Log Pipelines for Kibana

We've just released new and easy-to-use Bindplane blueprints designed specifically for Elasticsearch as a destination. These blueprints empower teams to quickly transform raw events such as those from NGINX access and error logs into clean, structured, and ECS-compliant data optimized for high-performance visualization in Kibana.

What is OpenTelemetry and Why Do Organizations Use it?

Mining for information about environments is like trying to find gold. Looking for gold can mean sifting through silty waters or blasting through a mine. In some cases, the gold nuggets are so small as to be almost invisible, some things look like gold but aren’t, and others are larger nuggets where the miner strikes it rich. Trying to understand how a distributed system works means sifting through vast amounts of telemetry, looking for patterns.

The 2025 Wake-Up Call for Engineering Teams

For years, organizations tried to solve operational pain by collecting more data, adding more dashboards, and consolidating more tools. But 2025 exposed a deeper mismatch. Systems had become more distributed, AI-assisted, and interdependent than ever before, while teams had shrunk and on-call pressure had intensified. This wasn’t a tooling failure. It was an architectural and cognitive one.

Use AI to turn any JSON API into a dashboard in minutes with the Infinity data source plugin and Grafana Assistant

The internet is full of fascinating data just waiting to be visualized and queried. And with the latest update to Grafana Cloud, you can start doing it in minutes. Through public APIs, you can access information about global earthquake activity, weather forecasts, music catalogs, and millions of other datasets. And then there's all the data that sits inside company APIs, partner services, and internal platforms that power everyday products and operations.

Who Watches the Vibe Coder?

AI didn’t replace developers. It replaced the part where you were forced to understand what you just shipped. Now you can prompt your way to a feature, skim the diff, and merge something that “seems reasonable.” And then production does what production always does: finds the one weird browser + one slow network + one user flow that turns your “reasonable” code into a bonfire. So who watches the vibe coder?

The 5 Pillars of DEXOps Explained: Turning Digital Experience into Business Impact

Most IT leaders agree on one thing: digital employee experience matters. What is less clear is how to operationalize it in ways that deliver measurable business outcomes. Many organizations invest in tools and dashboards, launch experience initiatives, and even measure sentiment. But without an operational model that connects employee experience to core business objectives, IT teams often stay stuck in reactive support. DEXOps changes that.

Kiro Can Now Use Lightrun via MCP

AI code assistants transformed how software is written. They did not transform how it fails. Today, we’re announcing a new MCP integration between Lightrun and Kiro. Kiro now gains live runtime visibility through the Lightrun MCP, grounding AI-assisted development in how code actually behaves at runtime. Kiro, the AI coding assistant from the teams at AWS, is built for velocity and intuition. It helps teams move from specification to production faster by turning intent into working code.

How to Make AI-Generated Code Reliable with Runtime Context

AI coding assistants like Cursor and Claude Code are driving massive productivity gains, yet they have introduced a critical validation gap in the software delivery lifecycle. While these tools excel at generating syntax, they lack visibility into live production environments. This article explains how Runtime Context, the missing nervous system of AI development, secures production by moving from probabilistic guessing to deterministic, live code validation.

Move to ManageEngine Site24x7 to elevate your website monitoring

Organizations using entry-level tools face limited visibility, slow issue response, and scalability challenges that increase downtime risks. ManageEngine solves this with its enterprise-grade, AI-powered platform, delivering end-to-end digital experience monitoring in cloud and on-premises versions. Switching isn't only easy; it also brings predictive intelligence, global precision, and seamless growth support to your workflows, protecting your revenue while improving your operational excellence.

YouTube Outage (Feb 17, 2026). What Happened?

On February 17, 2026, YouTube went down for users worldwide. Starting around 8:00 PM ET, the platform's homepage, Shorts feed, sign-in system, smart TV apps, YouTube Music, and YouTube Kids all stopped working. Over 21,000 reports were logged on IsDown alone. The error message was the same everywhere: "Something went wrong." For consumer users, it was an inconvenience. For businesses that depend on YouTube — content teams, advertisers, media companies, live streamers — it was a blind spot.

Top 6 Cloud Monitoring Challenges in Hybrid & Multi-Cloud Environments

Hybrid and multi-cloud monitoring breaks down when teams can’t connect signals to customer impact fast enough to act. Hybrid and multi-cloud sound simple: run some workloads in public cloud, keep some on-premises, and connect it all. But in practice, you’re managing dependencies across teams and systems, tools that don’t share context, and incidents that refuse to stay in one place.

AI Query Assist for SolarWinds SQL Sentry

Rewrite inefficient SQL Server queries in seconds—not hours. In this demo, we show you how AI Query Assist in SolarWinds SQL Sentry transforms the way you tune performance. Watch how to take a problematic query from the "Top SQL" view and use generative AI to instantly generate optimized rewrites and uncover missing indexes. What you will see: Instant Optimization: How to automate query rewriting and get plain-language explanations of the logic changes.

Designing Alerts for Action

In the first two posts of this series, we explored how alert noise emerges from design decisions, and why notification lists fail to create accountability when responsibility is unclear. There’s a deeper issue underneath both of those problems. Many alerting systems are designed without being clear about the outcome they’re meant to produce. When teams don’t explicitly decide what they want to happen as a result of a signal, they default to the loudest option available.

Unlimited Team Sizes for All

Starting from today, Healthchecks.io users on all plans (Hobbyist, Supporter, Business, Business Plus) can invite an unlimited number of users into their projects. Previously, the limits were: 3 team members for Hobbyist and Supporter, 10 team members for Business, and unlimited team members for Business Plus. From now on, it is unlimited for all.

Improve performance and reliability with APM Recommendations

SREs and application developers rely on telemetry data to understand and improve their systems. As organizations scale and evolve, those systems generate an ever-growing volume of metrics, logs, and traces. But more data alone does not make it easier to improve performance or reliability: Identifying meaningful optimizations still requires careful investigation and analysis.

Turn Raw Data into Reliability by Changing Performance Perspectives

In a global microservices architecture, technical performance initially presents as a chaotic stream of disconnected telemetry. For a Technical Program Manager (TPM), success depends on the ability to move past these disconnected individual data points to identify stable patterns. If they have services entering critical states, looking at individual logs or traces is inefficient. Protecting system reliability requires an engine that can automate pattern recognition at scale.

Introducing: Checkly Agent Skills

AI coding agents are excellent at writing code. Ask Claude Code, Codex, or Cursor to add a feature, and it just works. At Checkly, we were ready for the new agentic world from the start! Monitoring as Code means your entire monitoring setup lives in your repository. API Checks, Browser Checks, alert channels, status pages; everything is defined in code, managed with the Checkly CLI, and version-controlled like any other part of your stack.

How a Singleton Pattern Broke Our Django Logging

With modern tooling and agentic coding assistants, straightforward bugs are almost a relief. If a test can catch it, or a user can reproduce it, chances are you can squash it quickly. The harder category, and the one worth writing about, is the bugs where everything looks correct. Your code runs, no exceptions are thrown, your debug statements confirm the right functions fire at the right times, and yet nothing works.

Unlocking business resilience with full-stack observability in hybrid IT environments

For CIOs and technology leaders across the Gulf Cooperation Council (GCC), full-stack observability is a strategic lever for achieving faster ROI, operational resilience, and digital maturity. By integrating AI-powered insights and automation, IT leaders can streamline operations and align technology outcomes with business goals. Demonstrating ROI within tight timelines is critical, as is leveraging observability to maintain competitive advantage in a rapidly evolving market.

OpenTelemetry support for .NET 10: A behind-the-scenes look

At Grafana Labs, we are fully committed to the open source OpenTelemetry project and are actively engaged with the OTel community. Many Grafanistas spend a large proportion of their time contributing directly to OpenTelemetry upstream projects, helping make observability more powerful, reliable, and accessible for everyone as part of our big tent philosophy.

Teaching AI How to Refinery

At the beginning of February, we released v3.1 of Refinery, our advanced, tail-based sampling solution. The new version comes with more performance enhancements, bug fixes, and a few new pieces of telemetry. In tandem with the 3.1 release, we also released a new tool for our MCP server which helps your AIs understand Refinery, and how Honeycomb handles sampling.

The New Standard for Operational Decision-Making: Why Trustworthy Guidance Matters More Than Ever

Modern IT operations sit at the center of revenue, customer experience, and business continuity. Every decision engineers make influences far more than the technical domain, which is why teams need intelligence they can validate, reasoning they can understand, and guidance they can rely on. In an environment shaped by rapid change and expanding dependencies, decisions must be grounded in accuracy and context to avoid unnecessary risk.

From random chunks to real code - wiring up Next.js source maps in Sentry

When you ship a Next.js app, the React and TypeScript you write aren’t what your users actually download. Next.js compiles, minifies, splits, and shuffles your code into chunks in ways that are great for performance and terrible for debugging. This post shows you how that pipeline works, how source maps and debug IDs connect it all back to your original code, and how to wire things up so Sentry shows you real file names and line numbers instead of an unreadable stack trace.

What is the Model Context Protocol (MCP)

Iron Man’s J.A.R.V.I.S. is the artificial intelligence (AI) that almost everyone wants to see: a conversational technology that answers questions like a friend would. The rise of large language models (LLMs) almost seems to give people the friendly robotic sidekick that generations of children grew up dreaming about.

Already Love Scout APM? We Have Integrated Error Monitoring!

The error monitoring scene has changed a ton over the past few years. We've gone from basic exception tracking to fully integrated platforms that correlate errors with performance metrics and logs. We’ve even got AI-powered debugging! But in the midst of the AI explosion, some things remain unchanged and most teams are still drowning in data with little actionability.

Productivity in the Age of AI - DEXOps 1:1 with Scott Pope

In the first of a new rotating expert series, Scott Pope (Nexthink's Director of Value Advisory) joins to explore DEXOps, productivity, and why DEX has firmly entered the boardroom conversation. We talk about how the market has evolved, what AI is really changing, how to communicate value to senior leaders, and the story behind the DEX Productivity Report. Also: Arsenal. Briefly. And yes, Tom still needs to update the show music. Hang in there.

OpenTelemetry Production Monitoring: What Breaks, and How to Prevent It

OpenTelemetry almost always works beautifully in staging, demos, and videos. You enable auto-instrumentation, spans appear, metrics flow, the collector starts, and dashboards light up. Everything looks clean and predictable. However, production has a way of humbling even the most carefully prepared setups. When real traffic hits, and it always spikes sooner or later, you start seeing dropped spans.

Bindplane | Blueprints for ClickHouse: Optimize Telemetry Before It Hits ClickStack

Chelsea from the Customer Success team walks through the Bindplane Blueprints for ClickHouse guide, showing how to optimize logs, metrics, and traces before they land in ClickStack. ClickHouse is powerful. But raw telemetry at scale gets expensive fast. Bindplane acts as the control plane for your OpenTelemetry infrastructure. Blueprints let you apply production-ready processing logic instantly without YAML sprawl or config drift.

Microsoft Entra ID secrets and certificates: One of the most preventable causes of enterprise application failures

All it takes to make critical applications fail, customer portals crash, and internal systems inaccessible is one expired client secret. Not a sophisticated cyberattack. Not a worldwide cloud service outage. Just a single credential that quietly expired while everyone focused on "more important" things. Is secret expiry that big of a concern? Chances are great that enterprise-scale organizations have at least one expired credential in production right now.

Introducing "Explain Flame Graph": Stop Fighting Fires and Start Explaining Them

In a modern observability deployment, it’s simple to get data that helps you understand where your system is failing. However, when we try to understand why, the answer is often buried beneath a mound of stack traces. For many developers, attempting to interpret a flame graph by manually calculating self-time (the resources consumed by the function itself) versus child-frame latency (the time spent waiting on called sub-functions) is both confusing and time-consuming.
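
The self-time versus child-frame split described above comes down to simple arithmetic on each frame. The sketch below is a minimal illustration only: the `Frame` class, the function names, and the timings are hypothetical and do not come from any particular profiler or product.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One box in a flame graph (hypothetical structure for illustration)."""
    name: str
    total_ms: float  # time spent in this frame, including everything it calls
    children: list["Frame"] = field(default_factory=list)


def child_time(frame: Frame) -> float:
    """Time spent waiting on directly called sub-functions."""
    return sum(child.total_ms for child in frame.children)


def self_time(frame: Frame) -> float:
    """Time consumed by the function itself, excluding its children."""
    return frame.total_ms - child_time(frame)


# Made-up timings for a single request.
root = Frame("handle_request", 120.0, [
    Frame("query_db", 80.0, [Frame("serialize", 10.0)]),
    Frame("render", 25.0),
])

print(self_time(root))              # 120 - (80 + 25) = 15.0
print(self_time(root.children[0]))  # 80 - 10 = 70.0
```

In this made-up trace the widest frame, `handle_request`, does little work itself; most of the time sits in `query_db`'s own self-time.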

3 Best Tools to Check DNS Records of Domains

DNS records are instructions that tell the internet how to handle your domain. They store details like your website’s IP address, email servers, and security settings. When someone visits your site or sends you an email, DNS records guide the request to the right server. Without correct DNS records, websites can break, and emails can fail. Many tools let you check DNS records, but not all provide clear, reliable results. Some tools show only basic records, while others provide deep insights.

16 new integrations - powered by AIready Low Code Plugins

Today marks a big milestone in our mission to bring more data, more context, and more visibility into a single, unified view. We’re excited to announce 16 brand‑new integrations, extending the range of data sources you can connect with just a few clicks. But the integrations themselves are only half the story.

Healthchecks and Cron Jobs on Status Pages

You can now add healthcheck and cron job monitors directly to your status pages. Until now, status pages only supported HTTP monitors and browser checks. You can now display the status of your background jobs, scheduled tasks, and internal services right next to your existing monitors. Head to your status page settings to add healthchecks to your sections. Questions? Reach out via in-app chat or email us at hello@hyperping.io.

Troubleshooting Microservices with OpenTelemetry Distributed Tracing

Distributed tracing doesn’t just show you what happened. It shows you why things broke. While logs tell you a service returned a 500 error and metrics show latency spiked, only traces reveal the full chain of causation: the upstream timeout that triggered a retry storm, the N+1 query pattern that saturated your connection pool, or the missing cache hit that turned a 50ms call into a 3-second database roundtrip.

Amazon Web Services outage - February 10, 2026

On February 10, 2026, Amazon Web Services (AWS) experienced an outage that triggered widespread reports of CloudFront failures and DNS resolution issues. While AWS later acknowledged the incident, StatusGator detected the disruption earlier using Early Warning Signals, giving customers valuable lead time before the provider confirmed anything publicly.

Accelerate incident resolution with Applications Manager's AI alert summary

Leverage AI to understand critical incidents in your IT infrastructure! With Applications Manager's AI-driven alarm summaries, understand incident cascades and dig down to the root cause of performance issues. Reduce cognitive labour and unlock actionable intelligence with the latest version.

Sovereign observability: How UAE data residency powers resilient digital economies

Cloud observability is a must for IT teams operating in modern digital economies. It allows administrators to see inside complex systems, understand how each component behaves under real conditions, and act before users or regulators feel the impact. In simple terms, observability transforms digital infrastructure from a black box into a transparent, accountable, and resilient system.

OpenTelemetry Deep Dive: Standards, Tracing, and the Future of Observability | Big Tent S3E6

OpenTelemetry co-founder Ted Young joins Grafana’s Big Tent podcast to explain how observability evolved beyond logs, metrics, and traces. Learn why tracing is just logging with context, how OpenTelemetry became a standard, and what’s next for zero-touch instrumentation and AI-driven observability.

Complexity to Clarity: Why Enterprises Are Choosing Progress WhatsUp Gold Over SolarWinds

A clearer path for enterprise administrators with simpler deployment, unified visibility and predictable costs. If SolarWinds feels too complex, costly or rigid to scale, it might be time to look for a new network monitoring solution. In this session, Progress WhatsUp Gold experts help enterprise admins achieve unified visibility and faster time to value without module or architecture sprawl, using transparent device-based licensing.

InfluxDB 3 Core vs. Enterprise

In this video, Senior Developer Advocate Cole Bowden walks you through the key similarities and differences that exist in InfluxDB 3 Core and InfluxDB 3 Enterprise. As an open source offering, Core thrives at data collection on the edge and providing real-time insights into fresh data, while Enterprise includes support, compaction for performant historical analysis over wide windows, better scaling and security for enterprise-scale operations.

NIS2 and CER Serve a Broader Purpose Than Cybersecurity - The 5 Biggest Risks You Need to Address Now

The European directives NIS2 (Network and Information Security Directive 2) and Critical Entities Resilience (CER) Directive have rapidly sharpened the conversation around digital resilience. While many organizations initially viewed these directives as an extension of their cybersecurity obligations, it is becoming increasingly clear that much more is at stake. These directives require a strategic transformation in how organizations manage risks, processes, and responsibilities.

The evolution of OpenTelemetry: A deep dive with co-founder Ted Young

Sometimes the biggest challenges in software aren’t about code — they’re about consensus. What do we call things? What do we standardize? And how do you evolve a system that thousands of companies depend on without breaking everything along the way?

AI-driven caching strategies and instrumentation

The things that separate a minimum viable product (MVP) from a production-ready app are polish, final touches, and the Pareto 'last 20%' of work. Most bugs, edge cases, and performance issues won't show up until after launch, when real users start hammering your application. If you're reading this, you're probably at the 80% mark, ready to tackle the rest.

AI Is Everywhere, So Why Isn't It Delivering Business Value?

Enterprises have never had more access to artificial intelligence and less certainty about what it is delivering. Generative AI tools now sit inside everyday workflows, embedded across productivity software and operational systems employees rely on for critical work. They generate insight at scale, reveal patterns more clearly than before, and offer earlier visibility into potential risk.

The Fragmentation Tax: What Multi-Tool Incident Response is Really Costing You

Here’s a question that sounds simple but isn’t: When something breaks in your environment, how long does it take your team to agree on what they’re looking at? Not how long it takes to fix it—that’s a different problem. I mean: how long does it take for everyone on the bridge to have the same basic understanding of what’s broken, where it started, and what it’s affecting?

What Is Web Transaction Monitoring?

Quick Answer: Web transaction monitoring is a type of synthetic monitoring that uses scripted browser tests to simulate and validate multi-step user workflows, such as logins or checkouts. It proactively checks application functionality and performance from end-to-end, ensuring critical user journeys work correctly before customers are impacted.

Honeybadger supports SSL certificate expiration monitoring

When you have a lot of websites, SSL certificate expiration monitoring can be a lot of work, especially without using a certificate authority such as Let's Encrypt. The last thing you want is an outage because a random SSL certificate wasn't set to auto-renew and expired! Honeybadger has your back! That's why we added SSL certificate warnings to our existing uptime monitoring feature.

8 Steps Companies Can Take To Strengthen Business Premises Security

Improving the safety of your business premises is a continuous process. New threats appear every year, and physical vulnerabilities can put your team and your assets at risk. Taking a proactive approach helps you stay ahead of potential intruders.

Top tips to organize your digital workspace

Top tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we’re tackling a growing challenge for modern professionals: organizing digital workspaces in an era where files, apps, and notifications constantly compete for attention. As work becomes increasingly cloud-based and collaborative, a cluttered digital environment can slow teams down, create confusion, and impact productivity. The good news?

Syslog Checks: How to find Insights in the Data Flood

Every SysAdmin knows the feeling. They are swimming in logs, terabytes of them. Every daemon, service, and kernel subsystem is religiously writing its activities to syslog. The data exists. The signals are there. Yet, somehow, incidents are still unpredictable. How is this even possible? Here's why this happens: Traditional syslog infrastructure was designed for storage and retrieval, not detection and response.

Claude outage - February 10, 2026

On February 10, 2026, Claude users around the world began reporting service failures affecting chat sessions, API integrations, and Claude Code workflows. The first verified outage report reached StatusGator at 19:33 UTC. StatusGator issued an Early Warning Signal at 20:24 UTC. Claude did not post an official “Investigating” update until 22:11 UTC. This incident clearly demonstrates the gap between real user impact and official status page updates.
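
The size of that gap can be worked out directly from the three timestamps quoted above. This is a small arithmetic sketch; the helper names are illustrative, and only the times themselves come from the post.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Build a UTC timestamp on February 10, 2026 from an HH:MM string."""
    naive = datetime.strptime(f"2026-02-10 {hhmm}", "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


first_report = utc("19:33")     # first verified outage report
early_warning = utc("20:24")    # Early Warning Signal issued
official_update = utc("22:11")  # official "Investigating" update

print(minutes_between(first_report, early_warning))     # 51
print(minutes_between(early_warning, official_update))  # 107
print(minutes_between(first_report, official_update))   # 158
```

By these numbers, the Early Warning Signal preceded the official update by 1 hour 47 minutes, and the first verified report preceded it by 2 hours 38 minutes.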

Uptrace Errors & Logs Tutorial: Capture Stacktraces, Context, and Traces in One Place

Every error tells a story, and Uptrace helps you see the full picture. In this tutorial, you’ll learn how to use Uptrace to capture errors, logs, stacktraces, and request context in a single observability platform. See how errors automatically link to traces, understand exactly what happened, and debug issues faster with rich attributes, user data, and performance impact. Understand not just *what broke*, but *who it affected and why*, and fix problems with confidence using Uptrace.

Uptrace Tutorial: Dashboards, Percentiles, Heatmaps & OpenTelemetry Metrics

Learn how to use *Uptrace* to measure what truly matters in your applications using percentiles, heatmaps, and histograms, then turn that data into dashboards that answer questions before they’re even asked. Whether you’re setting up observability for the first time or replacing expensive monitoring tools, this guide shows how Uptrace helps you understand performance, reliability, and user experience, all in one place.

End-to-End Tracing with Uptrace: Follow Any Request Across Your Entire System

Stop guessing where requests slow down. With Uptrace, you can follow any request across your entire system and instantly see performance bottlenecks, errors, and latency sources. Build real observability, not just dashboards.

How to Prepare Your Network for RTO (Return-to-Office Mandates)

IT teams are being held hostage in the return-to-office debate. They didn't even get a seat at the table. And if you're not at the table, you're on the menu. The job market has cooled dramatically. Canada's unemployment rate hit 7.1% in August 2025, which is the highest since May 2016, excluding pandemic years. Employers noticed. And the RTO mandates started rolling out fast. The debate is heating up. Employees don't want to give up remote work. Executives want people in the office seats.

Splunk Attack Range v5 Demo

The Splunk Attack Range is an open source project that lets security teams spin up instrumented cloud environments, simulate adversary behavior, and use the generated telemetry to build and test detections in Splunk. Whether you are a detection engineer tuning rules, a purple team validating coverage, or a developer automating tests, Attack Range gives you a repeatable, cloud-based lab. This post highlights what Attack Range does, how it works, and how to get started - whether you prefer a web UI, a REST API, or the command line.

Will humans be replaced by AI? The truth

Agentic AI doesn’t replace analysts, it augments them. The real value comes from making teams more efficient, not smaller. This is the perspective most people miss. Additional Resources: About Elastic Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform — the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Understanding Lighthouse: Speed Index

You run Lighthouse and it tells you your Speed Index is bad. But the page looks like it loads fine. You see stuff on screen early. So why is Lighthouse acting like your site is a sloth? Speed Index is a “how fast does this page visually fill in” metric. Not “when did the first pixel show up” (that’s FCP) and not “when did the main content show up” (that’s LCP). It’s the whole above-the-fold loading experience, averaged over time.
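
One common way to make "averaged over time" concrete is to treat Speed Index as the area above the visual-completeness curve: every millisecond counts against the page in proportion to how much of the viewport is still missing. The sketch below is a simplified illustration of that idea, not Lighthouse's actual implementation; the function name and sample values are hypothetical.

```python
def speed_index(samples: list[tuple[float, float]]) -> float:
    """Approximate Speed Index in milliseconds.

    samples: (time_ms, visual_completeness) pairs sorted by time, where
    completeness runs from 0.0 (blank) to 1.0 (visually complete).
    Each interval is weighted by how incomplete the page was during it.
    """
    total = 0.0
    prev_time, prev_complete = 0.0, 0.0
    for time_ms, complete in samples:
        total += (time_ms - prev_time) * (1.0 - prev_complete)
        prev_time, prev_complete = time_ms, complete
    return total


# Two hypothetical pages that both finish painting at 3000 ms.
paints_early = [(500, 0.75), (3000, 1.0)]   # most content visible at 500 ms
paints_late = [(2500, 0.25), (3000, 1.0)]   # mostly blank until 2500 ms

print(speed_index(paints_early))  # 500*1.0 + 2500*0.25 = 1125.0
print(speed_index(paints_late))   # 2500*1.0 + 500*0.75 = 2875.0
```

Both pages are visually complete at the same moment, yet the one that stays mostly blank for longer scores far worse. An early first paint alone does not guarantee a good Speed Index.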

Dashboarding Azure: SquaredUp vs Grafana

If you’re looking for a dashboarding solution today, chances are you’ve looked at Grafana or SquaredUp — or both. Grafana is a popular open source dashboarding tool with on-prem and cloud variants, while SquaredUp is the SaaS, cloud-based unified dashboarding solution. Both offer a comprehensive list of data sources that they can plug into and build dashboards. As such, they both also offer an integration with Azure - which is the focus of our discussion today.

How to Migrate an Icinga 2 Master in a High Availability Setup

Moving an Icinga 2 master to a new machine requires careful preparation, especially in a master-to-master high availability setup. In production environments, such migrations are often part of broader infrastructure changes, platform standardization, or long-term monitoring strategy decisions. This guide walks you through the process step by step, ensuring a smooth migration without service interruption while keeping your monitoring platform stable and consistent across the environment.

AI observability: The backbone of mission resilience in the public sector

Downtime cost the public sector $193 million last year — and the financial hit is only the beginning. Beyond the numbers, downtime in the public sector can also lead to severe consequences for citizens: interrupted access to critical online services, delayed benefits, and stalled emergency response. When citizens cannot rely on government services, downtime becomes more than an inconvenience; it becomes a matter of trust. More than uptime, resilience is the new success metric for modern government.

Troubleshooting & RCA with Olly

If troubleshooting still feels harder than it should, check on these two numbers: how many dashboards you have, and how many alerts fire every day. For most teams, it’s hundreds of dashboards and thousands of alerts, a sign of maturity, coverage, and good intentions. On the other hand, we also see that when something actually breaks, that coverage rarely turns into clarity fast enough.

Monitor Fortinet FortiManager performance in Datadog

As enterprises scale, teams often find it harder to identify user-reported issues. Software-defined wide area networks (SD-WANs) can make it easier to add branch offices, but they can also make it more challenging to distinguish connectivity degradation from changes in application behavior. FortiManager provides a centralized control plane for Fortinet Secure SD-WAN and reduces operational complexity.
Sponsored Post

From cloud costs to cloud value: The role of performance analytics in increasing ROI

Many cloud providers offer services that scale with usage. However, unanticipated overutilization of compute instances, serverless functions, or managed databases can quickly drive up costs. Managing these resources effectively is crucial for keeping cloud spending predictable.

AWS CloudFront Outage (Feb 2026): Timeline, Cascade, and Lessons

At approximately 9:15 PM UTC on February 10, 2026, Amazon CloudFront began returning NXDOMAIN responses for DNS queries against specific distributions. In practical terms: DNS was telling users that services behind those distributions simply didn't exist. The root cause was a DNS resolution failure within CloudFront's infrastructure that quickly spread to eight interconnected AWS services.

Why Monitoring Matters for Modern Hosting Platforms

With all the discussion in the dev community lately about changes made at Heroku, we wanted to use this moment to talk about PaaS (Platform as a Service) providers and how AppSignal can be a vital tool to ensure you're using your app's hosts for everything from optimal performance to lower usage bills.

Improve test coverage across codebases with Datadog Code Coverage

As codebases grow across many different services, it becomes harder to see what test suites actually cover. AI-assisted development and faster release cycles increase the volume of changes landing in repositories, raising the risk that untested code will make it through to production. To maintain a high standard, teams need clear and scalable visibility across repositories, consistent testing standards, and a way to catch blind spots before they reach users.

Move fast, don't break things: Consistent testing standards at scale

Moving quickly is essential for modern engineering teams, but speed without guardrails can introduce hidden risks in testing. As organizations scale, teams often define and apply coverage standards inconsistently across services and repositories. What qualifies as “acceptable coverage” in one project may be completely different in another. Without automated enforcement, untested code can slip through reviews.

A Step-by-Step Look at how Agentic, Autonomous ITOps Resolves Incidents

Agentic, autonomous ITOps improves incident response by carrying context from detection through resolution, reducing noise, delay, and manual coordination. Most IT incidents don’t fail due to missing data. Monitoring systems generate more than enough signals. The problem is that understanding those signals—and deciding what to do with them—happens in fragments. Engineers move between dashboards, logs, tickets, and chat threads, stitching together context by hand.

The Architecture Shift Powering Network Observability

If you work in network operations, you know that the only constant is the increasing complexity of the infrastructure you manage. The days of installing a monolithic software package on a single bare-metal server and letting it hum along for years are largely behind you. The software industry has largely shifted toward cloud-native architectures, microservices, and containerization. While these shifts promise agility and scalability, they also introduce significant operational complexity.

OpenTelemetry in Production: Design for Order, High Signal, Low Noise, and Survival

A lot of talk around OpenTelemetry has to do with instrumentation, especially auto-instrumentation, and with OTel being vendor neutral, open, and a de facto standard. But how you use the final output of OTel is what makes a business difference. In other words, how do you use it to make your life as an SRE/DevOps/biz person easier? How do you have to set things up to truly solve production issues faster?

A Notification List Is Not a Team

In the previous post, we looked at how alert noise is rarely accidental. It’s usually the result of sensible decisions layered over time, until responsibility becomes diffuse and response slows. One of the most persistent assumptions behind this pattern is simple. If enough people are notified, someone will take responsibility. After more than fourteen years of working with engineering teams of every size and shape, we’ve seen this assumption fail repeatedly.

Happy Birthday to Us: Honeycomb 10 Year Manifesto, Part 1

Christine and I started Honeycomb in 2016, which means it’s been ten years. Christine, a developer, and I, an operations engineer, were both profoundly unhappy with the state of the art in monitoring and logging tools. The tools we had used at Facebook didn’t spray our signals around to a bunch of siloed-off pillars. They consolidated as much context as possible so we could properly explore it, the way every other non-software engineering team already takes for granted.

Investigate Issues in Slack: Grafana Cloud Slack App with AI

The Grafana Cloud app for Slack brings observability and incident response closer to where you and your teams already collaborate. Ask questions about system health, alerts, on-call schedules, and Grafana Cloud features; manage incidents and alerts; and collaborate with full context.

Agent vs Assistant: The key distinction between Olly and the competition

The market is saturated with agents and assistants, making it difficult to tell them apart. However, the difference between these two approaches is significant. They offer radically distinct levels of impact, reflecting major differences in both their technical complexity and the quality of their inferences. Let’s figure out the distinction.

Sentry acquires XcodeBuildMCP

Today we're announcing that Sentry has acquired XcodeBuildMCP, an open source MCP server that gives AI agents the ability to build, test, and debug native iOS and macOS apps. XcodeBuildMCP has become a go-to tool for agentic Apple-platform development, with more than 4,000 GitHub stars and an active community. It unlocks the full developer loop: build, run, debug, interact, and verify, allowing users to stay in their preferred agentic development environment.

A new perspective on dashboard sprawl

Dashboards are supposed to answer questions, not create more of them. But investigations don't stop at a single view. The moment you want to understand one specific thing in detail, like a failing VM, a degraded service, or a slow pipeline, dashboards start to break down. You end up either building yet another dashboard or searching through many different ones. SquaredUp's Perspectives changes this.

What Agentic AI Is Really Made Of (Most People Miss This)

Agentic AI isn’t just an LLM. Without the right context, it gives generic answers. This is the component that makes its decisions actually useful.

How to Create and Manage Incidents in Uptime.com

Learn how to create and manage incidents on your Uptime.com Status Page to keep your subscribers informed about service disruptions and maintenance events in real-time. In this tutorial, we'll cover understanding incident statuses (Investigating, Identified, Monitoring, Resolved, and more), three ways to create a new incident, configuring incident details and timelines, adding updates with Markdown formatting, managing and editing incidents, notifying Status Page subscribers, and using the REST API for incident management.

VictoriaMetrics at FOSDEM, Cloud Native Days France, and CfgMgmtCamp Ghent

Last week, members of the VictoriaMetrics team, including myself, spoke at three very different but equally important community events: FOSDEM in Brussels, Cloud Native Days France in Paris, and CfgMgmtCamp in Ghent. Each event drew a different crowd with its own expectations, making them a good way to see where open source observability stands today and how VictoriaMetrics is adapting to real-world needs. The talks we gave were snapshots of the problems we are actively working on.

AI Query Assist for SolarWinds Database Performance Analyzer

Is your database slow? Let AI do the heavy lifting. Watch how SolarWinds DPA’s AI Query Assist transforms query tuning from a manual headache into a streamlined process. This demo shows you how to get instant, AI-powered recommendations for your worst-performing queries while maintaining the control to review and verify every fix. It’s not just about finding the problem—it’s about fixing it faster.

How to run checks on internal services with Grafana Cloud Synthetic Monitoring

Many critical services run inside private networks, where traditional monitoring tools and practices can’t offer full visibility. This makes it difficult to validate service availability and performance before problems impact your users. Synthetic Monitoring — a Grafana Cloud solution that helps you proactively monitor the performance of your applications and services — addresses this gap with a feature known as private probes.

What is DEX Ops?

For decades, IT operations have been built around incidents, SLAs, and ticket closure rates. Success has been defined by how quickly tickets are resolved and whether service levels are met. But the modern digital workplace has changed. Employee productivity, digital adoption, collaboration quality, and business performance depend on far more than ticket metrics. A device that “works” but performs poorly still erodes productivity.

Why distributed observability is straining and what new research reveals

Distributed systems quietly run much of today's digital world. People expect these systems to work reliably across regions and time zones for everything from money transfers to streaming platforms and AI-driven workloads. As organisations use more microservices, containers, and event-driven architectures, observability has become the main way for teams to understand what is happening in production.

Landscape Operations Automation beyond SAP Landscape Manager

During the summer of 2024, SAP quietly announced the end of the Landscape Manager (LaMa) product. You can find out more from SAP directly in its LaMa Discontinued community post, including linked SAP Notes. Unlike the news for Solution Manager or Focused Run, where the 2027 date signals a transition to extended support options, LaMa is discontinued outright and extended support options aren’t available. For customers using LaMa, the announcement and timeline are disruptive.

Introducing Skylar Advisor: You Need an Advisor, Not an AI Assistant

Skylar Advisor is a next-generation experience powered by Skylar AI, built to help IT teams focus on what matters right now. In this video, ScienceLogic Chief Product Officer Michael Nappi shares how Skylar Advisor proactively curates and summarizes key signals across monitoring tools, logs, and streaming telemetry into clear advisories your team can act on in seconds.

What Companies Get Wrong About Autonomous IT, And What Actually Moves Them Forward

Many organizations approach Autonomous IT with the assumption that adding more tools, more data, or more automation will eventually produce self-governing operations. This assumption creates the illusion of progress. Complexity does not resolve itself when new systems are layered on top of existing ones. In most environments, each new tool adds another interpretation of the truth, which compounds the cognitive load on teams and forces more reconciliation, not less.

VictoriaLogs in VictoriaMetrics Cloud: Fast, Cost-Effective Log Management is Here

Yes, you got it right: VictoriaLogs is now Generally Available in VictoriaMetrics Cloud! We believe that this is a huge milestone in our journey to deliver what our users are expecting from us: a complete, managed observability solution. If you’ve been following our quarterly updates, you know we’ve been after this launch for a while. In our latest update a few weeks ago we already announced that we were ready and today we’re making it official.

Exploring Splunk Alternatives [2026]: Deep Dive into Log Analysis

Splunk isn't bad software. It's genuinely powerful. But in 2026, a lot of engineering teams are asking a fair question: are we getting $300K worth of value out of this? More often than not, the answer is no. We went through 15 alternatives - read the docs, tested where we could, and talked to engineers who made the switch. This is what we found.

Same Work, More Windows: Why AI Isn't Paying Off Yet (w/ Anthony Firmin)

In the first episode of a NEW ERA for the DEX Show, Tom (that's right, just Tom) welcomes back AI and digital transformation leader Anthony Firmin to unpack the reality of enterprise AI adoption. Drawing on hard-won, real-world experience, Anthony explores why so many organisations are stuck in the “messy middle” of AI, where usage rises but value doesn’t. The conversation digs into trust, experience debt, shallow versus deep AI, and why “same work, more windows” is an early warning sign leaders ignore at their peril. It’s a grounded, human-centred look at what it really takes to make AI improve work, not just change it.

Build, buy, or open source? Understanding your options with Grafana's AI-powered observability

Some questions in engineering never go away. Here’s one that every team eventually confronts: Do we roll up our sleeves and build the tooling ourselves, or do we buy something built for us? It’s a choice that has the power to speed teams up or hold them back. With the rise of AI-powered observability, this familiar software dilemma has re-emerged with higher stakes and faster-moving technology.

SRE Report: AI optimism and the economics of effort

For eight years, the survey behind the SRE Report has used a consistent methodology. That consistency allows us to track how reliability work evolves over time, rather than relying on snapshots. One of the most stable questions in the survey asks respondents to estimate how much of their work, on average, is spent on toil. Between 2020 and 2024, responses showed a gradual decline in reported toil.

What problem is agentic AI trying to solve?

Agentic AI isn’t limited to security operations. It’s already improving hospitals, financial systems, and service industries by reducing overload and filling skill gaps. Here’s the problem it was actually built to solve.

IT Leadership Best Practices: Why People Matter More Than Tools - SolarWinds TechPod 106

IT leadership best practices fail when organizations focus on tools instead of people. In this episode of SolarWinds TechPod, hosts Chrystal Taylor and Sean Sebring speak with Jon Collins, Field CTO and VP of Engagement at GigaOm, about what truly drives success in IT leadership—people, culture, and principles. This expert discussion breaks down why frameworks like Agile, ITIL, DevOps, and AI-driven operations succeed or fail based on leadership behaviors, prioritization, and trust—not technology alone.
Sponsored Post

How to improve your Crash Free Users score in minutes

If you're reading this blog, you likely already know the importance of quality software. But with the overwhelming number of metrics that can be monitored and improved, development teams are struggling with what metrics they should prioritize to have the most significant impact. The Crash Free Users score in Raygun is a perfect place for development teams who care about software quality to focus their efforts. It tells you what percentage of users didn't encounter a crash or error while using your software and is an ideal north star to gauge the overall quality of your software.
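As a rough illustration of the metric described above, here is a minimal sketch that computes a crash-free users percentage from two counts. The function name and inputs are hypothetical and chosen for illustration; this is not Raygun's API or its exact calculation.

```python
def crash_free_users(total_users: int, users_with_crash: int) -> float:
    """Return the percentage of users who did not hit a crash or error.

    Both arguments are hypothetical counts for a chosen time window.
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    if not 0 <= users_with_crash <= total_users:
        raise ValueError("users_with_crash must be between 0 and total_users")
    # Users without a crash, expressed as a share of all users.
    return 100.0 * (total_users - users_with_crash) / total_users


if __name__ == "__main__":
    # 30 of 2,000 users saw a crash, so 98.5% were crash free.
    print(crash_free_users(2000, 30))
```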

How an AI assistant and MCP server deliver real-time cloud cost insights

Cloud costs don’t grow quietly. They spike, drift, and surprise teams at the worst possible moments, usually when someone finally opens a dashboard. While cloud cost management tools are powerful, getting quick answers often still means navigating multiple views, applying filters, exporting reports, and looping in the right people. But what if cloud cost analysis worked more like a conversation?

January 2026: IsDown Users Saved 9.2 Hours with Early Outage Detection

In January 2026, IsDown's early detection system gave users a cumulative advantage of 9.2 hours across 34 incidents — that's over half a business day of advance warning before vendors officially acknowledged their outages. The largest single detection advantage? A massive 2.2 hours for a SendGrid email delivery issue that left customers in the dark while their emails failed to reach Microsoft inboxes.

Detecting incidents without components

StatusGator monitors services and their individual components, so you can stay informed about the systems you rely on – and filter down to only the components you care about. Most status pages do a good job of tagging incidents to the affected components. But sometimes providers publish incident updates without marking any components as impacted, even when the incident clearly affects something real.

Continuous profiling in production: A real-world example to measure benefits and costs

Continuous profiling offers deep visibility into production environments, revealing exactly how applications consume CPU and memory. It’s the go-to observability practice for directly connecting system behavior and performance to specific lines of code. But when teams consider deploying continuous profiling more broadly, a common question comes up: what’s the overhead? Is it safe to run continuous profiling on my production services 24/7, or does the cost outweigh the benefits?

How to Optimize Your Article with Surfer SEO

Writing a good article is not enough anymore. Millions of pages compete for user attention, and search engines decide which of them appear at the top of the results. Optimization matters because it determines which pages win that competition. The goal is to create content that answers users' search queries, and Surfer SEO exists specifically for that purpose.

What is agentic AI? (explained in 60 seconds)

Agentic AI is the next evolution of artificial intelligence. Unlike traditional AI, it can act autonomously and make decisions on its own. Here’s what that actually means, without the hype.

Dashboard organization isn't about folders - it's about visibility

Having well-organized dashboards is just as important as having good dashboards. But dashboard organization shouldn’t just make things easy to find. It should provide structure that supports collaboration and efficient troubleshooting. It has to be more than a basic folder system. This post looks at how classic dashboarding tools handle organization today, where they fall short, and how SquaredUp Workspaces organize for visibility and shared context.

AI NetOps: How AI and Machine Learning Transform Network Operations

AI is changing network operations (NetOps) from static automation into adaptive, data-driven systems that can summarize incidents, retrieve knowledge, and guide remediation with human oversight. In this talk, Phil Gervasi breaks down what “AI for NetOps” really means in practice, including the difference between classical ML and large language models (LLMs), why data pipelines matter more than model tuning, and how patterns like RAG (retrieval augmented generation), text-to-SQL, and agentic workflows turn raw telemetry into decisions.

Heartbeat behind the metrics | Muraleedharan on support, scale, and seeing the product in the wild

What does observability look like when you’re responsible for customers at scale? In this episode of Heartbeat Behind the Metrics, Muraleedharan Sadhasivam, Head of Customer Success, talks about his 15-year journey at ManageEngine and the perspective you only get from being close to customers every day. He shares why custom dashboards matter so much, and why AppLogs is a feature he wishes more users explored to complete the MELT story. From querying logs to turning them into alerts and dashboards, he explains how real insights start when data is brought together.

Track cyber security with Reports in Digital Risk Analyzer

Discover how Site24x7’s Digital Risk Analyzer Reports help you instantly uncover vulnerabilities and assess multi-domain risks. In this quick walkthrough, learn how to view domain health, generate detailed or consolidated reports, schedule automated delivery, and share PDF insights with your team. Perfect for IT admins, DevOps, MSPs, and business leaders who want fast, actionable visibility into their cybersecurity posture.

How Multispectral Drone Surveys Enhance Monitoring and Operational Intelligence

A multispectral drone survey is a powerful form of drone data analytics that captures invisible light data, enabling predictive maintenance and NDVI multispectral mapping with drones. This guide explains how industries use this UAV multispectral inspection service to move from reactive fixes to proactive, data-driven asset management with UAV multispectral data. However, many organizations still struggle to convert large volumes of monitoring data into timely, actionable insight.

Top 10 Port Monitoring Tools of 2025

Port failures don’t always take a service offline. A port stops accepting connections, times out intermittently, or gets blocked by a firewall change, while everything else looks healthy. When that happens, users feel the break long before uptime checks notice. This article reviews port monitoring tools from an operational point of view. It looks at how they detect closed or slow ports, how alerts behave in noisy environments, and where basic checks fall short during real incidents.
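To make the idea of a closed or slow port concrete, here is a minimal sketch of a TCP connect check using only the Python standard library. The function name `check_port` and the timeout value are assumptions made for illustration; the tools reviewed in the article may work quite differently.

```python
import socket
import time
from typing import Optional


def check_port(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Try a TCP connection and return the connect time in seconds.

    Returns None when the port is closed, filtered, or does not answer
    within the timeout.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return None


if __name__ == "__main__":
    elapsed = check_port("127.0.0.1", 22)
    if elapsed is None:
        print("port closed or timed out")
    else:
        print(f"port open, connected in {elapsed * 1000:.1f} ms")
```

Returning the connect time as well as the open or closed state lets a caller flag a port that still accepts connections but has become slow.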

Monitor Load Balanced DNS Records with CIDR Ranges

DNS Check's load balancer monitoring now supports CIDR notation, making it practical to monitor domains served by CDNs and cloud providers that use large IP pools. Instead of listing every possible IP address a provider might return, you can enter CIDR ranges like 104.16.0.0/13 and DNS Check will verify that responses fall within those ranges.
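The range check described above can be sketched with Python's standard `ipaddress` module. This is an illustration of the general idea under stated assumptions, not DNS Check's implementation; the function name and the sample addresses are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def addresses_outside_ranges(
    addresses: Iterable[str], cidr_ranges: Iterable[str]
) -> List[str]:
    """Return the addresses that fall outside every allowed CIDR range."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidr_ranges]
    outside = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        # An address passes if at least one allowed network contains it.
        if not any(ip in net for net in networks):
            outside.append(addr)
    return outside


if __name__ == "__main__":
    # Hypothetical DNS answers checked against the range from the text.
    answers = ["104.16.1.1", "104.23.255.255", "104.24.0.1"]
    print(addresses_outside_ranges(answers, ["104.16.0.0/13"]))
```

An empty result means every returned address fell inside the expected ranges.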

Firewall check: How long until you know your Firewall has been down?

Windows Firewall is enabled by default, right? How sure are you? Even if you are 99.999% sure, you may still have a vulnerability on your hands. There are numerous cases where someone disables Windows Firewall temporarily to troubleshoot a connectivity issue. The problem gets resolved. The firewall stays disabled—for months. Nobody notices until the security team investigates why sensitive data is suddenly appearing on dark web marketplaces.

Cloud Provider Status Report - January 2026

This report analyzes cloud provider status data for January 2026, covering 12 major cloud platforms: AWS, Azure DevOps, DigitalOcean, Fly.io, Heroku, Linode, Microsoft Azure, Netlify, Railway, Render, and Vercel. The data includes official incident reports from each provider's status page and early detection capabilities from IsDown's monitoring system.

Key Takeaways From the 2025 Gartner Market Guide for Event Intelligence Solutions

The 2025 Gartner Market Guide for Event Intelligence Solutions reflects a shift in how IT operations leaders evaluate AI-driven technologies. As AI hype gives way to more practical evaluation, we are seeing a natural departure from broad promises about AI capabilities toward clearly defined use cases and outcomes.

Event Intelligence Solutions Part Three: Best Practices for Successful Adoption

As Event Intelligence Solutions (EIS) move from early adoption to operational necessity, many enterprises are realizing that success depends on more than selecting the right technology. For Banking and Financial Services organizations, effective adoption requires a clear strategy, disciplined execution, and strong alignment with business priorities, regulatory demands, and, not least, customer expectations.

How we built Grafana Assistant - a conversation about AI development for observability

This conversation with Grafana Labs engineers, Mat Ryer, Cyril Tovena and Sven Großmann, dives deep into the engineering behind Grafana Assistant, exploring how agentic AI is transforming the observability landscape. From hackathon origins to sophisticated backend agents, the team shares candid lessons on building, scaling, and refining AI tools for engineers.

VirtualMetric DataStream + Google SecOps Integration: Pre-Ingest UDM Normalization at Scale

Google SecOps (formerly Chronicle) is widely used for large-scale security analytics, long-term telemetry retention, and detection across diverse environments. Its Unified Data Model (UDM) enables correlation across sources and supports analytics that operate over long time horizons. To take full advantage of these capabilities, security data must arrive in a consistent and well-structured UDM format. In practice, this is rarely the case.

Why Residential ISP ICMP Blocking Makes Remote Worker Monitoring Impossible (And What to Do About It)

Your company’s help desk receives fifteen "my connection is slow" tickets from remote employees in a single morning. Your network monitoring dashboard shows everything green: VPN concentrators running smoothly, bandwidth usage normal, no alerts. Yet employees can't get their work done. You try to ping their home routers. Nothing. Attempt a traceroute to diagnose the path. It dies at the ISP edge. Check your SNMP queries. They never make it past the residential gateway.

TraceExporter for VS Code

Percepio TraceExporter for VS Code makes it easy to export Percepio TraceRecorder snapshots during your debug session and open them directly in Percepio Tracealyzer. This is applicable for embedded systems based on Zephyr, FreeRTOS, SafeRTOS, Cesium, ThreadX or PX5, or using TraceRecorder’s “Bare Metal” option. The extension is currently provided in a Beta version as a downloadable .vsix file.

Instrumenting Code Using Prism and the Ruby Abstract Syntax Tree

A repository for this article can be found here. When most developers think about request tracing, they picture instrumentation hooks inside familiar libraries. This allows us to track familiar metrics we see in application performance monitoring (APM) tools such as the duration of an HTTP call or how long a database query takes. But what if you could go deeper and instrument your own Ruby code automatically, without sprinkling timing calls everywhere?

Chrysalis Backdoor: What You Need to Know - and How Progress Flowmon Threat Briefing Helps You Stay Ahead

A newly analyzed threat, Chrysalis, is a sophisticated backdoor attributed to the Chinese APT group Lotus Blossom. The malware employs advanced evasion techniques including heavy obfuscation, API hashing, dynamic DNS resolution, custom encryption and stealthy C2 communication disguised as legitimate traffic.

How to Automate Alerts for Critical Directory Changes with Site24x7 Server Monitoring

It takes just one misconfigured deployment script to silently dump TBs of debug logs into a production server's /var/log directory. By the time anyone notices, the disk will be at 98% capacity, and multiple microservices will have already crashed. Incidents like these usually take hours to remediate and cost the team an entire sprint's worth of goodwill with stakeholders. This should never happen.
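The scenario above comes down to noticing disk growth before it reaches a critical level. Here is a minimal standard-library sketch of such a threshold check. It is not how Site24x7 implements its monitoring; the function names and the 90% threshold are assumptions chosen for illustration.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def over_threshold(path: str, threshold_percent: float = 90.0) -> bool:
    """Return True when usage is at or above the (hypothetical) threshold."""
    return disk_usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    # "." stands in for a log directory such as /var/log.
    path = "."
    print(f"{path}: {disk_usage_percent(path):.1f}% used")
    if over_threshold(path):
        print("ALERT: disk usage above threshold")
```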

Skylar Advisor: Proactive Guidance for Modern Operations

Meet Skylar Advisor, bringing trusted and verifiable guidance to IT operations by connecting real time observability with your data and knowledge. Built AI native, it helps teams cut through alert floods, understand what matters most and why, and take the next best steps with confidence. Every recommendation is evidence backed and traceable to the exact data and sources used, so guidance is clear, explainable, and defensible when the stakes are high.

Heartbeat behind the metrics | Raghavan on building Site24x7

How do you build an observability platform that keeps up with constant change? In this episode of Heartbeat Behind the Metrics, Srinivasa Raghavan Santhanam, Director of Product Management at Site24x7, reflects on more than 15 years with the product and what he sees as its quiet strengths. He talks about GenAI as a hidden gem inside Site24x7, and you'll hear a standout customer story where a large Indian enterprise replaced 12 different tools with Site24x7, consolidating everything into a single platform. For him, that moment confirmed the platform’s ability to solve multiple problems at scale.

How to Use Pandas Time Index: A Tutorial with Examples

Time series data is everywhere in modern analytics, from stock prices and sensor readings to web traffic and financial transactions. When working with temporal data in Python, pandas provides powerful tools for handling time-based indexing through its DatetimeIndex functionality. This tutorial will guide you through creating, manipulating, and extracting insights from pandas time indexes with practical examples.

What you missed at OTel Unplugged 2026 in 8 minutes!

OTel Unplugged 2026 was different by design. Held alongside FOSDEM in Brussels, this was an unconference built by the OpenTelemetry community, for the community. No sales pitches. No product demos. Just honest conversations about what’s working, what’s broken, and where OTel needs to go next. In this recap, you’ll hear short interviews and reflections from engineers, maintainers, and practitioners on.

What Is Alert Noise Reduction? Techniques & Tools

Modern IT environments are noisy. The sheer volume of telemetry data generated every second by microservices, hybrid clouds, and containerized applications is extraordinary. For IT Operations, NOC teams, and Site Reliability Engineers (SREs), this data is crucial, but only if it can be acted upon. When it can't, everything becomes background noise.

Sync Your Users Into Icinga Notifications: Introducing the Contacts/Groups API

If you’ve ever onboarded a teammate at 4:57 PM on a Friday (or offboarded one at 4:58 PM…), you know the pain: keeping notification contacts and groups up to date is work. With the Icinga Notifications REST API, you can automate that and avoid drift.

What is Cybersecurity?

Cybersecurity refers to the processes and technology used to protect information technology networks, data, people, servers, endpoint devices and other IT-related systems from cyberattacks. The need for this protection has never been greater. All organizations (in both private and public sectors) now exist in a threat landscape that allows attacks against their IT infrastructure.

How Honeycomb Supercharges OpenTelemetry for AI

It has become common knowledge that the nature of software development has changed as AI-code generation and agent-based features gain adoption. In perhaps a more subtle shift, the fundamentals of software instrumentation are changing too. As OpenTelemetry becomes the standard instrumentation layer across enterprises, with thousands of developers (many from Honeycomb) actively contributing to it, the nature of the telemetry data captured itself is evolving to meet the growing demand for rich context.

The E-Commerce Critical Path Checklist

It’s your site’s huge, annual sale weekend, and your online store’s checkout process went down for 10 minutes. At your conversion rate, that’s $10,000 in lost sales. Thankfully, it came back up after only 10 minutes, but the real issue is that you only found out from customer complaints on social media. You spent months on email marketing and other campaigns driving traffic to this sale, and now those efforts are turning into customer frustration instead of revenue.

Kiro Can Now Reason With Lightrun's Live Runtime Context

AI code generation is fast. Making it reliable requires runtime context. Today, Kiro gains live runtime visibility with the Lightrun MCP. This grounds AI-assisted development in how code actually behaves at runtime. Kiro, the AI coding assistant from the teams at AWS, is built for velocity and intuition. It moves from specification to production with speed and structure, helping teams turn intent into working code. But until now, like every AI coding assistant, Kiro had a major blind spot.

Understanding Lighthouse: First Meaningful Paint

You’re reading an old performance article, and it keeps talking about “First Meaningful Paint.” You search for how to improve it, but every tool gives you different advice. Some don’t mention it at all. What’s going on? Here’s the short answer: First Meaningful Paint is dead. Google deprecated it in Lighthouse 6.0 back in 2020 and removed it completely in Lighthouse 13. If you’re still trying to optimize for FMP, you’re chasing a ghost.

The Human-Centric Stack: Why Logs Are the Great Equalizer in the Age of AI

In 2026, we are seeing incredible feats of engineering with agentic AI, impacting metrics and distributed traces that map thousands of microservices. Our systems have never been more intelligent and complex. However, as our observability becomes more intelligent, fewer employees know how to manage and troubleshoot complex systems. These employees, who often bear the brunt of an error’s impact, may need to rely on specialists to interpret the system.

Custom Dashboard Creation: Step-by-Step Tutorial

Creating a custom dashboard is the best way to monitor metrics that matter most to your systems. Tools like MetricFire make this process straightforward by combining hosted Grafana and Graphite, eliminating the need for self-hosted solutions. Here's how you can build dashboards tailored to your needs.

Grafana dashboards as code: How to manage your dashboards with Git

Note: This blog post originally published in May 2025 and was updated in February 2026 to reflect that Git Sync is now available in public preview in Grafana Cloud. As your Grafana instance scales, so does the challenge of maintaining dashboards. Managing dozens—or hundreds—of dashboards through the UI alone can quickly become overwhelming. Tracking changes gets murky, dashboards multiply, and consistency suffers.

Add skills to agents: Use Assistant playbooks for faster answers, investigations

Grafana Assistant is the most general-purpose tool we’ve delivered since dashboards. People use our Grafana Cloud LLM to understand unfamiliar areas of their stacks, generate dashboards and beautiful visualizations out of thin air, build queries, and support investigations.

Beyond a Billion Spans: Using Highlights for High-Speed Root Cause Analysis at Scale

In late 2025, we introduced Trace Highlight Comparison. This capability was designed to solve the problem of having too many spans, which creates technical and financial challenges when identifying performance patterns within high-volume telemetry streams. The goal is to avoid massive indexing costs and eliminate the ingestion latency associated with indexing every record. However, knowing these trends is only half the battle.

Top 9 Observability Tools for AI-Assisted Development & Deployment

AI-assisted development is rapidly becoming the default way software is built. Code generation, AI copilots, agentic pull requests, and automated refactoring are now embedded directly into engineering workflows. While this shift dramatically increases delivery speed, it also introduces a new operational reality: production systems are changing faster than humans can fully reason about them. This is where observability becomes mission-critical.
Sponsored Post

Why Every MSP Needs Centralized SaaS Monitoring

Your monitoring stack catches server failures, network issues, and application crashes. But what happens when Microsoft Teams goes down across half your client base at 3 AM? Your on-call tech gets bombarded with alerts that all trace back to one root cause they can't see. This is the MSP blind spot: third-party SaaS dependencies that sit outside your monitoring perimeter but directly impact your SLAs.

Observability trends for 2026 (Part 2): GenAI and OpenTelemetry reshape the landscape

Over the course of my 20 years as a developer, SRE, and now observability product leader, software has typically progressed at a good pace. But now, the emergence of two transformative technologies is fundamentally reshaping enterprise observability: generative AI (GenAI) and OpenTelemetry (OTel). We surveyed over 500 IT decision-makers for a new report: The Landscape of Observability in 2026: Balancing Cost and Innovation.

AI Agent Governance: How to Keep Agentic ITOps Workflows Safe

The future of ITOps automation is better control over what AI agents can see, share, and do. AI automation in ITOps is expected to resolve incidents, reduce operational load, and operate with limited human involvement. Those outcomes depend on systems that can take action, not just surface insight. Agentic AI enables that shift. AI agents can correlate signals across tools, update tickets, trigger remediation, and coordinate workflows without waiting for instruction.

ISO 27K Without the Bloat: An Open Source Approach

It’s often framed as an enterprise-only exercise: long timelines, expensive tooling, consultants everywhere, and a lot of compliance work that exists mainly to survive an audit. As a ~40-person, engineering-driven SaaS company, we needed the same level of trust and rigor as much larger organizations — but we weren’t willing to accept shelfware, parallel compliance infrastructure, or controls that only exist on paper. We also didn’t stop at ISO 27001.

Make faster, better product decisions with Datadog Product Analytics

Product managers (PMs) need to make fast, confident decisions about what to build, fix, and improve based on user behavior within their application. But in practice, collecting the user insights they require is rarely straightforward. Recent updates to Datadog Product Analytics address this challenge. Product Analytics adds structure to autocaptured data and makes analysis easier to interpret, reuse, and share, helping PMs move from questions to answers without relying on SQL or engineering.

Surface and remediate runtime posture issues with Workload Protection Findings

Threat detection and runtime posture monitoring are related but different jobs. Security teams already rely on Datadog Workload Protection to detect threats in real time across hosts and containers. But the actions that lead to those detections (file manipulation, process execution, network calls, or kernel activity) can be indicative of compromise or simply of risky behavior—like running compilers in production containers.

Alert Noise Isn't an Accident - It's a Design Decision

In a previous post, The Incident Checklist: Reducing Cognitive Load When It Matters Most, we explored how incidents stop being purely technical problems and become human ones. These are moments where decision-making under pressure and cognitive load matter more than perfect root cause analysis. When systems don’t support people clearly in those moments, teams compensate. They add process. They add people. They add noise. Alerting is one of the most visible places where this shows up.

The Grok-to-AI Evolution: Why Modern SREs Are Moving Beyond Manual Parsing

Grok structures logs. Context engineering connects systems. AI explains behavior. For years, Grok patterns have been the workhorse of the SRE world. Built on regular expressions, Grok helps teams extract structure from unstructured logs. As we explored in "Do You Grok It?", Grok is the key to turning messy log lines into usable fields. It's why our Grok Pattern Reference remains one of our most-visited resources — SREs are hungry for structure.
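Because Grok is built on regular expressions, the underlying idea can be sketched with a plain regex that uses named groups. The log format and field names below are hypothetical and chosen for illustration. A Grok expression along the lines of `%{IP:client_ip} %{WORD:method} %{URIPATH:path} %{NUMBER:status} %{NUMBER:duration_ms}ms` would express roughly the same intent.

```python
import re
from typing import Dict, Optional

# Each named group plays the role of a Grok field name.
LOG_PATTERN = re.compile(
    r"(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) "
    r"(?P<method>[A-Z]+) "
    r"(?P<path>\S+) "
    r"(?P<status>\d{3}) "
    r"(?P<duration_ms>\d+)ms"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Extract structured fields from a log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Hypothetical access-log line.
    print(parse_line("10.0.0.7 GET /checkout 502 1843ms"))
```

All extracted values are strings; converting `status` and `duration_ms` to numbers would be a separate step.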

Observing agentic AI workflows with Grafana Cloud, OpenTelemetry, and the OpenAI Agents SDK

As agentic AI applications are used more broadly in production, they introduce new operational models, combining multi-step reasoning, tool execution, and autonomous decision-making into a single workflow. SRE teams need visibility into how these agents behave, where they fail, and how they perform over time.

Monitoring Sprawl: Why IT Teams Still Can't Get Actionable Insight Fast

IT teams collect extensive monitoring data but struggle to turn it into fast, confident decisions during incidents. Most IT leaders aren’t worried about whether their environments are monitored—they’re worried about whether their teams can make sense of what they’re seeing quickly enough to actually resolve issues. When something breaks, the problem usually isn’t finding data. Dashboards show activity, alerts indicate changes, and logs capture events across the entire stack.

How to Enhance Service Management for Small Firms

Small firms juggle many tasks at once. They serve clients while managing budgets and staff. Most owners spend their days putting out fires instead of building better systems. Poor service management drains resources fast. Client requests get lost in email threads. Team members use different tools for the same tasks. Bills slip through the cracks. These problems cost money that small businesses can't afford to lose.

January 2026 Early Warning Signals

January 2026 saw a wave of high-impact service disruptions across social platforms, telecom providers, developer tools, education services, and streaming apps. In several cases, StatusGator detected problems minutes or even hours before providers publicly acknowledged them, and in many cases, providers never acknowledged them at all. Unfortunately, many providers still do not have public status pages, leaving users with little visibility into what is happening during an outage.

OpenTelemetry Instrumentation Best Practices for Microservices Observability

OpenTelemetry instrumentation is the foundation of modern microservices observability, but getting it right in production requires more than just enabling auto-instrumentation. This guide covers production-tested OpenTelemetry best practices that help engineering teams achieve reliable distributed tracing, control observability costs, and extract maximum value from their telemetry data.

30+ Top Observability Tools to Monitor Websites and Applications [2026 Updated]

By incorporating observability tools into your stack, you can better understand how your complex infrastructure operates, reduce downtime, and empower developers to identify and fix problems quickly. However, it now takes considerably more work, time, and money to build the best observability tools for your infrastructure and applications. According to a Splunk survey, over half of the firms polled employ eight or more observability tools.

Protect agentic AI applications with Datadog AI Guard

Organizations are increasingly using agentic AI applications powered by large language models (LLMs) to automate analysis, decision-making, and operational workflows. As these AI agents take on more responsibility, they gain access to internal tools and services and can interact with them in unintended ways.

How to optimize JavaScript code with CSS

When to use JavaScript or CSS in frontend projects is a matter of continued debate among many frontend developers. JavaScript is often the default choice for frontend development, as it offers a robust collection of libraries custom-made for creating advanced UI features, such as data-based visualizations or complex animations. But JavaScript also comes with tradeoffs, particularly when it comes to performance, accessibility, and code complexity.

Trace Google Pub/Sub workloads in Cloud Run with Datadog

Event-driven systems are great at decoupling services, but they also make incidents harder to untangle. A single user request can turn into dozens (or thousands) of messages, multiple consumers, retries, and delayed acknowledgments. If your tracing only tells you that a message was sent or received, you still have to guess which upstream request produced the message, whether a batch publish fanned out cleanly, and where queue time is accumulating.

Exponential Smoothing: A Guide to Getting Started

Exponential smoothing is a time series forecasting method that uses an exponentially weighted average of past observations to predict future values. In other words, it assigns greater weight to recent observations than to older ones, allowing the forecast to adapt to changing data trends. In this post, we’ll look at the basics of exponential smoothing, including how it works, its types, and how to implement it in Python.
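As a minimal sketch of the simplest variant, simple exponential smoothing, the function below applies the recurrence s_t = alpha * x_t + (1 - alpha) * s_(t-1), seeded with the first observation. The function name and the seeding choice are assumptions made for illustration; libraries may initialize the series differently.

```python
from typing import List, Sequence


def simple_exponential_smoothing(values: Sequence[float], alpha: float) -> List[float]:
    """Return the smoothed series for `values`.

    A larger alpha gives more weight to recent observations. The first
    smoothed value is set to the first observation (an assumption).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    if not values:
        return []
    smoothed = [float(values[0])]
    for x in values[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


if __name__ == "__main__":
    series = [10, 20, 30]
    result = simple_exponential_smoothing(series, alpha=0.5)
    print(result)
    # In this simple variant, the last smoothed value is the one-step-ahead forecast.
    print("next forecast:", result[-1])
```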

How to Implement Distributed Tracing in Microservices with OpenTelemetry Auto-Instrumentation

This guide shows you how to implement OpenTelemetry’s auto-instrumentation for complete distributed tracing across your microservices, from initial setup through production optimization and troubleshooting.

Every CIO is asking the same question: Am I Next?

Every CIO is asking the same question: Am I next? We’ve seen it across cloud providers, carriers, and global platforms—organizations with enormous scale and investment still experience public, business-impacting outages. The risk isn’t lack of effort. It’s the growing gap between AI-driven complexity and the ability to see, understand, and resolve issues fast enough to protect availability commitments.

How Prometheus Remote Write v2 can help cut network egress costs by as much as 50%

Back in 2021, Grafana Labs CTO Tom Wilkie (then VP of Products) spoke at PromCON about the need for improvements in Prometheus' remote write capabilities. “We use between 10 and 2 bytes per sample to send via remote write, and Prometheus only uses 1 or 2 bytes per sample on the local disk so there’s big, big room for improvement,” Wilkie said at the time.

Grafana Assistant: Why you can trust our agent, and yourself, in an era of AI hallucinations

Let’s be real: AI can hallucinate. And in observability, that feels risky. No one wants an assistant that sends your SREs chasing ghosts. At best, that burns expensive engineering time. At worst, it slows incident response in production and pushes teams toward the wrong remediation path. So here’s the big question: What makes Grafana Assistant different, and why should you trust it? Let’s start by acknowledging the fear. AI hallucinations are a real issue.

Elastic 9.3: Chat with your data, build custom AI agents, automate everything

Today, we are pleased to announce the general availability of Elastic 9.3 as the latest version of the Elasticsearch Platform — the world’s most popular open source platform for transforming both structured and unstructured data into trusted answers and outcomes. In addition to including new features that help developers with context engineering and agent building, Elastic 9.3 introduces a broad set of new capabilities to Elastic Search & AI, Elastic Observability, and Elastic Security.

You Need an Advisor. Not an AI Assistant.

Complex environments don’t fail because teams lack data. They fail when teams can’t trust what the data is telling them. There are too many signals, too little time, and too much risk riding on every decision. That’s the reality Skylar Advisor is built for: delivering guidance teams can verify, so they can act faster without gambling on opaque, black-box answers.

Tool Consolidation Is Dead. Long Live Agentic AI.

It’s 2026, and developers have more tools at their disposal than at any point in the industry’s history: CI/CD platforms are richer; observability stacks are deeper; security, data, and AI tooling have exploded into crowded, competitive ecosystems. And yet, delivery is still slow, incidents are still noisy, workflows are still brittle. The problem is no longer tool scarcity or feature depth. It’s integration debt.

Are We Letting AI Think for Us? | SolarWinds TechPod #105

We’re more dependent on technology than ever—and AI is changing how we make decisions. But what happens when the systems fail? Or when bad actors decide to “pull the plug”? This clip dives into a scary but necessary question: Are we losing our ability to critically think and problem-solve by relying too much on AI? Is AI leveling the playing field—or quietly taking over human decision-making? A must-watch conversation about innovation, outages, AI risk, and why having a backup plan matters more than ever.

How does Coralogix go beyond basic migration?

When a team, division or organization is assessing a new vendor, there are some basic questions that must be answered. At Coralogix, we look at migrations in a different way. It isn’t about transporting the current state of play into a new vendor, often called a “lift and shift”. These are the basics. There is a whole new level of onboarding and support that doesn’t just replicate value across platforms – it expands it.

Andy Wojnarek Appointed Chief Technology Officer

ATS Group and Galileo are pleased to announce the appointment of Andy Wojnarek as Chief Technology Officer. Andy’s appointment reflects the evolution of a technical leadership role he has developed over more than 16 years with the company, grounded in hands-on expertise, cross-functional influence, and a sustained focus on solving complex infrastructure and observability challenges for clients.

Observability vs Monitoring: Getting a Full Picture of the Environment

Driving down the highway, you usually glance intermittently at your speedometer to ensure that you stay within the speed limit, or whatever window above the speed limit you’re willing to drive. While monitoring your speed mitigates the risk of a ticket, you still need to look out for various threats on the road, like cars going through stop signs. By observing your surroundings, you take in real-time information that can help prevent a crash.

"Not Having Kentik Is Unacceptable": 5 Service Providers on Kentik

Five customers explain why Kentik is essential for understanding traffic, controlling network cost, and planning for growth. Hear from Sorin Esanu (Race Communications), Michael Leclaire (MetroNet), John Lubeck (Midco), Wallace Lee (Imperva), and Everett Sinclair (Conway Corporation) on how they use Kentik to see “the bits on the wire,” dig deeper than traditional reporting tools, and turn network data into better customer experiences.

3 Service Providers on Kentik AI Advisor: Faster Answers, Faster Fixes

Three service provider customers share how Kentik AI Advisor helps them move faster, troubleshoot smarter, and put network data in more hands across the team. Hear from Everett Sinclair (Conway Corporation), Michael Leclaire (MetroNet), and John Lubeck (Midco) on why they chose Kentik to unify flow analytics, baselining, and anomaly detection in one platform, and how Kentik AI features make it easier to explore, explain, and act on what’s happening in their networks.

Introducing WHOIS History & Monitoring and Phishing Sentinel: Complete Brand Protection for Your DNS Infrastructure

DNS Spy now offers complete brand protection with WHOIS History & Monitoring and Phishing Sentinel—automatically tracking domain registration changes and detecting phishing variants before they become security incidents.

How to Build AI-Powered Search with Elasticsearch [2 Min Live Demo]

In this demo, we show how Elasticsearch enables production-ready GenAI and AI-powered search applications, from indexing and embedding your data to grounding large language models with RAG. You’ll see how developers can go from raw data to a fully functional GenAI search experience, fast.

Watching everything is watching nothing: Sampling strategy for Sentry

In a high-traffic production environment, telemetry is your most direct link to the user experience. Every Span, Trace, Log, and Replay sent to Sentry gives you high-fidelity visibility into what is actually happening in production. But to extract the most value out of that visibility, you have to know how to filter signal from noise.

How Okta keeps 99.99 percent uptime with #datadog

How do you maintain 99.99 percent uptime across thousands of Kubernetes hosts and multiple cloud providers? Okta engineers explain why observability is critical to keeping authentication and authorization services running at scale. Watch how Okta uses Datadog to bring metrics, logs, and traces into a single view, speed up root cause analysis, and reduce time to mitigation while controlling costs.

10 Benefits of Remote Network Monitoring (RMON)

The rise of hybrid work has fundamentally changed where IT problems occur. Five years ago, most network issues happened in your data center or office network (infrastructure you could access, control, and troubleshoot directly). Today, the majority of critical issues occur in home offices, coffee shops, and remote locations where you have zero infrastructure access and limited visibility.

Top 10 SSL Monitoring Tools

SSL failures don’t usually break a site all at once. A certificate expires, a chain changes, or a browser update tightens rules, and users start seeing warnings before teams notice. By the time alerts fire, trust has already taken a hit. This post reviews SSL monitoring tools from an operational standpoint: how they detect upcoming expirations, validate certificate chains, and surface issues across environments and domains.

How to Find and Fix SEO Errors on Your Website with Ahrefs

Ahrefs Site Audit helps you look at your website the way a search engine does. The tool crawls your site's main pages to examine how they link to one another, how they load, and how they render, then produces a comprehensive report of SEO problems that shows which issues need immediate attention.