Operations | Monitoring | ITSM | DevOps | Cloud

Sponsored Post

7 Downdetector Alternatives

Downdetector is one of the best-known outage-tracking platforms, but its consumer-first approach has limitations for technical teams. Its reliance on user-submitted incident reports makes it prone to noise, false positives, and incomplete coverage of B2B and cloud-specific services. That's why we're exploring the best Downdetector alternatives available today, and highlighting which ones work best for businesses.

Sponsored Post

Innovating Security with Managed Detection & Response (MDR) and ChaosSearch

Managed Detection and Response (MDR) services occupy an important niche in the cybersecurity industry, supporting SMBs and enterprise organizations with managed security monitoring and threat detection, proactive threat hunting, and incident response capabilities. In this week's blog, we're taking a closer look at the role of MDRs in cybersecurity, the biggest challenges they face, and how integrating ChaosSearch is helping MDRs manage complexity, reduce data retention costs, and enable long-term security analytics use cases that are critical for customer success.

Reducing Alert Fatigue in Microsoft SCOM

Alert fatigue is one of the most common challenges organizations face when using Microsoft System Center Operations Manager (SCOM). The sheer volume of notifications from servers, applications, network devices, and cloud services can overwhelm IT teams, making it difficult to distinguish between critical incidents and low-priority events.

Docker Daemon Logs: How to Find, Read, and Use Them

Sometimes Docker behaves in ways that catch you off guard—containers don’t start as expected, images pause during pull, or networking takes longer than usual to respond. In those moments, the Docker daemon logs are your best reference point. These logs capture exactly what the Docker engine is doing at any given time. They give you a running account of system state, performance signals, and events that help you understand what’s happening beneath the surface.
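
As a rough illustration of where such logs can be read, here is a minimal Python sketch. It assumes a Linux host where the daemon runs as the systemd unit `docker.service`; other platforms keep these logs elsewhere. The helper names `daemon_log_command` and `read_daemon_logs` are hypothetical, chosen only for this example.

```python
import subprocess


def daemon_log_command(since="1 hour ago", unit="docker.service"):
    """Build a journalctl command that reads the Docker daemon's journal entries.

    Assumption: the daemon is managed by systemd under the given unit name.
    """
    return ["journalctl", "--no-pager", "-u", unit, "--since", since]


def read_daemon_logs(since="1 hour ago"):
    """Run the command and return the log text (needs a systemd host to be useful)."""
    result = subprocess.run(
        daemon_log_command(since),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is left for the caller to inspect
    )
    return result.stdout


if __name__ == "__main__":
    # Show the command that would be run, without executing it.
    print(" ".join(daemon_log_command()))
```

Narrowing the window with `--since` keeps the output manageable when you already know roughly when a container failed to start.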

Datadog Feature Flags, track Claude costs, migrate historical logs, and more | This Month in Datadog

See how you can reduce risk during feature rollouts in September’s This Month in Datadog. In this episode, we spotlight Datadog Feature Flags, which combines advanced targeting with built-in observability and guardrails to make rollouts safer and more controlled. Plus, we cover tracking Claude costs, migrating historical logs, and more. This Month in Datadog brings you the latest updates on our newest product features, announcements, resources, and events.

From Logs to Insights: Accelerate Customer-Impact Analysis with Datadog Sheets

Datadog Sheets helps you move from log exploration to actionable insights quickly and with no code required. In this demo, see how to enrich logs with Salesforce data, build pivot tables, uncover customer impact trends, and build shareable reporting, all within Datadog.

Kubernetes monitoring 101: Best practices to kickstart your journey

Use this guide to build a solid observability foundation without getting overwhelmed, and to get started with best practices for practical Kubernetes management. Starting your Kubernetes journey can feel like diving into the deep end: with hundreds of metrics, endless logs, and a growing list of tools, it's easy to lose focus. But here's the good news: you don't need to monitor everything from day one. Instead, start small.

5 Tools for Monitoring WebSocket Connections in Real Time

What if your app, website, or online platform suddenly starts crashing? Users cannot connect to the application, nothing is loading, and complaints start coming in. You contact your developer. They check the backend components, such as the API, server, and databases, and everything seems fine. So, what is the real problem here? In many real-time applications, the issue lies one layer deeper, in a place most people overlook: WebSocket connections.

Node.js Monitoring in Serverless Environments - A Complete Guide

Serverless computing with Node.js is transforming how applications are built and scaled by removing the need to manage servers. However, serverless functions run for short durations and scale dynamically, making traditional monitoring ineffective. Effective monitoring is essential to track performance, detect errors, optimize cold starts, and control costs.

How to boost observability ROI with continuous profiling and Grafana Drilldown

For the longest time, observability was centered around logs, metrics, and traces, but the growth of more complex systems has made continuous profiling another essential part of maintaining healthy systems. It provides insights into resource usage and latency down to the code level, delivering key insights to improve performance.

Ship features faster and safer with Datadog Feature Flags

Releasing new features is one of the highest-stakes moments in the software delivery life cycle. Even with CI/CD pipelines in place, plenty of things can still go wrong when a feature goes live for actual users. Most feature flagging tools operate in isolation from important observability tooling, forcing engineers to monitor changes across multiple disconnected systems to fully understand their impact. This slows down development and increases the chance of missing critical issues.

What's New in InfluxDB 3.5: Explorer Dashboards, Cache Querying, and Expanded Control

InfluxDB 3.5 is now available for both Core and Enterprise, along with updates to the new Explorer UI that make it easier to save, organize, and query your data. This release highlights the biggest updates since our 3.4 release, including Explorer Dashboards in beta, new cache querying capabilities, and stronger operational tools for managing clusters. InfluxDB 3 Core is free and open source, optimized for recent data, and licensed under MIT and Apache 2.

Paving the way for a new era: Mezmo's Active Telemetry

The world of software development has fundamentally changed. We've moved from monthly releases to continuous delivery measured in minutes, and the rise of AI means velocity is no longer just a goal—it's a requirement for survival. But this relentless speed has exposed a critical flaw in how we approach observability. The industry relies on a "store first, ask questions later" model where you collect every log, metric, and trace, and then hope to find the root cause when something breaks.

The Compliance Shortcut: Automation as the New Operating System for Resilience

For years, compliance has been synonymous with checklists, manual reporting, and time-consuming audits. That definition no longer holds. In our September 2025 webinar, Patrick Hubbard, Technical Marketing Director, led a conversation with JB Baker, Vice President of Product Engineering, and Marc Jensen, Channel Sales Engineer. Together, they showed how automation is transforming compliance into something far more strategic: the foundation of modern resilience.

How to Boost Revenue and Cut Network Spending with Kentik Traffic Costs

Network operators across the digital ecosystem are under pressure to cut costs while protecting revenue. This post explores three practical use cases where Kentik Traffic Costs helps turn traffic insight into commercial intelligence that helps teams negotiate smarter, protect margins, and boost profitability.

What Is RabbitMQ And How Do You Manage It With Kubernetes?

The world of Kubernetes and RabbitMQ evolves rapidly. Our popular 2022 post laid the groundwork for HA deployments; now, join us for the crucial 2025 update to ensure your architecture remains cutting-edge. Organizations continue their powerful shift from monolithic architecture (where all the code building the application exists as a single, monolithic entity) to microservices architecture.

Your infrastructure is more distributed than you think.

An eCommerce platform, a banking app, even a simple user portal depends on a web of APIs, cloud tools, hosting services, and edge networks. Each one introduces another potential point of failure. And when those dependencies break? User experience suffers. Brand trust takes a hit. Millions in revenue are at risk. That’s why leading digital businesses, especially in eCommerce and banking, are expanding visibility beyond the application stack.

Resolve website transaction bottlenecks faster with Step Summary and Step Performance Reports

Ever wondered why some steps on your website feel slower than others? In this video, we’ll show you how to spot slow logins, delayed checkouts, and page load issues, and how to pinpoint their causes so you can fix them fast using the Step Summary and Step Performance reports. You’ll learn how to access these reports, what insights they provide, and how they help you quickly pinpoint performance bottlenecks to ensure a seamless user experience.

Build on Your Microsoft SCOM Foundation

Enterprises that rely on Microsoft System Center Operations Manager (SCOM) as their monitoring backbone often share an everyday reality: the bigger the environment, the bigger the challenges. Noisy alert storms can bury critical issues. Management Packs (MPs) require ongoing care and expertise to deliver accurate insights. And without consistent reporting, teams risk slipping into reactive fire-fighting instead of strategic monitoring.

13 Best Windows Monitoring Tools in 2025

It’s 2 AM, and your phone buzzes with an urgent alert—your primary server application is down, and users are flooding the support channels with complaints. As you dive into the logs, the cause is elusive, buried somewhere in the sea of system events. Is it a rogue service eating up memory? A failing disk? Or a network bottleneck? Without powerful Windows monitoring tools, you’re left troubleshooting in the dark.

Best Web Transaction Monitoring Tools in 2025

Websites are no longer static pages. They’re dynamic, transaction-heavy ecosystems where every click, form submission, and login matters. Whether you’re in e-commerce, SaaS, or finance, transaction failures can lead to revenue loss, frustrated customers, and even damage to your brand. That’s where web transaction monitoring tools come in — a critical component to make sure every interaction goes smoothly.

What is AI-Native Monitoring? The Complete Guide for Developers

Before we talk about AI-native monitoring, let’s take a quick step back to make sure everyone is on the same page. In software engineering, monitoring is the continuous collection and analysis of data about a system’s health, performance, and behavior. Tools like Scout Monitoring, Datadog, and New Relic traditionally track server uptime, request latency, error rates, and database performance.

Top 11 Java APM Tools: A Comprehensive Comparison

Are your Java applications running at their optimal performance, or is there room for improvement to make them faster and more efficient? With so many services depending on Java, keeping applications responsive and reliable is a core part of modern software engineering. This blog walks you through the leading Java Application Performance Monitoring (APM) tools, with a clear comparison to help you choose the right option for your needs.

Telemetry Now Teaser: "Turning Network Telemetry Into Financial Insight"

Network operators prioritize cost, performance, security, and reliability as their core foundational needs. But how do they get the economic data to make tradeoffs when one of these needs suffers? Tune into the latest Telemetry Now with special guest Lauren Basile to learn how Kentik Traffic Costs is providing data-backed answers to these questions.

How We Built VictoriaLogs Cluster: A CTO's

Go behind the scenes with the VictoriaMetrics team! In this special talk, Marc Sherwood is joined by our CTO, Alexander Marshalov, to explore our powerful, open-source logging solution, VictoriaLogs. This isn't just a feature showcase. This is a deep dive into the engineering mindset that drives our development. Alexander shares firsthand insights into why we built VictoriaLogs Cluster, the technical challenges of creating a distributed system for logs, and the core principles of simplicity and efficiency that guide our architecture.

How GenAI Is Empowering Elastic Workforce

With over 10,000 questions answered and a 99% satisfaction rate in just 90 days, ElasticGPT, our internal generative AI assistant built on Elastic’s Search AI Platform, is transforming how our teams find information, make decisions, and complete day-to-day tasks. Matt Minetola, CIO, explains how ElasticGPT helps employees access company knowledge faster using natural language queries. Learn how we’re using retrieval augmented generation (RAG) and a secure, scalable architecture to deliver trusted, real-time AI experiences across the organization.

The telemetry time bomb - and what to do about it

Telemetry data is growing at an average of 29% a year — doubling costs every 18 months. That’s putting pressure on ITOps budgets, observability platforms, SecOps teams, and SIEM deployments alike. In this post, we’ll explore how unchecked data volumes, siloed tools, and aging architectures are creating a telemetry cost crunch that limits visibility, slows both troubleshooting and threat detection, and impacts business outcomes.

Model your architecture with custom entities in the Datadog Software Catalog

Every software organization has its own unique architecture and workflows. Beyond services and APIs, teams rely on internal libraries, CI/CD jobs, data pipelines, AI agents, and more to keep systems running smoothly. But as architectures grow more complex and interconnected, it can become difficult to keep track of all the structural dependencies and interactions in one place.

Why Does Your Node.js App Crash in Production and How Can You Fix it?

Node.js has become one of the most popular platforms for building scalable and high-performance web applications. Its event-driven, non-blocking I/O model allows developers to efficiently handle thousands of concurrent connections with minimal overhead. However, many businesses still face a critical challenge: Node.js applications often crash unexpectedly in production environments, causing downtime, lost revenue, and damage to brand reputation.

Say hello to one-click call quality and SBC data correlation for Teams Phone

For IT teams managing Teams Phone performance, the SBC is kind of a ‘last frontier’. Issues that occur there are hard to track down. Critical information slips out of view or can’t easily be associated with what users are experiencing. Our latest update to Vantage DX closes the Microsoft Teams Phone SBC monitoring visibility gap with automated correlation of SBC records and Teams Phone quality data — the first solution in the industry to do so.

OpenMetrics vs OpenTelemetry - A guide on understanding these two specifications

OpenMetrics and OpenTelemetry are popular standards for instrumenting cloud-native applications. Both projects are part of the Cloud Native Computing Foundation (CNCF) and aim to simplify how we generate, collect and monitor services in a modern cloud-native distributed application environment. Let's have a look at how both the standards are aiming to help solve the observability conundrum.

Grafana Campfire - UI Extensions: Enabling Cross-App Workflows (Grafana Community Call - Sept 2025)

In this upcoming Grafana Campfire Community Call, we will discuss Grafana UI Extensions and how the framework enables plugins to interoperate by adding links, components, or functions into defined places (extension points) in Grafana. Topics include, but are not limited to: what UI Extensions are and where to find the resources; how they can be leveraged to deliver fun new features; and adding custom actions, links, and components to various parts of the UI.

How to Become an SRE Engineer

Site Reliability Engineering has emerged as one of the most sought-after careers in tech, combining software engineering expertise with operational excellence. SRE engineers ensure that critical systems remain reliable, scalable, and performant while enabling rapid feature development. With the global SRE job market projected to grow by over 25% in 2025, skilled professionals in this field command competitive salaries and enjoy diverse career opportunities across industries.

Introducing Catchpoint Session Replay: See Digital Experience Through Your Users' Eyes

When was the last time you really saw what your customers experience on your site? We're excited to introduce Session Replay, a new capability in our Internet Performance Monitoring (IPM) platform that lets you step directly into the user's journey. Session Replay is so much more than a platform upgrade. It’s an opportunity to understand, fix, and even prevent the issues that lead to churn, missed conversions, and frustrated users, all from their point of view.

Synthetic Monitoring from Multiple Locations: Where to Run Tests (and Why It Matters)

Most organizations think of monitoring as a checkbox: set it up once, confirm that it runs, and move on. If the tool says the website is “up,” then the job is done, right? Not quite. The truth is that where you run synthetic monitoring tests from can be just as important as the tests themselves. Synthetic monitoring works by simulating user actions from pre-defined probes or agents. Those probes might live in a cloud data center, a mobile network, or even inside a corporate office.

Harnessing AppNeta's Browser- and HTTP-based Workflows to Track User Experience

These days, maintaining uptime of your servers and other infrastructure elements remains as critical as ever—but it’s not enough. Quite simply, even the best server reliability metrics won’t mean a thing if the user experience is poor. What truly matters is understanding the service levels your users experience, whether they’re accessing apps through a web browser or interacting with API-based services.

Defining the Network Engineer of Tomorrow

A little while ago, I wrote a piece with the provocative title, "The End of the Network Engineer as We Know It?" It struck a chord because it articulated a shift many of us feel in our bones: the ground is moving beneath our feet. The traditional, well-defined corporate network has dissolved into a sprawling, borderless ecosystem of public clouds, SaaS platforms, and the vast, untamed internet. The old role, focused on the care and feeding of devices within our four walls, is no longer sufficient.

Grafana Labs Co-founder Woods: Market maturity, OpenTelemetry, and AI are reshaping observability

As organizations navigate increasingly complex tech environments, unified observability practices have become essential. That was one of the main takeaways from Grafana Labs Co-founder Anthony Woods’ recent appearance on “Tech Keys by Mercari India,” a podcast hosted by Vaibhav Khurana, Head of Platform Engineering at Mercari India.

Lighting up your dashboards: How to visualize the CheerLights IoT project in Grafana Cloud

I recently joined the Developer Advocacy team here at Grafana Labs, and have been exploring ways to accelerate my Grafana learning journey. Like many others in the Grafana community, my introduction to the open source project happened when I needed a way to easily visualize data that resided in external databases, mostly using SQL queries.

Monitor Kubernetes Hosts with OpenTelemetry

It’s 3 AM. API latency just spiked from 200ms to 2s. Alerts are firing, and users are frustrated. You SSH into the first server: top, free -h, iostat — nothing unusual. On to the next host. And the next. That’s how most of us learned to debug. The tools worked, and we got good at using them. But as infrastructure became distributed and dynamic, this approach started to break down. Modern monitoring needs more than SSH and top. It needs unified telemetry.

How SSL Certificate Monitoring Prevents Man-in-the-Middle Attacks

Man-in-the-Middle (MITM) attacks remain one of the most dangerous cybersecurity threats. In these attacks, hackers secretly intercept and sometimes alter communication between two parties. Without proper encryption, sensitive data such as passwords, credit card details, and personal information becomes exposed. SSL/TLS certificates encrypt this communication, preventing unauthorized access. However, certificates can expire, become misconfigured, or become compromised, creating security gaps.
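
One piece of this, catching certificates before they expire, can be sketched in a few lines of Python. This is a minimal illustration, not a full monitoring setup: it assumes you already have the certificate's `notAfter` string (for example, from the dictionary returned by `ssl.SSLSocket.getpeercert()`), and the function names and the 30-day threshold are hypothetical choices for this example.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return whole days left before a certificate's notAfter timestamp.

    not_after uses the format found in getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    A negative result means the certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return int((expires_at - now) // SECONDS_PER_DAY)


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within warn_days (an assumed threshold)."""
    return days_until_expiry(not_after, now) < warn_days
```

Expiry is only one of the gaps mentioned above; misconfiguration and compromise need separate checks that this sketch does not attempt.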

Why Healthcare CIOs Are Becoming Transformation Leaders, Not Just Tech Leaders

The role of the healthcare CIO looks nothing like it did a decade ago. Running the EHR, keeping infrastructure online, and managing vendor contracts are still table stakes, but they’re no longer the whole story. Today’s healthcare CIOs are being asked to do something far bigger: lead enterprise-wide transformation.

Building a VictoriaMetrics PaaS: The What, Why, and "Easier Button" - Tech Talk #9

Ready to tame your monitoring complexity? Join Mathias and Marc for the first episode of our brand-new series dedicated to building a robust, scalable, and user-friendly VictoriaMetrics Platform as a Service (PaaS)! As organizations grow, managing monitoring infrastructure becomes a major challenge. This series provides a practical, step-by-step guide to building your own VictoriaMetrics-based PaaS to reduce developer friction, improve reliability, and save on costs.

Integrations Overview

This video provides a detailed tour of our integrations, including how to set up automated email, SMS, and phone call alerts. Learn how to connect with various trusted tools, tailor your alerts to your team's needs, and pass key data between Uptime.com and your favorite applications. Discover how to add and manage new integrations, create dedicated contacts, and assign integrations to specific checks. We also introduce our Zapier partnership, enabling connections to over 8000 additional services.

How to analyze observability and monitoring tools for actionability

Choosing the right observability tools is critical to ensuring your teams get actionable insights. In this video, we explore how to evaluate observability platforms based on their ability to detect anomalies, link causes, and trigger effective responses.

Visualize Jenkins CI/CD Pipelines: Introducing the New Jenkins Data Source Plugin in Grafana 12.2

Grafana 12.2 introduces the new Jenkins data source plugin, giving you real-time insights into your Jenkins CI/CD pipelines. With easy setup, you can connect your Jenkins instance and explore two built-in dashboards. See how Jenkins data becomes instantly actionable inside Grafana.

Introducing the UptimeRobot v3 API

As you may have noticed, we released the latest version of the UptimeRobot API a few weeks ago. Don’t worry, the v2 will remain available; however, it will no longer receive support or updates. New features will be added only to v3. Built on a RESTful architecture, v3 unlocks more flexibility, cleaner workflows, and expanded capabilities for developers who want tighter control over their monitoring. Below, we’ll highlight what’s new and how it compares to the legacy v2 API.

Infinity Data Source Now Supports Auth for Actions | Grafana 12.2

Grafana 12.2 introduces actions authentication with the Infinity data source — giving you more secure and flexible ways to trigger actions. Previously, actions were limited to browser-based HTTP requests subject to CORS. Now you can choose between browser requests or Infinity connections, leveraging preconfigured authentication settings. This update makes actions more powerful and reliable in Grafana 12.2.

Securing the Future: Responsible AI on AWS with Sumo Logic -- Customer Brown Bag -- Sept 25th, 2025

This session with Moumita Saha, Sr. Security Partner SA – WW Consulting Partners, AWS, and Adam White, Sr. Dir. Technical Marketer at Sumo Logic explores how AWS and Sumo Logic partner to deliver practical strategies for securing generative AI applications, ensuring they remain safe, compliant, and trustworthy.

Complete Guide to HAProxy Visibility Using Promtail and Loki

HAProxy is the workhorse in front of countless APIs and apps because it’s fast, lean, and flexible. Because it sits on the traffic hot path, it’s also your earliest warning system when something slows down or breaks entirely. This means that monitoring it isn’t optional. You need to see connection queues and retries, per-stage timings, health-check failures, and spikes in error statuses to catch incidents before users do.

Diagnose slow database queries in Node.js: Why Monitoring is Essential?

Node.js is popular for building scalable applications because its non-blocking architecture can handle many requests at once. But when your app depends on a database, performance hinges on how efficiently queries run behind the scenes. Even a single slow database query can block the Node.js event loop, causing delayed responses, frustrated users, and cascading performance issues. Too often, teams only notice these problems after customers experience lag or timeouts.

How Nexus BMS Uses Time Series and AI to Power Smarter Buildings

Monitoring equipment isn’t enough for today’s smart buildings; true value comes from being able to predict issues, optimize performance, and take action automatically. Traditional building management systems often fall short, limited to dashboards and alarms that only notify you of an issue after the fact. With the rise of open source hardware, modern databases, and AI-driven diagnostics, facilities can now move from reactive to proactive management.

Monitor your data pipelines with Airflow lineage

In complex data pipelines with dozens of jobs and intermediary datasets, it can be difficult to effectively monitor how data travels and changes through various steps. When tracking issues in these pipelines, you need visibility into upstream components where the root cause may originate, as well as downstream datasets and consumers of data that may be experiencing further impacts.

Soft navigations: The future of seamless browsing

In the ever-evolving world of web standards, a new experimental feature is quietly reshaping how browsers perceive navigation: Soft Navigations. While still in the early stages, this concept has the potential to redefine user experience metrics, improve performance monitoring, and better align browsers with the behavior of modern web applications. Let’s dive into what soft navigations are, why they’re important, and how you can start exploring them today.

New Grafana One-Page Report (Public Preview) | Grafana 12.2

Grafana 12.2 introduces a redesigned reporting feature, now in public preview. The new one-page report creation flow replaces the old multi-step wizard, making it easier and more intuitive to schedule and share insights. Check out how the new reporting experience simplifies sharing data in Grafana 12.2.

Grafana 12.2 release: LLM-powered SQL expressions, updates to canvas and table visualizations, simplified reporting, and more

Grafana 12.2 has arrived, delivering new features to help you and your team move from data to decisions faster than ever. Below are just some of the highlights from the latest Grafana release.

LLM Observability in the Wild - Why OpenTelemetry should be the Standard

A few days ago I hosted a live conversation with Pranav, co-founder of Chatwoot, about issues his team was running into with LLM observability. The short version: building, debugging, and improving AI agents in production gets messy fast. There are multiple competing standards for default libraries for LLM observability. And many such libraries, like OpenInference, which claim to be based on OpenTelemetry, don't strictly adhere to its conventions.

Availability Summary Report in Site24x7

Track uptime and downtime at a glance with the Site24x7 Availability Summary Report. In this video, we break down each section of the Availability Summary Report when a single monitor is chosen, including monitor availability, suspension summary, outage details, Mean Time To Repair, Mean Time Between Failures, and location-based metrics. Learn how to use this report to validate downtime, analyze performance trends, and ensure service reliability.

InfluxDB 3 Core: Open Source, Recent-Data Engine

Dive into InfluxDB 3 Core, an open source, high-speed recent-data engine (MIT/Apache 2 licensed). It’s built for real-time monitoring, edge data collection and transformation, sensor alerting, and streaming analytics with simplicity and speed.

Session Replay: Becoming your own digital secret shopper

Retail stores have long relied on a secret weapon to measure and improve the shopping experience: the secret shopper. Posing as ordinary customers, they evaluate the customer experience, spotting friction points like hard-to-find items, gauging the quality of customer service, and testing how seamless the checkout process feels.

Create New Alert Rules Without PromQL Queries in Grafana 12.2 | Metrics Drilldown

Grafana 12.2 makes alert creation simpler by integrating the Metrics Drilldown app with the Alert Rule Query Editor. Instead of writing PromQL from scratch, you can now use a queryless workflow: explore metrics, add labels, and generate queries directly from Drilldown. This helps teams move faster and makes alerting more accessible for those new to PromQL.

CriblCon sneak peek with AlphaSOC

The countdown to CriblCon is on and we’re giving you an exclusive first look at the expert insights, innovative solutions, and success stories you’ll see on the big stage. Join us as we chat with Chris McNab, Founder of AlphaSOC, a security startup that processes network telemetry to uncover infected hosts, emerging threats, and targeted attacks.

Certificate Rotation with Progress-powered Solutions

Don’t let expired certificates put your organization at risk! Progress WhatsUp Gold makes it easy to discover, manage, and automate certificate lifecycles across your network. With powerful automation from Progress Infrastructure solutions, you can rotate and manage certificates without a manual routine to maintain compliance and security. Schedule updates, push certificates to thousands of nodes and maintain governance with built-in traceability. Experience simplicity, scalability and seamless integration with Progress-powered solutions.

Telemetry Now Teaser: "What's the real cost of delivering this traffic?"

Why is it so difficult to answer the age-old question CFOs are asking, "What's the real cost of delivering this traffic?" Complex billing structures and cost modeling are only part of it. Lauren Basile joins Phillip Gervasi to discuss turning network telemetry into financial insight in the latest episode of Telemetry Now.

Top 3 MSP dashboards compared: SquaredUp, BrightGauge and MSPbots

Managed Service Providers (MSPs) live and die by their data. Externally, clients expect clear reporting, fast responses, and visible proof of value. Internally, smooth operations and low overheads are essential to business success. But with so many tools, key data is scattered across multiple systems – PSA, RMM, cloud services, ticketing, monitoring, finance – and that causes blind spots. Dashboards fix this problem by consolidating data into a single view.

You don't need a real outage to find your weak spots.

Modern digital services rely on complex systems, and chaos can strike at any layer. But the most effective teams don’t wait for failure to learn. They simulate it. By introducing controlled performance degradations, you can stress your systems, test your dependencies, and uncover hidden risks without touching production. In our latest webinar, Catchpoint experts walk through how teams are building resilience through proactive, safe failure testing, and why it’s become a cornerstone of digital reliability.

Mute timing vs. silences in Grafana Alerting: How to pick the best fit for your use case

Have you ever been in a situation where you know your team is going to run their weekly maintenance window and you silence your notifications to prevent a flood of false positives from pinging your inbox? If you are associated with a team that uses any type of alert system, you know how easily alert fatigue can happen. The incessant and unpredictable (or even, at times, predictable) pings, emails, and notification alerts can drive even the most serene worker totally batty.

What is SNMP Trap: Real-Time Alerts for Network Monitoring

Why wait for the next poll? An SNMP trap is a real-time alert sent from a device to a monitoring system, without waiting for polling. Ever had a router die silently at 3 AM while your monitoring system was still polling away every 5 minutes? Yeah… not fun. That’s where SNMP traps step in. Think of them as the push notifications of network monitoring: instant, lightweight, and sometimes misunderstood.

Distinct Value Cache in InfluxDB 3

The Distinct Value Cache in InfluxDB 3 speeds up metadata queries and tag value lookups for faster, more responsive UIs. The Distinct Value Cache in InfluxDB 3 delivers sub-30 ms lookups for tag values and series metadata, making exploratory queries and UI dropdowns quick and responsive. By reducing latency on these common operations, it allows developers to build real-time monitoring and analytics tools without extra complexity.

Monitor and optimize your systems with Uptrace

Uptrace is your single source of truth for monitoring, understanding, and optimizing complex distributed systems. Proven in production for over five years and trusted by more than a thousand installations worldwide, it lets you see your system like never before. What makes the difference is that Uptrace is pure OpenTelemetry, built natively from day one. This isn't a translation layer—it's a direct connection that eliminates friction and ensures zero vendor lock-in. Your homepage serves as your command center, providing complete visibility across your stack at a glance.

How to Push Prometheus Metrics to Splunk Observability Cloud with the OpenTelemetry Collector

In this video, you’ll learn how to scrape Prometheus endpoints with the OpenTelemetry Collector’s Prometheus receiver and send metrics to Splunk Observability Cloud. We’ll walk through configuring three common data sources (a Python Flask app, node_exporter for host metrics, and the NGINX Prometheus exporter), show how to enrich metrics with resource attributes, and build simple charts in Splunk Observability Cloud. You’ll see how centralized scraping and consistent tagging make it easy to manage and visualize Prometheus metrics in Splunk Observability Cloud.

An overview of Context Propagation in OpenTelemetry

To effectively manage modern applications, you need to understand how they work on the inside. Distributed tracing is the key to this, providing a detailed picture of a request's journey across every service. OpenTelemetry has emerged as the industry-standard framework for implementing tracing and achieving true observability in complex, distributed systems. In this article, we embark on a journey to explore the core concept of context propagation within OpenTelemetry.
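As a rough illustration of what context propagation carries between services, the sketch below parses and regenerates a W3C Trace Context `traceparent` header, the default propagation format in OpenTelemetry. The function names are hypothetical, only the version `00` layout is handled, and a real service would normally rely on an OpenTelemetry SDK propagator rather than hand-written code like this.

```python
import re
import secrets

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or span IDs are defined as invalid.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def child_traceparent(incoming):
    """Build the header for an outgoing call: keep the trace ID, mint a new span ID."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        # No usable context arrived, so start a new trace (sampled flag set here
        # purely as an assumption for the example).
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

The trace ID stays constant across hops while each hop contributes its own span ID, which is what lets a backend stitch the spans into one request journey.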

Creating a Sustainable Open Source Business Model - Introduction

Open source defies everything you’ve ever heard or learned about business before (author’s quote). Yes, open source software has been around since the 90s, but there’s still little else like it. If anything, as time has gone on, we’ve added adjacent concepts like “open core” and “source available” that have added complexity to a model that isn’t that straightforward to grasp to begin with. VictoriaMetrics is an open source company.

How to Responsibly and Effectively Contribute to Open Source Using AI

With the influx of AI tooling, it’s never been easier to contribute to open source communities. These tools are capable of gathering context quickly, “understanding” repositories faster than ever before. They provide instant summaries about repositories that, previously, would have meant reading lines and lines of code. They can fix bugs in programming languages you don’t know, and ultimately allow more contributors to get involved, which (almost) every open source project wants.

Creating and using a Network Discovery Profile in Site24x7

Learn how to create and use a Discovery Profile in Site24x7 to simplify and automate network device onboarding. In this video, we walk you through setting up discovery parameters, applying filters and thresholds, grouping and tagging devices, configuring alerts, integrating with ITSM and collaboration tools, and scheduling periodic rediscovery. Whether you're managing a single site or multiple customer environments, Discovery Profiles help you.

Kubernetes monitoring explained: Key metrics, labels, and best practices

Monitoring Kubernetes and containers doesn’t have to be overwhelming. In this video, we’ll break down the essential metrics you need to track, why labels are critical for container visibility, and the best practices for Kubernetes monitoring at scale. You’ll learn: How tools like Site24x7 simplify Kubernetes monitoring with auto-discovery, dashboards, anomaly detection, and forecasting. Whether you’re a DevOps engineer, SRE, or developer, this video gives you the practical knowledge to improve container monitoring and observability.

Two Decades of Microsoft SCOM & Monitoring Expertise

In today’s complex IT environments, reliable monitoring isn’t optional — it’s essential. From critical infrastructure in Government & Defense to highly regulated sectors like Healthcare, Energy, and Finance, organizations worldwide trust NiCE to deliver secure, future-ready monitoring solutions.

Memory stall: the agony before OOM

When we set a memory limit for a container, the expectation is simple: if the app leaks memory, the OOM killer steps in, the container dies, Kubernetes restarts it, done. But reality is messier. As a container gets close to its memory limit, allocations don’t just fail instantly. They get slower. The kernel tries to reclaim memory inside the cgroup, and that takes time. Instead of being killed right away, your app just crawls.
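One way to see this kind of slowdown is the kernel's pressure stall information (PSI), which cgroup v2 exposes per cgroup in a `memory.pressure` file. The sketch below is not taken from the article: it assumes cgroup v2 with PSI enabled, and the path, function names, and 10% threshold are illustrative assumptions.

```python
from pathlib import Path

# Assumed path: inside a container on cgroup v2 the container's own cgroup is
# typically mounted at /sys/fs/cgroup. Adjust for your environment.
PRESSURE_FILE = Path("/sys/fs/cgroup/memory.pressure")


def parse_pressure(text):
    """Parse PSI lines like 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {key: float(value) for key, value in (f.split("=", 1) for f in fields)}
    return result


def is_stalling(text, threshold_pct=10.0):
    """True if all tasks were stalled on memory for >= threshold_pct of the last 10 s.

    The threshold is a hypothetical starting point, not a recommended value.
    """
    full = parse_pressure(text).get("full", {})
    return full.get("avg10", 0.0) >= threshold_pct


sample = (
    "some avg10=42.10 avg60=20.05 avg300=5.00 total=123456\n"
    "full avg10=35.50 avg60=15.00 avg300=3.20 total=98765\n"
)
stalled = is_stalling(sample)

# On a live system the same check would read the file instead of a sample:
# stalled = is_stalling(PRESSURE_FILE.read_text())
```

The `full` line counts time during which every non-idle task in the cgroup was waiting on memory, so a high value there suggests the app is crawling rather than being killed.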

Building Real-Time Data Pipelines with Kafka, Telegraf, and InfluxDB 3

When milliseconds matter and data never stops flowing, you need a pipeline that can handle high-velocity streaming data with reliability and scale. The modern streaming stack of Kafka, Telegraf, and InfluxDB 3 Core delivers exactly that. To give you a concrete example, this blog works with a fictitious use case: “Papa Giuseppe’s Pizzeria.” Every oven, prep station, and order in this pizza restaurant generates data. Our workflow looks like this.

Beyond Automation: The Rise of Agentic Networks

Agentic AI is the next evolution in network management, moving beyond simple automation to intelligent systems that can reason, plan, and act autonomously. Justin Ryburn, Kentik Field CTO, highlights how this shift automates expertise, enables proactive problem-solving, and empowers human engineers for strategic innovation.

InfluxDB 3 Enterprise: Deploy Your Way, Scale on Demand

InfluxDB 3 Enterprise is engineered for performance and designed for flexibility, delivering high-scale, production-ready time series data management with operational simplicity. InfluxDB 3 Enterprise is built on a cloud-native, diskless architecture that removes the limits of traditional storage. It’s easy to deploy, scales effortlessly, and eliminates the complexity of managing clusters so you can deploy your way and meet the unique demands of your environment.

Node.js Event Loop: Why Monitoring Matters

Node.js has become a cornerstone for modern application development because of its non-blocking and asynchronous architecture. According to the Stack Overflow Developer Survey, Node.js remains among the most widely used technologies for web applications, powering millions of services globally. While this event-driven model provides scalability and efficiency, it also introduces challenges.

Automate Your Infrastructure Analysis with Scheduled AI Reports

The least exciting part of an operations or SRE role is often the manual, repetitive task of generating reports. It’s the Monday morning scramble to summarize weekly infrastructure health for the team, or the end-of-quarter push to build a capacity planning document. This is boilerplate work that pulls you away from critical engineering tasks. We believe that if a process is repeatable, it should be automated. That’s why we’re introducing Scheduled AI Investigations and Insights.

10 Best Practices for Proactive Database Performance Monitoring to Prevent Downtime

Databases are the core of modern applications, whether it is an e-commerce platform, a banking system, or a social media app. Slow database performance or unexpected downtime can cause serious problems, from lost revenue to poor customer experience. Proactive database performance monitoring helps teams identify issues before they escalate. Unlike reactive monitoring, which only addresses problems after they occur, proactive monitoring ensures your database remains fast, stable, and reliable.

How to perform real-time DNS monitoring in Grafana Cloud

When DNS or domain name server resolution processes fail, or become sluggish, users can experience timeouts, connection errors, and degraded performance — often without clear indication of the root cause. This is where DNS checks in Grafana Cloud Synthetic Monitoring come in, allowing you to proactively monitor domain name resolution, verify that domains resolve to the correct IP address, and even measure how quickly that resolution occurs.

Key APM Metrics You Must Track

Application Performance Monitoring (APM) helps you understand how your software runs in production. When you track the right metrics, you see how requests move through your system, where slowdowns happen, and how resources are being used. With this knowledge, you can spot issues early and keep your applications reliable for your users. In this blog, we discuss the key APM metrics to monitor, grouped into categories, and why each one matters for performance and user experience.

Your Next Observability RFP is All Wrong. Why AI Changes Everything

AI-first observability addresses two of the most pressing troubleshooting challenges: complex IT environments and AI-generated code. But understanding how to implement AI in a way that brings ROI requires cutting through the hype and maintaining realistic expectations while keeping a forward-thinking vision. In this blog post, we bring practical tips for including AI in your next observability RFP. The article is based on a webinar held with Logz.io founders, CEO Tomer Levy and CTO Asaf Yigal.

New Relic's CCU-based pricing is creating unpredictable costs, pushing teams to sample heavily

We talked to 7 companies in August 2025 who were looking to switch from New Relic. One engineering director said they're paying $1,000 a month and only ingesting 10% of their traces. Teams are defaulting to aggressive sampling, some at 1%, others at 10%, to manage costs.

OpenTelemetry and Jaeger | Key Features & Differences [2025]

OpenTelemetry is a broader, vendor-neutral framework for generating and collecting telemetry data (logs, metrics, traces), offering flexible backend integration. Jaeger, on the other hand, is focused on distributed tracing in microservices. Earlier, Jaeger had its own SDKs based on OpenTracing APIs for instrumenting applications, but now Jaeger recommends using OpenTelemetry instrumentation and SDKs. Warning: The original Jaeger client SDKs (based on OpenTracing) are archived and no longer maintained.

Sentry AI code review, now in beta: break production less

This could’ve been prevented. This should have been prevented. This too. We all hate getting tagged in PRs. The time, the blame for when you inevitably miss something, and constant “I wouldn’t have written it that way” feeling is just hard to shake. LLMs promised this would get easier. Promised they would do it for us. But as we’ve seen, we’re not there yet. But this is what Sentry does for a living. We catch bugs… in prod.

ICMP Monitoring: What Is ICMP & How It Works

Ever “pinged” a server and wondered what those milliseconds actually mean? If you’re a network admin or IT pro, you already use ping as a quick sniff test. But ICMP is more than a green checkmark or a scary timeout. In this article, we’ll define ICMP, walk through how echo requests and replies work, and show how to turn basic pings into useful network and ICMP monitoring.
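To make the echo request concrete, here is a minimal sketch that builds an ICMPv4 echo request packet (type 8, code 0) and computes its Internet checksum. It only constructs the bytes: actually sending them needs a raw socket and usually elevated privileges, so that step is left out. The function names, identifier, and payload are illustrative.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by ICMP (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes = b"ping") -> bytes:
    """Return the bytes of an ICMP echo request: type=8, code=0."""
    # The checksum is computed over the whole message with the checksum field zeroed.
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


packet = build_echo_request(identifier=0x1234, sequence=1)
```

An echo reply mirrors the identifier, sequence number, and payload with type 0, and the round-trip time ping reports is simply the gap between sending the request and receiving that matching reply.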

Zooplus Found Faster Root Cause Detection with Elastic Observability

Zooplus Platform Engineering Lead Aram Hakobayan shares how Elastic Observability helps manage 3,000+ microservices and 15,000+ logs/sec across their AWS cloud. Learn how Elastic powers their French market, centralizes monitoring, simplifies root cause analysis, and avoids costly vendor migration. Ideal for DevOps, SREs, and cloud architects scaling fast.

The one where we talk about Cribl Guard

Manual hunts for sensitive data are slow, error-prone, and expensive. Cribl Guard combines advanced AI with a human-in-the-loop control point to spot sensitive data, such as credit card, passport, and Social Security numbers, as it flows through Cribl Stream. Whether you’re fully cloud or hybrid, Cribl Guard puts you firmly in control of every piece of sensitive information that crosses your pipes.

AI in Server Monitoring: Why Human Context Still Matters in 2025

When Microsoft rolled out Windows Server 2025 last November, it marked a turning point in how IT teams think about monitoring. Suddenly, AI-powered features like anomaly detection, predictive resolution, and even self-healing aren’t ideas on a roadmap — they’re built into the very fabric of enterprise infrastructure.

Integrating JMX and OpenTelemetry

The OpenTelemetry community and the contributors to the Java Special Interest Group (SIG) have spent a great deal of time integrating core Java technologies into the project. An integration that is particularly useful is Java Management Extensions (JMX). It has been around since J2SE 5, and has been mature for some time. Many of the most widely used Java applications have adopted it over time and support this extension.

Introducing Request Mirror: a free micro-service to reflect HTTP requests

We have launched Request Mirror, a little free service to reflect HTTP requests. We've also open-sourced it: you can read the code in the ohdearapp/request-mirror.ohdear.app repo on GitHub. In this blog post I'd like to explain why we built it and how you can use it.

Reddit to Reality: Top 7 Omnissa Horizon Performance Issues and Fixes

Slow logons, laggy VDI sessions, and poor Horizon performance are common pain points IT admins face frequently. When Omnissa Horizon environments slow down, both end-users and IT teams feel the pressure—users grow frustrated while admins struggle to troubleshoot without complete visibility. To uncover the real-world Horizon issues admins face, we turned to Reddit forums like r/VMwareHorizon, r/sysadmin, and r/Citrix, where IT professionals openly share their Horizon troubleshooting struggles.

Burndown and burnup: Two charts every engineering dashboard needs

As engineering organizations scale, project visibility becomes a real challenge. Engineering managers lose track of what's actually happening across multiple teams. Executives ask "are we on track?" and get conflicting answers. Status meetings multiply but clarity doesn't improve. The root problem isn't lack of data: modern engineering teams generate tons of project information across JIRA, GitHub, CI/CD pipelines, and project management tools.

How to Connect Jaeger with Your APM

Microservices make it tough to understand how applications behave end-to-end. Most teams already rely on an Application Performance Monitoring (APM) tool to track system health. But as requests move across many services, you also need distributed tracing. Jaeger gives you that visibility. The real value comes from connecting the two. Instead of running APM and Jaeger in silos, you can combine their strengths, metrics from your APM, and traces from Jaeger, to get a clearer view of performance.

Grafana & Friends Stockholm meetup at 0+X

In this talk, we’ll introduce the Kafka Data Source plugin we developed for Grafana, which enables users to query and visualise Kafka topic data directly in their dashboards—without the need for intermediate storage or external services. We'll share how the idea came about, how we collaborated with the Grafana community and developers to bring it to life, and the challenges we faced along the way. We'll also discuss our vision for the plugin’s future and its role in the evolving observability landscape.

From Shadow AI to Strategy: The Six-Month AI Imperative (w/ Charlene Li)

In this very special episode of The DEX Show, we welcome back one of the world’s most influential voices on digital transformation and the future of AI leadership: Charlene Li. Charlene is a bestselling author and trailblazing thinker who has helped leaders navigate disruption for over two decades. She returns to the show for an unmissable conversation on the realities of AI Transformation—and what it means for organizations, leaders, and employees at every level.

OpenTelemetry Exporters - Types and Configuration Steps

In this post, we will talk about OpenTelemetry exporters. OpenTelemetry exporters help in exporting the telemetry data collected by OpenTelemetry. OpenTelemetry frees you from any kind of vendor lock-in by letting you export the collected telemetry data to any backend of your choice. In modern distributed systems, efficiently collecting, transmitting, and analyzing telemetry data from diverse sources poses a significant challenge.

How Nexthink Enables Smarter, Data-Driven Hardware Refresh Strategies

Bob did—until he realized there was a smarter way. With real-time insights from Nexthink, he stopped guessing and started making data-driven decisions: keeping the high performers, replacing the real troublemakers, and upgrading the underperformers. The result? Happier employees, optimized budgets, and devices refreshed based on need—not age.

SquaredUp Cloud + Dashboard Server

SquaredUp Dashboard Server (DS) and SquaredUp Cloud both deliver cutting-edge data visualization for IT and engineering teams. The two products can be used independently, or together for complete operational visibility. This article explores how SquaredUp DS and Cloud differ, when to use each, and how they work together.

Synthetic Monitoring Frequency: Best Practices & Examples

Synthetic monitoring is, at its core, about visibility. It’s the practice of probing your systems from the outside to see what a user would see. But there’s a hidden parameter that determines whether those probes actually deliver value: frequency. How often you run checks is more than a technical configuration—it’s a strategic choice that ripples through detection speed, operational noise, and even your team’s credibility.
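The trade-off between detection speed and check volume can be put into rough numbers. The sketch below uses a deliberately simple model that is an assumption of this example, not something from the article: check duration is ignored, and an alert fires only after a set number of consecutive failed runs.

```python
def worst_case_detection_minutes(interval_min: int, failures_to_alert: int = 1) -> int:
    """Upper bound on the time from outage start to alert.

    If the outage begins just after a passing check, the first failing run is
    almost one interval away, and each further confirmation adds one interval.
    """
    return interval_min * failures_to_alert


def runs_per_day(interval_min: int, locations: int = 1) -> int:
    """How many check executions a given schedule produces per day."""
    return (24 * 60 // interval_min) * locations


# Example: a 5-minute check from 3 locations, alerting after 2 consecutive failures.
delay = worst_case_detection_minutes(5, failures_to_alert=2)
volume = runs_per_day(5, locations=3)
```

Halving the interval halves the worst-case delay but doubles the run count, which is where the cost and noise side of the frequency decision comes from.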

Instrumenting the Node.js event loop with eBPF

Recently, I was testing Coroot’s AI Root Cause Analysis on failure scenarios from the OpenTelemetry demo. One of them, loadgeneratorFloodHomepage, simulates a flood of excessive requests. As expected, it caused a latency degradation across the stack. Coroot’s RCA highlighted how the latency cascaded through all dependent services. At the same time, we noticed a moderate increase in CPU usage for the frontend service and the node itself.

AWS Prometheus: Production Patterns That Help You Scale

You've got Prometheus running in one cluster — maybe a dev environment, a single EKS cluster, or a proof-of-concept setup. The configuration is straightforward: node_exporter on a few EC2 instances, some service discovery for pods, and a single Prometheus server scraping everything. Storage is local, retention is 15 days, and you can keep all the default recording rules without worrying about costs.

I turned error messages into a sales machine (by accident)

Dan Mindru is a Frontend Developer and Designer who is also the co-host of the Morning Maker Show. Dan is currently developing a number of applications including PageUI, Clobbr, and CronTool. I find it remarkable that we’re getting so many AI startups every day. As software engineers, most of us like to know what our software is actually doing. We plan, review, and perform automatic tests to verify it’s working as expected. Then we do a round of manual testing for good measure. Not with AI.

Solve Microsoft Teams Performance Troubles Before They Hit Your Inbox

Who Solves It Faster? Microsoft Native Tools vs. Vantage DX Tickets piling up. Execs on your case. Teams acting up. Microsoft’s tools only show part of the story—leaving you stuck reacting. Watch our pros do a no-fluff, side-by-side showdown: Microsoft Native Tools vs. Martello Vantage DX. Watch them tackle real Teams issues and see who finds and fixes the problem faster. What Attendees will learn.

Elastic Cloud Serverless on Google Cloud doubles region availability

We’re pleased to announce the availability of Elastic Cloud Serverless on Google Cloud in three new regions: This doubles the number of available regions on Google Cloud and dramatically increases serverless deployment options in the US. Elastic Cloud Serverless provides the fastest way to start and scale observability, security, and search solutions without managing infrastructure.

How Nexthink Enables Data-Driven Software License Reclamation

This was what Sarah was looking to solve. Managing software licenses isn’t just tracking installs; it’s about uncovering hidden usage and reclaiming wasted spend. When Sarah faced $12M in software costs, scattered licenses, and zero visibility, she needed a better way. With Nexthink, she gained real-time insights, smart user nudges, and automated reclamation. The result?

Chaos to Choreography: How To Automate IT Operations with Nexthink Flow

Taylor proved it, turning license headaches, VPN chaos, SCCM continuity and patch pain into a smooth, confident performance. Here's how Taylor did it. In the end, IT isn’t just about fixing; it’s about flowing, scaling, and making work effortless. Request a demo today.

The Strategic Imperative: Transforming Platform Sunset into Competitive Advantage

With innovation cycles accelerating, product end-of-life announcements have become an inevitable reality. Infoblox NetMRI, for example, has reached end of life with license sales ending April 2025 and support shutting off by early 2027. Whether it’s a network management platform, IT monitoring system, or enterprise application, the sunset of critical business tools forces organizations into what many view as disruptive, costly transitions.

Modern E2E Testing with Playwright and AI

Pair Playwright with LLMs to plan, generate, refactor, and monitor end-to-end tests, without shipping hallucinations. This webinar showcases a practical workflow: grounding models with fresh docs, driving the browser via Playwright MCP, auto-fixing failing tests, refactoring to POMs, adding API checks, and reusing the same suite for synthetic monitoring in Checkly.

Modern Monitoring, Zero Blackouts: High Availability Reimagined

Downtime is an expensive inconvenience. Yet many IT teams still face monitoring blackouts due to rigid licensing models and outdated failover strategies. In this session, we’ll introduce a smarter approach: High Availability by Design. Whether you're scaling operations or modernizing infrastructure, this session will enable you with the tools and insights to build a resilient, future-ready monitoring strategy.

From Firefighting to Proactive Resolution: How Nexthink Transforms Service Desk Operations

Level 1 engineers face incoming tickets without real-time visibility into endpoints. The result? Endless tool-switching, guesswork diagnostics, missed SLAs, and unnecessary escalations. Critical issues remain hidden until they impact productivity. Then came Nexthink. Now engineers see issues in real time, fix faster, and even prevent problems users don’t notice.

OpenTelemetry Logs - A Complete Introduction & Implementation

OpenTelemetry is a Cloud Native Computing Foundation (CNCF) incubating project aimed at standardizing the way we instrument applications for generating telemetry data (logs, metrics, and traces). OpenTelemetry aims to provide a vendor-agnostic observability framework that provides a set of tools, APIs, and SDKs to instrument applications.

How AI Turns Monitoring From "What Now?" Into "What's Next?"

It's 3 AM. Your phone starts buzzing with alerts, and you stumble to your laptop only to be greeted by a dashboard that looks like the control panel of a nuclear reactor in meltdown: Red lights everywhere. Numbers that should be green are decidedly not green. And your brain, still foggy from sleep, is asking the most fundamental question in all of IT operations: "Okay, yes, there's clearly a problem... but, now what?".

The Blind Spots That Haunt Legal IT

In a recent survey, Udacity’s team explored the evolving landscape of AI adoption by asking 2000 professionals (including those in the legal sector) if they used AI. Unsurprisingly, over 90% of respondents said they did. More concerning, 72% of managers reported personally paying out of pocket for AI tools to use at work, introducing uncontrolled risk into corporate environments.

How GenAI is Shaping Elastic Customer Support

Discover how GenAI has accelerated Elastic's customer and support efficiency. Built on Elastic’s Search AI Platform, the Support Assistant delivers self-service in-product customer support and capacity gains within our support function. Julie Rudd, VP of Support at Elastic, shares how it speeds up issue resolution by combining generative AI with Elastic’s deep knowledge base. Hear directly from a support engineer how the Support Assistant streamlines case resolution and helps engineers and customers find answers faster.

Why Has Network Management Missed Its Own Revolution?

We love to talk about IT revolutions. We celebrate the leaps in innovation that change how we work and live. We look at the 1980s and see the personal computer, which turned computing from a command-line chore into an intuitive experience for everyone. We point to the 1990s as the decade the internet connected the world, the 2000s as the era when virtualization and the cloud broke the chains of physical hardware, and this decade as the dawn of mainstream AI. Each of these moments was transformative.

Why IT Teams Still Struggle with Shadow IT in 2025

Many businesses are still struggling with shadow IT. What is Shadow IT? Shadow IT refers to any software or hardware, including cloud services, that is used without the explicit knowledge of the company's IT department, and it is highly dangerous for any business. Not only does it pose significant security risks like data breaches and increased vulnerability to cyberattacks, but it also puts employees at risk.

Automated BSoD (Blue Screen of Death) Monitoring and Troubleshooting

Yes, BSoDs are still cropping up in high-impact ways in 2025, from flawed Windows updates (especially 24H2 patches) to driver rollouts and heavily-threaded server environments. It remains essential for IT admins to track event reports, test updates in staging, enable rollback strategies, and be prepared with recovery mechanisms.

Observability Day San Francisco: The Future of AI and Observability Is Bright

AI and observability are no longer separate conversations—they’re deeply intertwined. Across keynotes, panels, and demos, speakers at Honeycomb's Observability Day San Francisco unpacked what that means for engineering teams today: faster insights, smarter tools, and new challenges to solve.

OpenTelemetry Observability: An In-Depth Look at Features and Best Practices

OpenTelemetry (OTel) is a unified framework of APIs, SDKs and tools, for collecting, processing, and exporting telemetry data (logs, metrics, and traces) across applications and infrastructure. OTel is especially required in today’s cloud-native world, where applications run on microservices, Kubernetes, and distributed systems.

Database Monitoring Challenges Every DevOps Engineer Should Know

Databases form the critical foundation of modern applications, and maintaining their performance and reliability is essential for operational efficiency and user satisfaction. Effective database monitoring, however, presents numerous challenges. Modern systems produce extensive metrics, operate across diverse environments, and must scale in line with growing workloads, all while ensuring compliance and security.

LLM app Observability: OpenTelemetry as a standard

LLM observability is broken. There are too many new libraries floating around, but they don't accurately follow the OpenTelemetry conventions. OTel isn’t perfect for LLMs yet, but extending a proven standard beats inventing another one. Why not use the same standard (OTel) that works so well for the rest of the apps, and just build on top of it? This is what I was ranting about with Pranav Raj S, co-founder at Chatwoot, and we thought there must be other folks facing similar issues.

Internal SLAs for Third-Party Vendors: Complete Guide

Managing third-party vendors effectively requires clear expectations and measurable standards. Internal SLAs for third-party vendors provide the framework to track vendor performance, ensure compliance, and maintain service quality across your entire vendor ecosystem. This guide covers everything you need to establish and manage vendor SLAs that protect your business interests while fostering productive vendor relationships.

Your Next Observability RFP Is All Wrong: Why AI Changes Everything

Watch how AI is reshaping observability for the years ahead. In this fireside chat, Logz.io founders Tomer Levy and Asaf Yigal reveal how the most innovative AI-first companies are breaking free from dashboards, avoiding common RFP mistakes, and building future-ready stacks. You’ll see: Watch and learn how autonomous AI eliminates noise, slashes costs, and gives engineering teams back their velocity.

Proactively monitor Kerberos-authenticated web apps and APIs with Datadog Synthetics

When employee authentication fails or becomes unreliable, users can lose access to the critical systems they need. Authentication enables access to internal tools like HR applications, finance portals, and internal dashboards, so even short outages can interrupt day-to-day work, while persistent issues increase the risk of broader operational disruption.

Single-tenant vs. multi-tenant architecture with Grafana Cloud: A guide to choosing the right approach

Grafana Cloud’s flexibility is one of its greatest strengths, but the breadth of choices can sometimes be overwhelming. We see this a lot when it comes to selecting the right architectural approach, with organizations unsure of how many stacks they need to host their environment. Grafana Cloud provides robust features for managing tenancy, enabling organizations to effectively handle diverse teams and projects.

Datadog in the Era of AI

AI is changing everything. At Datadog, our approach is two-fold: empower you with complete observability across your entire stack, including AI as you incorporate it, and harness emergent technologies to make Datadog even more powerful. Join VP of Product Michael Whetten to see how Datadog is accomplishing these two approaches. He'll share the latest feature updates and new products designed to help you thrive in an AI-powered world. Plus, get a look at our long-term vision for the future of AI and its impact on your work.

Bridging the Network Cost Gap: Why Operators Need Real-Time, Traffic-Based Cost Intelligence

Jezzibell Gilmore’s latest blog dives into the critical challenge network operators face: bridging the gap between massive traffic growth and understanding its actual cost. Learn why real-time, traffic-based cost intelligence is no longer optional for maintaining margins and driving revenue in today’s complex network landscape.
Sponsored Post

Implementing Agentic AI: A Technical Overview of Architecture and Frameworks

As businesses strive for smarter, faster operations, Agentic AI redefines enterprise operations, introducing solutions for autonomous decision-making and tackling complex challenges with precision. Agentic AI introduces an intelligent, enterprise-focused approach to enhancing operational efficiency and adaptability, paving the way for innovation. Its ability to support operational scalability and streamline workflows positions it as a vital tool for modern IT ecosystems.

ManageEngine named in the 2025 Gartner Magic Quadrant for AI Applications in ITSM

We're proud to announce that ManageEngine has been recognized in the 2025 Gartner Magic Quadrant for AI Applications in ITSM. This recognition comes after Gartner's comprehensive evaluation of our Completeness of Vision and Ability to Execute. We believe this recognition reflects our commitment to making AI-driven ITSM cost-effective, easy to implement, and scalable to meet modern enterprises' growing needs.

Why AIX Automation Starts with Better Monitoring: How Galileo Powers Smarter Action

If your automation can’t trust the data it’s acting on, it’s not automation. It’s a guess. That’s why AIX automation monitoring is the foundation for success. Many teams encounter this gap when trying to automate AIX operations. Red Hat Ansible Automation Platform (AAP) and Event-Driven Ansible (EDA) can absolutely streamline routine tasks, like expanding filesystems or tuning adapters. But every playbook still depends on one thing: accurate, real-time monitoring.

Track the performance of your HPC workloads with Datadog's AWS PCS integration

AWS Parallel Computing Service (AWS PCS) is a managed service that helps users run and scale their high performance computing (HPC) workloads. AWS PCS uses Slurm, an open source workload manager, for scheduling and orchestrating simulations, which enables users to build their scientific and engineering models in a familiar HPC environment.

Announcing Dynamic Service Insights in LogicMonitor Envision

If you’re in IT operations, you’ve likely faced the disconnect firsthand: your dashboards say everything’s green, but your business stakeholders are asking why the website is slow, the customer portal is timing out, or a regional service is underperforming. Your team is usually on top of issues, such as monitoring infrastructure health, resolving alerts, and keeping systems online. But the business isn’t looking at device uptime.

Redefining Resilient IT: Edwin AI, Service Intelligence, and What's Next for LogicMonitor

Downtime is more than an inconvenience these days, and it is no longer solely a problem for the ITOps team. Since every organization is a digital business, downtime can cost millions of dollars per hour, stall innovation, and erode customer trust. Yet most IT teams are still trapped in reactive mode, scrambling across fragmented tools and drowning in alert fatigue. That model no longer works. The future of IT is about foresight, not firefighting.

Future-Proofing Your Historian with a Time Series Database

As technology scales and data volumes accelerate, organizations face a pressing challenge: how can they modernize data infrastructure without putting daily operations at risk? Data historians, specialized databases that capture and store time-stamped machine and sensor data, have long been the foundation for reliability and compliance. However, they were not designed for the openness and advanced analytics that modern workloads demand.

Making the invisible visible: Are your cloud firewalls and DDoS protection really working?

Every business builds strong defences to keep attackers out. Firewalls and DDoS protection serve that purpose, standing guard over company apps and websites, like knights at the castle gate keeping out trolls (not just the ones on X). But here’s the problem: those defences only work if users actually walk through the front gate. Sometimes, people find hidden paths or side doors around your walls, so the guards never see them enter.

What does the EU Data Act mean for Observability?

The EU Data Act entered into force on January 11th, 2024, and most of its provisions apply from September 12th, 2025. The EU Data Act is designed to give individuals and businesses more control over the data they generate, ensuring fair access, use, and sharing across sectors. For any data-generating platform that intends to operate in the European Union, this new legislation matters.

SQL performance improvements: finding the right queries to fix (part 1)

A few weeks ago, we massively improved the performance of the dashboard & website by optimizing some of our SQL queries. In this post, we'll share how we identified the queries that needed work. In the next post, we'll explore how we fixed each of them. We'll cover the basics and gradually work our way up to the more advanced/complex ways of identifying slow queries. Let's go!

LibreNMS + VictoriaMetrics: The Ultimate Monitoring Duo

Get the best of both worlds! Love LibreNMS but need more power for long-term metrics? Integrate it with VictoriaMetrics! We show you a surprisingly simple way to combine these two powerhouses for advanced querying and storage, and even how to A/B test your setup. Want more ways to power-up your monitoring stack? Subscribe for more integration guides and pro tips!

Icinga Experience: Insights from Real-World Icinga Deployments Across Industries

Modern IT environments are hybrid, distributed, and constantly growing. To keep them reliable, organizations rely on monitoring that scales, automates, and integrates seamlessly into existing workflows. We collected 24 Icinga customer stories from industries including finance, telecom, manufacturing, and public services. What unites them is the choice of Icinga as a flexible and cost-efficient alternative to proprietary monitoring tools.

Faster, more memory-efficient performance in Grafana Mimir: a closer look at Mimir Query Engine

Until recently, Grafana Mimir — our open source, horizontally scalable, multi-tenant time series database (TSDB) — has exclusively used Prometheus’ PromQL engine to evaluate queries. While the PromQL engine works great, it sometimes needs a lot of memory to run, specifically in the Mimir querier component. To address this memory consumption issue, we recently introduced Mimir Query Engine (MQE).

What is Asynchronous Job Monitoring?

Modern applications don’t process everything inside the request/response path. To keep APIs responsive, time-consuming work like image resizing, payment processing, or data syncs is moved into background queues. Workers then pick up these asynchronous jobs and run them outside the main thread. Asynchronous job monitoring is the practice of tracking these background tasks. Without this visibility, background workers become a blind spot.

Cribl.Cloud Goes to Washington: Cribl.Cloud Government FedRAMP Authority to Operate Milestone

Way back in 2009, when I was serving as a second lieutenant in the U.S. Army, I worked in a network operations center for a deployed Army unit. Our mission was to provide network connectivity across central and northern Iraq. Our observability tools were incredibly limited. We had a network map that would turn nodes and network links red, yellow, and green when they were up or down. We had to write down in a physical logbook any status changes and what we did about them.

BYOS with Cribl Lake: Data ownership meets flexibility

Today, more than ever, organizations face a difficult balancing act: how to keep sensitive data fully under their control while still making it accessible and usable so teams can unlock the value and insights they need. Industries such as financial services, healthcare, and government agencies often must comply with strict regulations that require data to remain in environments they directly own and manage.

You can't understand digital experience without monitoring from where your users actually are!

If you’re only monitoring your applications, you’re missing what your customers are actually seeing. Performance issues don’t happen in a vacuum. They happen at the edges, on mobile devices, over congested networks, in last-mile dead zones. Monitoring only works when it’s aligned with reality. And reality starts at the user.

Cribl.Cloud Government Is a New Era of Secure Cloud Telemetry for Federal Agencies

As a Co-founder and CPO at Cribl, I'm genuinely stoked that our new federal suite, Cribl.Cloud Government, has achieved an “In Process” designation under the Federal Risk and Authorization Management Program (FedRAMP). This isn’t any old milestone. We’re bringing all of Cribl’s kickass capabilities to government agencies, even those that require the strictest compliance and security standards. Because, who doesn’t love a good set of rules?

CloudSpend for iOS 26 for sharper, smarter, and simpler cloud cost management

Experience seamless control, clarity, and cost optimization with the CloudSpend app on iOS 26. This update integrates Apple’s new Liquid Glass design and secure, on-device AI summaries to deliver instant insights into your cloud spending, empowering you to act decisively from anywhere.

Logs & Lattes: Episode 1 - Smart Logging Without the Price Trap

How much value are you really getting from your logs, and what are you giving up to stay on budget? In this episode of Logs and Lattes, host Palmer Wallace sits down with Seth Goldhammer, VP of Product Management at Graylog, for a candid conversation about the hidden cost of traditional SIEM pricing. Seth explains how ingest-based and resource-heavy licensing models pressure security teams into tough tradeoffs, such as dropping logs, tuning down detections, or limiting retention just to avoid budget overages.

The Evolution of Digital Adoption: Insights from Gartner's 2025 Market Guide

The 2025 Market Guide for Digital Adoption Platforms (DAPs) marks an important point in the evolution of the category. Digital adoption has matured from a supporting role into a central part of enterprise strategy. Organizations are no longer asking if they need a DAP—they’re asking which one. In this latest research, Gartner establishes DAPs as essential to business transformation, efficiency, and employee experience. The takeaway is clear: digital adoption is no longer optional.

OpenTelemetry Operator Complete Guide [OTel Collector + Auto-Instrumentation Demo]

Manually deploying and managing OpenTelemetry components in a Kubernetes environment can be a complex and time-consuming task. It involves creating various Kubernetes resources, setting up configurations, and ensuring the components are properly integrated with the applications.

Logstash Alternative: Why Security Teams Are Choosing Modern Data Pipelines

Logstash has been a workhorse in data processing pipelines for years, but it was not designed with today’s security operations in mind. Security teams now deal with massive telemetry volumes, rising SIEM costs, and diverse log formats that require constant normalization. In this environment, Logstash shows its age: manual configuration, outdated parsing, and scalability bottlenecks introduce fragility instead of efficiency.

Distributed performance testing for Kubernetes environments: Grafana k6 Operator 1.0 is here

Performance testing is critical to build reliable applications, but testing at scale, especially inside modern Kubernetes environments, can be a challenge. For example, how do you coordinate tests across multiple nodes, test private services without compromising security, or even do both at once? And most importantly, how do you do all this without adding too much operational complexity to your stack?

Kubernetes Service Discovery Explained with Practical Examples

In Kubernetes, applications are constantly changing — new pods start, old ones shut down, workloads shift across nodes. The challenge is making sure that different parts of your system, and even external clients, can still find each other when the actual locations keep moving. That’s what service discovery handles. It provides a stable way for applications to connect and communicate, no matter where they’re running or how often the underlying infrastructure changes.

Speed improvements to the dashboard, website & job processing

Over the past month, we dedicated time and resources to optimising the speed and experience of our public website, our dashboard, and the behind-the-scenes uptime checks that we perform. Overall, our website and dashboard feel about 2x to 3x faster. The biggest gains are for users who have > 100 sites on their dashboard: they'll get a noticeably faster loading time. For those biggest users, the dashboard is quite literally 10x faster.

Observability and IT Monitoring Governance: Establishing Order (Part 3 of 4)

In our previous posts, we explored why robust IT monitoring governance is no longer a luxury but a strategic imperative. We highlighted how a disciplined framework prevents blind spots, reduces risk, and ensures the reliability and scalability of your critical business applications. But how do you translate these principles into practical, actionable governance within your IT environment?

Pastries with SREs: OTel me where the cronuts are

In this episode of Pastries with SREs, we tackle a debated observability topic: Do you need a Single Pane of Glass, or is OpenTelemetry a better strategy? About Elastic: Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform, the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Unlock Real-Time AWS Observability With Streaming Ingestion in DX Operational Observability

In fast-paced cloud environments, traditional monitoring methods often fall short. This leaves teams with latency and data gaps. It’s time to gain near real-time visibility into your AWS telemetry, enabling faster incident response and deeper insights. With its new streaming ingestion capabilities, DX Operational Observability (DX O2) is revolutionizing cloud monitoring—enabling teams to leverage AWS CloudWatch Metric Streams and Amazon Kinesis Data Firehose.

Frontend JavaScript performance testing: A comprehensive guide

When a page pauses for even a quarter-second, users feel it, and many will tab away before the spinner stops. Front-end performance testing lets us spot those delays on our own machines instead of reading about them in support tickets. The browser runs JavaScript, layout, painting, and every user interaction on a single main thread. If one task takes too long, everything else queues up behind it.

Top Node.js Application Challenges and How Monitoring Solves Them

Deploying a Node.js application may feel straightforward at first. Everything checks out in tests, staging runs smoothly, and early users run into no problems. But as real traffic ramps up, hidden problems start to appear in unexpected ways. Requests fail intermittently, latency spikes without warning, memory usage climbs silently, and logs are scattered across multiple processes making it nearly impossible to trace the root cause.

Bridging the Gap: Integrating Logs, Metrics, and Flow for Observability

In this video, we discuss handling both old and new systems in IT environments. From legacy SNMP setups to modern telemetry, most organizations juggle multiple data sources, which can make observability feel overwhelming. We explore how to combine logs, metrics, and flow data into one system that provides actionable insights. You’ll see practical examples of simplifying scattered tools and making sense of complex, disparate information. Understanding how these different types of data work together is key to getting observability right.

Why Doctors Now Recommend Wearable Technology for Elderly Parents

Wearable technology for elderly parents is becoming increasingly vital as we face an unprecedented demographic shift. By 2050, the global population of individuals aged 65 and older is projected to reach nearly 1.5 billion. In the United States alone, that population is projected to rise from 46 million in 2016 to over 98 million by 2060. These statistics highlight why we need innovative solutions for senior care.

Smoother, smarter observability with the updated Site24x7 iOS 26

Enjoy improved control, clarity, and communication using the Site24x7 app on iOS 26. This update blends Apple's dynamic liquid glass design language with fast, secure, on-device AI summaries that help you observe your IT stack instantly and act decisively, from anywhere.

10 Most Common Network Devices & How to Monitor Them

When it comes to running a reliable IT infrastructure, network devices take center stage. They sit quietly in the background (routing packets, securing traffic, and keeping teams connected) until something goes wrong. The truth is, without them, nothing works. Every network device has a specific role: some connect users, some protect data, others balance traffic or bridge different protocols.

Background Job Observability Beyond the Queue

Background jobs handle the critical work that happens outside the request path: processing payments, sending emails, generating reports, syncing data. They keep applications running smoothly, but the signals they produce look different from API endpoints. Most teams start with queue metrics—how many jobs are waiting and how quickly they complete. These metrics provide the foundation, but job health extends beyond throughput.

What is Database Monitoring? A Guide for Developers, DevOps, and SREs

Databases handle critical operations for applications, from online banking to e-commerce and streaming services. Any slowdown or failure can directly affect application performance and user experience. Database monitoring tracks performance, detects issues, and helps prevent downtime. It also ensures efficient use of resources, maintains security, and supports compliance requirements.

Hand Code or no Code, Scout Keeps Error Monitoring Out of the Log Mess

You’ve built something, it’s live, and users are starting to show up. Maybe you programmed it from scratch, used a tool, or vibe-coded it into existence. No matter how it came to be, the fact that you’ve got users is great! But here’s a question every new developer must eventually ask: how do I know when my site is actually failing? The thing is, these failures aren’t always obvious.

How to connect ServiceNow to Grafana Cloud IRM incidents

Companies rely on a variety of services to streamline their workflows, which often requires data synchronization or information sharing across platforms. But are your tools flexible enough to connect with external systems? ServiceNow is widely recognized for its robust and complex workflow support for enterprises. However, it may not always offer the most intuitive or user-friendly experience when handling incidents.

The Performance Impact of Session Replay Scripts

Session replay vendors love talking about features and pricing, but rarely publish the technical specs that matter most to developers. We analyzed the actual JavaScript bundles and their performance impacts across five major platforms. You know what's wild? Bundle sizes range from 36KB to over 550KB gzipped. That's the difference between imperceptible impact and noticeable slowdown for your users.

Introducing AI Drive: Closing the AI Value Gap

The enterprise is standing at the edge of a seismic shift: an AI revolution. In the next five years, the way work gets done will be fundamentally reshaped as workflows once handled by humans are increasingly replaced or enhanced by artificial intelligence. But here’s the reality: success won’t come from simply handing out Copilot or GPT licenses and hoping employees figure it out.

Reality Bytes: How AI Drive Will Help YOU Win the AI Race (w/ Matt Rose)

In this special Reality Bytes, the team sits down with Matt Rose, Product Marketing Manager at Nexthink, to talk about the just-announced AI Drive. Matt breaks down how AI Drive helps organizations finally make sense of their scattered AI landscape—bringing visibility, usage insights, guidance, and measurable impact into one place inside the Nexthink Infinity platform. He also shares early customer feedback, the future roadmap, and why AI adoption has become a critical competitive differentiator.

LangChain Observability: Monitoring Guide for Production Apps

LangChain applications fail differently than traditional web apps. A single user request can trigger 15+ LLM calls, cost $5 in tokens, and fail silently without throwing errors. One team discovered a $12,000 OpenAI bill caused by a recursive chain with no monitoring. This guide shows how to implement observability for LangChain applications, giving you complete visibility into performance, costs, and errors before they impact your users or budget.

The Personalization Paradox: When Tailored UX Turns "Creepy"

“Stop watching me.” That’s an actual message a user typed into a search bar, captured during session monitoring. They weren’t talking to customer support. They were talking to the algorithm. It sounds absurd until you realize how common this is. When users believe a human is behind your personalization system, attributing consciousness to your automated algorithms, everything changes. Their behavior becomes erratic. Your conversions tank. And nobody talks about it.

Fixing the Reconciliation Gap: Why Order to Cash Breaks Across Industries and How to Close It

Whether you sell consumer goods, ship freight, manufacture vehicles, process payments, underwrite insurance, or manage hospital claims, your business depends on the same thing: order to cash. Orders are created, fulfilled, invoiced, and paid. In principle, it should be simple. In practice, the process is riddled with breaks. Most companies believe they are covered. They run ERP systems like SAP. They use EDI gateways such as Sterling.

You can now choose the frequency of checks

As part of our big deploy that added ping and TCP monitoring, we’ve also shipped a small, but often requested feature: you can now choose the frequency of the checks we run. By default, we check your website for uptime every minute. The Lighthouse check runs daily. Using our new feature, you can now, for instance, choose that the uptime check should run every 2 minutes, and the Lighthouse check every 5 days. You can choose the frequency in the settings of the check.

What's Really Happening in Your Branch Office Network?

The great return to the office is in full swing, but the office doesn't look like it used to. Today's enterprise is a fluid entity, with employees collaborating across home offices, corporate headquarters, and geographically dispersed branch locations. This has elevated the branch office from a simple satellite to a critical hub of productivity and innovation.

Website Monitoring by Error Type: DNS, TCP, TLS, and HTTP

When a website goes down, the failure often feels like a black box. Visitors see a spinning wheel, a cryptic error code, or a blank page. For the people responsible for keeping that site online, the first question is always the same: what broke? The truth is that there is no single way a website “goes down.” Instead, a request from a browser passes through multiple steps—DNS resolution, TCP connection, TLS negotiation, and HTTP response. Each step depends on the ones before it.
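The layered view above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own tooling: the names `first_failing_stage` and `build_steps` are hypothetical, the checks use only the standard library, and treating any 5xx status as an HTTP-stage failure is an assumption made for the example. The idea is simply to run the stages in order and report the first one that raises.

```python
import socket
import ssl
from typing import Callable, Optional, Sequence, Tuple

# A step is a stage name plus a zero-argument check that raises on failure.
Step = Tuple[str, Callable[[], None]]


def first_failing_stage(steps: Sequence[Step]) -> Optional[Tuple[str, str]]:
    """Run steps in order; return (stage, error text) for the first failure, else None."""
    for name, check in steps:
        try:
            check()
        except Exception as exc:  # any exception marks this stage as the culprit
            return name, f"{type(exc).__name__}: {exc}"
    return None


def build_steps(host: str, port: int = 443, timeout: float = 5.0) -> Sequence[Step]:
    """Hypothetical helper: build DNS -> TCP -> TLS -> HTTP checks for one host."""
    state = {}  # later stages reuse what earlier stages produced

    def dns() -> None:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        state["addr"] = infos[0][4]

    def tcp() -> None:
        state["sock"] = socket.create_connection(state["addr"][:2], timeout=timeout)

    def tls() -> None:
        ctx = ssl.create_default_context()  # verifies certificate and hostname
        state["tls"] = ctx.wrap_socket(state["sock"], server_hostname=host)

    def http() -> None:
        conn = state["tls"]
        try:
            request = f"HEAD / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            conn.sendall(request.encode())
            status_line = conn.recv(1024).split(b"\r\n", 1)[0].decode()
            code = int(status_line.split()[1])
            if code >= 500:  # assumption: only 5xx counts as an HTTP-stage failure
                raise RuntimeError(f"server error {code}")
        finally:
            conn.close()

    return [("dns", dns), ("tcp", tcp), ("tls", tls), ("http", http)]


if __name__ == "__main__":
    # Example run against a placeholder host; requires network access.
    print(first_failing_stage(build_steps("example.com")) or "all stages passed")
```

Because each stage depends on the ones before it, stopping at the first failure is enough to say which layer broke. A fuller check would also close sockets left open when an intermediate stage fails.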

Microservices Failures and Cascading Outages: Prevention Guide

Microservices architecture offers tremendous benefits for scalability and flexibility, but it also introduces new failure modes that can quickly spiral out of control. When one service fails in a distributed system, the impact can cascade across services like dominoes falling, creating widespread outages that affect your entire application. Understanding how these cascading failures occur and implementing the right defensive patterns is crucial for building resilient microservices.

How External Dependencies Affect SLAs: Managing Third-Party Risk

Modern applications rely heavily on external services to function properly. From payment processors to CDN providers, these external dependencies can significantly impact your ability to meet Service Level Agreements. Understanding how external dependencies affect SLAs is crucial for maintaining reliable services and managing customer expectations.

13 Proven Node.js Monitoring Best Practices You Need

What if your Node.js application suddenly froze during peak hours? Imagine thousands of users trying to log in, make payments, or send messages; instead, they’re stuck waiting. Every second feels like a countdown to frustration, churn, and bad reviews. The truth is, Node.js is powerful but unforgiving. It runs on a single-threaded event loop, meaning just one poorly optimized task or slow dependency can bottleneck your entire app. When performance slips, it affects every customer simultaneously.

Introducing Cost Meter - Proactive Observability Cost Control with Per-Hour Granularity

The irony isn't lost on us - observability platforms are built to be proactive about system health, yet when it comes to managing observability costs themselves, teams are forced to be reactive. Today, that changes with Cost Meter, now live in our platform. Cost Meter transforms observability spend management from a monthly billing surprise into a proactive, data-driven process with hourly aggregated metrics that give you complete visibility into your telemetry ingestion patterns.

What is Service Catalog Observability and How Does It Work?

A service catalog gives teams a shared view of their systems—what services exist, who owns them, how dependencies are structured, and the SLAs that guide expectations. It’s an important part of development infrastructure because it helps everyone speak the same language about services. Service catalog observability builds on that foundation.

How to Ensure Regulation Compliance as a Government Contractor

The government contracting sector is a highly regulated business environment. Entering this sector requires transparency, accountability and expertise. You must also familiarize yourself with regulatory bodies and their standards to boost your reputation in the eyes of federal agencies. Discover how you can ensure compliance with regulations as a new government contractor.

Early Warning Signals now in Webex

We’re happy to announce that Early Warning Signals are now available in Webex! With Webex now supported, Early Warning Signals are available across all chat integrations—including Microsoft Teams, Slack, Google Chat, Discord, Webhooks and now Webex—plus email and SMS. No matter where your team communicates, you’ll never miss the early signs of an outage.

Behind the Dashboard: How to monitor your LLM integrations

Behind the Dashboard is an ongoing series where we look under the hood of a specific Catchpoint feature. Each episode breaks down the technology itself, what’s challenging about using it for monitoring, and how we removed friction and toil to make it a valuable part of the Catchpoint platform. In this episode, Leon, Mursi, and Rahul take a look at Catchpoint’s LLM monitoring capabilities, including ensuring your integrated LLMs are up and performing optimally, as well as knowing whether you’re using the most effective (accurate) and economical (cheapest per query) option in your suite.

The Cost of Ignoring Expired SSL Certificates for Businesses

SSL certificates secure the digital backbone of businesses. They encrypt data, protect customer trust, and ensure compliance with strict regulations. Yet many companies still face the cost of ignoring expired SSL certificates every year. When a certificate expires, the consequences hit hard: websites go offline, users see security warnings, and revenues drop. Let's break down the risks, costs, and ways to prevent expired SSL certificates from damaging your business.

Introducing Honeycomb Intelligence Canvas

Canvas is an AI-guided workspace inside Honeycomb that combines an AI assistant with an interactive notebook for visualizing query results and traces. You can ask a natural language question about your data and Canvas will immediately start exploring your traces, through multiple queries and other tools, to find the right next steps. Instead of having to write each query yourself, Canvas automatically proposes relational queries, comparisons, and visualizations that explain why an SLO fired or what changed after a deploy.

Detect Email Delays Before They Hit Users - Monitor O365 with eG Enterprise

Email downtime or email delays can significantly disrupt business operations, making proactive monitoring essential to avoid problems. In today’s hybrid work environments, email remains a critical communication channel for customer interactions, internal collaboration, and workflow approvals. Even brief outages or delays in email delivery can lead to missed opportunities, poor customer experience, SLA (Service Level Agreement) breaches and reputational damage.

Driving Customer Success Beyond Deployment

In the rapidly evolving landscape of IT operations, businesses face the constant challenge of staying ahead of emerging technologies and shifting market demands. Implementing new systems or solutions is not just about the initial setup. It is about ensuring long-term success, reducing risk, and unlocking sustained value for the organization. That is exactly what SL360, our comprehensive customer success framework, was designed to deliver.

APM vs Observability: Observing beyond APM

In my previous post I made a bold, sweeping statement that APM is not - in the most specific sense - a subset of observability. I stand by that because words matter and - like many "monitoring engineers" (IT folks who make monitoring and observability their specialty) - I, too, bear scars from the flame wars on Twitter back in the 2020s, where we fought internecine battles over the proper definition of (and number of pillars in) “observability”.

Debugging issues with Sentry's MCP

Turns out, this MCP thing is pretty solid. We've built the MCP server to tap into all the different areas of context within Sentry and make it easy to bring these into your editor client to help debug your application. Want to know the most fixable issues in your environment? Easy. Want to see your query performance for your backend? Just ask it.

Understanding OpenTelemetry Spans in Detail

Debugging errors in distributed systems can be a challenging task, as it involves tracing the flow of operations across numerous microservices. This complexity often leads to difficulties in pinpointing the root cause of performance issues or errors. OpenTelemetry provides instrumentation libraries in most programming languages for tracing.

DevOps Guide to Monitoring in Serverless Applications

Serverless computing helps teams move faster by removing the need to manage servers. Code runs only when needed, scaling up or down automatically. For DevOps engineers, this means quicker deployments and less infrastructure work. But serverless also brings new challenges. Functions run for short periods, making it hard to track errors, performance, and costs.

Breaking Free from SQLite - Why We Added PostgreSQL Support to SigNoz

"Let us support different relational databases apart from SQLite. Nobody likes to run SQLite in production." This was one of the most requested features from our community. Your requests have been heard, and we've added support for different relational databases, starting with PostgreSQL. If you're self-hosting SigNoz, you no longer need to worry about SQLite's limitations. Let's dive into what we've built and why it matters for your production deployments.

Pastries with SREs: Limitless observability and uncompromised donuts

In this episode of Pastries with SREs, we dig into Limitless Observability with a sweet side of unified observability strategy. If you're tired of siloed tools, fractured data, and swivel-chair investigations, this one’s for you. We explore: Why are silos still the norm in modern observability? What’s the true cost of inefficiencies across logs, metrics, and traces? How can SREs, IT operations, and dev teams shift to a no-compromise, unified observability model?

How we used Sentry's User Feedback widget to shape Logs throughout beta

At Sentry, we build in public and we move fast. But moving fast means we don’t always get everything right on the first try. That’s where feedback comes in: it helps us validate what’s working, spot what’s missing, and catch issues we wouldn’t always see through error tracking alone.

Debug, query, and build faster with AI: How we use Grafana Assistant at Grafana Labs

We recently released Grafana Assistant into public preview for Grafana Cloud, and we’ve been excited to see how our customers have already made it part of their daily observability routines. At the same time, Assistant is becoming a go-to companion for developers right here at Grafana Labs, whether they’re debugging on-call issues, helping customers, or trying to remember tricky PromQL syntax.

Meet Canvas: Your AI-guided Workspace Within Honeycomb

Modern systems are wonderfully capable, but relentlessly complex. Debugging across microservices, frontends, and cloud edges often means switching between five or more tools, trying to stitch together “what changed” and “why it broke.” Honeycomb’s wide events model has proven to be a superpower for taming that complexity, by allowing you to easily observe and query end-to-end traces without worrying about how much granular data you attach to your events.

How AISPM Helps Achieve Continuous Cybersecurity Monitoring

Cybersecurity threats evolve at breakneck speed. What worked yesterday might fail tomorrow. Organizations need monitoring systems that never sleep, never blink, and never miss a beat. This is where AI-powered Security Performance Management (AISPM) transforms how we protect digital assets.

Reducing Compliance Gaps with Continuous Monitoring Solutions

Organizations face an increasingly complex web of regulatory requirements that demand strict adherence to security protocols. Among these challenges, maintaining firewall compliance stands as a critical yet often overlooked aspect of cybersecurity strategy. Many companies struggle with compliance gaps that leave them vulnerable to breaches, regulatory penalties, and operational disruptions.

Instrumentation Your Way: Introducing a Combined Splunk AppDynamics Agent

In 2025, microservices are everywhere and Kubernetes is the de facto standard for operating cloud native applications. But not all apps are built in microservices architectures. For most enterprises, hybrid environments are the reality, with their business run on a mix of three-tier and cloud native applications.

Introducing Anomaly Detection: Your Early Warning System for Service Health

Modern engineering teams face a persistent challenge: knowing when something goes wrong before their customers do. With microservices architectures sprawling across dozens or hundreds of services, creating comprehensive alerting becomes an overwhelming task. You're left playing whack-a-mole with manual alert configurations, often missing critical issues or drowning in false positives.

Full-Stack Observability with VictoriaMetrics in the OTel Demo

The OpenTelemetry Astronomy Shop is a widely used demonstration environment designed to illustrate the concepts and practical implementation of observability in distributed systems. Built as a microservice-based e-commerce application, the demo provides developers with a near real-world environment where they can explore how telemetry data—metrics, logs, and traces—can be collected, processed, and visualized.

It broke... let's fix it with Sentry MCP and Seer

Real debugging starts in the editor where you're probably digging through the last commits wondering what random thing changed. Fortunately, you're probably using Sentry and it's going to give you that information. Sentry's MCP is the best way to bring all that context of what broke and how, into your editor so you can fix broken things faster. With Seer, you can bring in the root cause, and solution, and have tools like Cursor or Claude Code go fix it. We'll show you how.

APM for Kubernetes: Monitor Distributed Applications at Scale

When a payment service runs across 12 pods — each serving different customer segments — and an authentication layer spans three namespaces, performance issues can originate in both the application code and the orchestration layer. The challenge is linking request-level performance data with what’s happening inside the cluster: container CPU limits, pod scheduling decisions, and node-level events.

AI Wrote Your Bugs, AI Will Fix Your Bugs

There are a lot of JavaScript developers these days not actually writing code. They whisper sweet prompts to our AI tools and hope for the best. Is it really any worse than copy-pasting from StackOverflow? Welcome to the era of vibe coding, where understanding your code is optional and “it works on my machine” has evolved into “the AI said it would work.”

The Real ROI of Using an APM Tool for SaaS Businesses

For every SaaS leader, engineer, and operations professional, growth is always the main goal. You’re expected to release features quickly, keep user experiences smooth, and manage everything within a limited budget. But behind the scenes, your application may have hidden issues such as slow performance, unnoticed errors, and laggy transactions that quietly eat away at revenue, reduce customer trust, and exhaust your engineering team.

Prevention Cure: The Happiness Factor in IT Health (w/ EDGE Solutions)

In this episode Tim and Tom sit down with Sean Thomas, Managing Director of EUEM Business at Edge Consulting, to explore how happiness, health, and IT success are all connected. They dive into the parallels between personal health and IT health, the evolution of proactive and preventative IT, and how organizations can better operationalize digital employee experience (DEX). From vendor partnerships and process investments to the role of AI in shaping the next generation of IT management, Sean offers valuable insights for leaders navigating today’s complex digital workplace.

Serverless Monitoring for Modern Industries: Compliance, Scalability, and User Experience

Serverless computing has changed the way developers build and scale applications. With event-driven execution, automatic scaling, and a pay-as-you-go model, it removes the need to manage servers and helps teams move faster. This is why industries like FinTech, e-commerce, and media streaming are adopting serverless at a rapid pace. But serverless also brings new monitoring challenges. Functions are short-lived, run in different places, and are triggered by many types of events.

Data Historians vs. Time Series Databases: A Practical Path Forward

Industrial data strategy often feels like a choice: keep legacy systems or replace them outright. But neither extreme is ideal. Full replacements are disruptive and costly, while avoiding change leaves businesses stuck with tools that limit growth. The better path is incremental. Each organization has different needs, and modernization works best when you build on proven systems while adding new capabilities.

Debugging and logging in Laravel applications

Logic errors, failed HTTP requests, background jobs that ghost silently—software breaks in all kinds of fun ways. The difference between resilient systems and fragile ones isn’t about avoiding errors altogether. It’s about how fast and clearly you can see what went wrong, and fix it. Laravel gives you a solid foundation: structured logging, real-time introspection, and built-in performance monitoring.

Monitor Cloud-Native & Hybrid Apps and Business Transactions With Observability Cloud APM

As organizations modernize, most applications don’t fit neatly into one category—they span both traditional three-tier architectures and cloud-native microservices. To monitor these hybrid environments effectively, teams need APM tools that can seamlessly connect the two worlds.

Introducing Honeycomb Intelligence Anomaly Detection

Modern teams face a persistent challenge: knowing when something goes wrong before their customers do. With architectures sprawling across dozens or hundreds of services, creating comprehensive alerting becomes an overwhelming task. You're left playing whack-a-mole with manual alert configurations, often missing critical issues or drowning in false positives. Today, we're excited to announce our solution to this challenge: Anomaly Detection (currently in alpha), Honeycomb's proactive approach to understanding and acting on service health.

Visualize Logs Alongside Metrics: Complete Observability Elasticsearch Performance

Elasticsearch is a distributed search and analytics engine that powers everything from log management platforms to e-commerce search bars. It excels at indexing and retrieving large volumes of data quickly, but like any complex system it can slow down under heavy load or inefficient queries.

A smarter filter for Grafana Alerting: Introducing a new way to find your alerts

At Grafana Labs, we believe that effective alerting is the cornerstone of any robust observability strategy. That’s why we’re constantly listening to your feedback and working to improve the Grafana user experience so it’s easier for you to manage and interact with your alert rules. Today, we’re excited to tell you about an update in Grafana Alerting that’s built to address some of your biggest pain points.

What is the Internet Stack... and why should you care?

We talk a lot about the application stack, the code and services you build. However, just as critical is the infrastructure that delivers that code to your users. That’s the Internet Stack: a complex chain of technologies and services, from DNS and BGP to CDNs and ISPs, that every digital experience depends on. It’s separate from your application stack. It’s different for every user, in every geography. And most importantly, it still impacts your users—even if you don’t directly own it.

Monitor Windows Certificate Store with Datadog

The Windows Certificate Store is a critical component of any modern Windows environment. Certificates enable TLS encryption for Internet Information Services (IIS)-hosted applications, support certificate-based authentication in Active Directory, and help validate the identity of trusted Windows services. But if a certificate in your store expires, is revoked, or is part of a broken certificate chain, you risk instability and security gaps in your Windows environment.

Query Builder v5 - Two Years of Technical Debt, 80 Closed Issues, and a Fundamental Rethinking

In 2022, we had three different query interfaces. Logs had a custom search syntax with no autocomplete. Traces only had predefined filters - no query builder at all. Metrics had a raw PromQL input box where you'd paste queries from somewhere else and hope they worked. Each system spoke a different language. An engineer debugging a production issue had to context-switch not just between data types, but between entirely different mental models of how to query data.

Visually identify observability gaps with Cloudcraft in Datadog

Modern cloud environments are highly complex and dynamic, with critical services relying on large numbers of ephemeral resources. Ensuring observability coverage across this landscape is essential for troubleshooting, maintaining reliability, optimizing performance, and enforcing security standards. But as environments grow more elaborate and their ownership more dispersed, tracking observability coverage becomes increasingly challenging.

Logs vs. Metrics: Why You Need Both for Observability

Picture this: Your dashboards are calm. CPU load is steady. Error rates are low. Everything looks fine. That is, until the alarms go off. Now what? Metrics tell you something’s wrong, but not what, where, or why. They reveal symptoms, not root causes, and in high-stakes environments, that’s only half the story. Say your API response times spike. Metrics raise the flag, but they don’t tell you if it’s a code deployment, a database hang, or a traffic surge.

Why Real-Time Network Monitoring Is Critical for Modern Business Resilience

Business operations are more interconnected and technology-driven than ever before. Networks form the backbone of communication, data exchange, and service delivery. A single failure can disrupt productivity, weaken customer trust, and even result in financial losses.

MetrixInsight Alerting Beyond Citrix Director/Monitor

MetrixInsight for Citrix VAD/DaaS surpasses Citrix Director/Monitor in many ways. Director/Monitor is useful for day-to-day visibility, but it is not a true enterprise monitoring platform. MetrixInsight adds much more, including in alerting and enterprise monitoring. It closes important gaps that can directly impact user experience when capacity or performance issues slip through unnoticed. One clear example is how operational VDA capacity is handled.

Subsea Cables Parted in Red Sea Again

This past weekend saw the latest round of submarine cable cuts to impact internet connectivity between Europe and Asia. And once again they took place in the Red Sea, an historic problem area for subsea cables. In this post, I review some of the impacts that we observed in both the loss of transit in affected countries as well as increased latencies between public cloud regions using Kentik’s Cloud Latency Map.

Why it's time to move beyond APM: Monitoring from the user's perspective

For years, organizations have relied on Application Performance Monitoring (APM) as the backbone of their observability strategy. The idea was simple: collect as many logs, metrics, and traces as possible, then sift through the data to uncover insights. But as applications have shifted to the cloud and become increasingly API-driven, that model has broken down.

Introducing Honeycomb Intelligence MCP Server - Now GA!

In the months since we launched our public beta, we’ve been hard at work making Honeycomb MCP more useful and capable for agents and human operators alike. Our goal with this project has been, from the start, to allow AI to engage in the same kind of investigatory loops that we guide users towards. Many of the new features are designed expressly with this in mind, the most exciting of which is BubbleUp, now available in.

Observability and Monitoring Governance (Part 1 of 4)

In contrast to the many flavors of governance used for IT, such as data governance, audit and compliance, and governance and security, IT monitoring governance lacks a definition in many organizations. This is true even as teams have decades of experience monitoring the health, performance, and availability of applications, infrastructures, networks, and user experience. Good monitoring governance “just sort of happens—naturally, organically.” Not exactly!

Custom OpenTelemetry Collectors: Build, Run, and Manage at Scale

I tried thinking back to the last time I read an actual tutorial that did not include a bunch of em dashes (—), semicolons, normal dashes, and an unnervingly large quantity of phrases like “XYZ-thing Alert” and “Exciting News!”. Well, hold on to your suspenders, folks, here we go again. Part 2 is up and it’s a controversial one.

Introducing Event iQ: Smarter Event Correlation in Splunk IT Service Intelligence (ITSI)

Every day, IT teams are flooded with alerts—thousands of messages about performance issues, service outages, or suspicious activity. With so many notifications, it’s easy to get overwhelmed, miss critical problems, or waste time chasing false alarms. Correlating related alerts into groups can help reduce the noise and make sense of everything, but setting up those correlations takes time, experience, and a lot of both system and historic knowledge.

The Answer to SRE Agent Failures: Context Engineering

AI agents for SREs were supposed to slash mean time to resolution and eliminate alert fatigue. Instead, most teams got expensive, unreliable tools that burn through tokens without delivering insights. But what if the problem isn't the AI models themselves? Recent benchmarking reveals the real bottleneck: context engineering. When we tested our context engineering approach against conventional methods, the results were dramatic. Scroll down for our benchmark results to see the full comparison.

4 Ways AppNeta Enhances Cost-Focused Cloud Planning

Enterprises are hemorrhaging their cloud budget: respondents in 49% of organizations estimate their cloud spending is wasted due to unchecked provisioning and lack of predictive cost governance. This cost inefficiency stems not just from financial blind spots, but also from operational gaps: poor visibility into network reliability and user experience. Real-time, end-to-end visibility is the foundation of cloud optimization.

Monitor the Health, Performance, and Security of Your AI Application Stack with AI Agent and AI Infrastructure Monitoring

At this year’s .conf25, we introduced an exciting new chapter in observability at Splunk — one that is unified, AI-powered, and agentic — to ensure ITOps and engineering teams are digitally resilient in the AI era.

DX UIM Hub Interconnectivity and the Benefits of Static Hubs

In DX Unified Infrastructure Management (DX UIM), there are multiple elements that need to work in harmony to achieve a high level of observability. Understanding the architecture of DX UIM can help you make configuration decisions that minimize resource consumption, without sacrificing the volume and granularity of observability data collected. In addition, using static hubs is a simple and particularly powerful option for specific situations.

Interactive Dashboards - Click Any Panel to Start Debugging

Your dashboard shows a latency spike. To investigate it, you copy the query, open logs in a new tab, paste and modify the query, lose your dashboard filters, and repeat for traces. By the time you find the issue, you have 15 tabs open. Starting today, you can click any panel and investigate right there. All your filters and variables carry over. No more tab juggling.

Measuring service response time and latency: How to perform a TCP check in Grafana Cloud Synthetic Monitoring

When your database stops accepting connections or your mail server becomes unreachable during business hours, the impact is immediate and costly. Fortunately, the right monitoring strategy can help you detect these TCP connection failures early on, and prevent them from impacting the user experience.
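As a rough illustration of what a TCP check measures, the sketch below times a plain TCP connection attempt using Python's standard `socket` module. It is not how Grafana Cloud Synthetic Monitoring implements its checks. The `tcp_check` function name and the `127.0.0.1:5432` target are assumptions made up for this example.

```python
import socket
import time


def tcp_check(host, port, timeout=3.0):
    """Try to open a TCP connection; return (reachable, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        reachable = False
    return reachable, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical target: a database listening on localhost:5432.
    ok, elapsed = tcp_check("127.0.0.1", 5432, timeout=1.0)
    print(f"reachable={ok} elapsed={elapsed * 1000:.1f} ms")
```

Run on a schedule, a check like this yields both an up/down signal and a connection-latency series that an alert could be built on.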

Honeycomb MCP Is Now In GA With Support for BubbleUp, Heatmaps, and Histograms

If you’ve been following my public journey with LLMs this year, it probably won’t surprise you to learn that this blog post is an announcement about the general availability of Honeycomb’s hosted MCP server. I want to share a few updates about what’s new in the GA release, discuss some interesting learnings from building it, and share examples of how we’re using MCP internally. First: if you're still in the dark about MCP and AI agents, go read the earlier blogs I linked.

Early Warning Signals now in Discord

We’re rolling out Early Warning Signals to yet another place your team works every day: Discord. With this release, nearly all of our chat integrations now deliver Early Warning Signals—bringing you proactive outage alerts no matter where your team collaborates. Already available by email, SMS, Slack, Microsoft Teams, Google Chat, and webhooks, Early Warning Signals are now live in Discord too—closing the gap and making sure your team is covered wherever you communicate.

How to Use AI for Operational Excellence

Organizations are under immense pressure to do more with less – streamline operations, reduce costs, all whilst improving both the outcomes of the business and their employees. For IT and end-user computing (EUC) professionals, this challenge is especially prevalent. Systems are becoming increasingly complex, the digital employee experience is now directly tied to customer satisfaction, and the role of technology teams extends much further than solely keeping the lights on.

Broadcom Recognized as a Leader: Engineering the Future of Service Orchestration

In our digitally transforming world, the pace of change is relentless. Businesses are tasked with managing increasingly complex hybrid environments, from core mainframes to dynamic cloud services. The pressure is on, not only to keep the lights on, but to innovate faster, deliver flawless services, and fuel business growth. In this high-stakes environment, service orchestration and automation platforms are no longer just a tool—they are the central nervous system of the modern enterprise.

Crash reporting for gaming consoles is now Generally Available

TL;DR: Error monitoring and crash reporting for all major gaming consoles is now generally available (plus, the v1.1 of our Unreal Engine SDK). Already convinced? Jump to the ‘What’s In The Release?’ section. Over a decade ago, a customer hacked Sentry into their PlayStation 3 games. Fast forward to today, Sentry now supports thousands of game developers across web, mobile, and desktop. The missing piece? Consoles. Developers asked for it. We built it.

Serverless Monitoring Made Simple: Challenges and Solutions with Atatus

Serverless computing has revolutionized the way applications are built and deployed by eliminating infrastructure management and enabling automatic scaling. However, the dynamic and distributed nature of serverless architectures presents unique monitoring challenges that can impact application performance and user experience.

How PHP Monitoring Handles Response Times?

Every millisecond matters when users interact with your PHP application. If a page lags or a request takes too long, most people will leave without a second thought. For DevOps teams, these slowdowns are frustrating because the root cause is rarely obvious. Developers are left combing through logs and traces, often realizing too late that poor response times are already hurting user trust and business outcomes. The pain point: slow PHP response times frustrate users and create hidden costs for teams.

Monitoring Claude Code Usage with OpenTelemetry and SigNoz

In this video, we’ll walk you through how to monitor Claude Code activity using OpenTelemetry and SigNoz. You’ll learn how to instrument your usage, capture telemetry data, and visualize it with SigNoz to get better insights into your system performance. Whether you’re exploring observability for AI workloads or looking for an open-source solution to monitor your LLM activity, this guide will help you get started.

Interactive Dashboards | SigNoz Launch Week 5.0 | Day 1

Interactive Dashboards eliminate the current workflow of opening new tabs and manually recreating queries every time you need to investigate a spike or anomaly. Click directly on any data point to drill down and explore. Built for developers who need to debug production issues efficiently, not juggle multiple tabs.

Observability Journey Panel - Dell x TekStream

Join Dell Technologies, TekStream Solutions, and Grafana Labs for a candid panel on scaling observability. Learn how enterprise teams scale observability, balance centralized vs. decentralized models, and accelerate adoption. The panel explores challenges with culture, governance, tool sprawl, and how AI is reshaping monitoring and incident response.

Managing access in Grafana: a single stack journey with teams, roles, and real-world patterns

When multiple teams use Grafana, it can start to feel a bit messy. Dashboards pile up, permissions become unclear, and teams accidentally overwrite each other’s work. To help you and your organization stay clear, collaborative, and secure, we recommend putting all users in a single Grafana Cloud stack and managing access with teams, roles, and folders. To illustrate this, I’ll share a hypothetical example of how you can put this into practice across three teams. Let’s dive in!

A practical guide to error handling in Go

When you first start coding in Go, you quickly learn how error handling in the language differs from error handling in languages such as Java, Python, JavaScript, or Ruby. In those languages, throwing an exception automatically generates a stack trace. Go, by contrast, provides no built-in error tracing to reveal an error’s origin.

Beyond Wearables: How Remote Patient Monitoring Shapes Care Delivery

The healthcare industry has entered an era where the boundaries between in-person visits and digital health interactions are becoming increasingly blurred. Remote Patient Monitoring (RPM) sits at the forefront of this transformation, enabling physicians to track, analyze, and respond to patient health data in real time. While wearables such as smartwatches laid the groundwork, RPM has moved beyond step counts and heart-rate alerts. It now serves as a critical pillar of value-based care, reshaping how medical professionals manage chronic conditions, enhance patient engagement, and improve clinical outcomes.

What Are Buckets in Elasticsearch? (Explained in 60 Seconds)

Overwhelmed by raw data? In this short video, we demonstrate how Elasticsearch utilizes buckets to group and organize data by time, value, region, or any other shared trait. Whether you're tracking error codes or hourly sales trends, buckets and nested aggregations help turn chaos into clarity. Additionally, discover how time-based bucketing enables you to spot patterns and zoom in on valuable insights quickly.
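To make the idea of buckets and nested aggregations concrete, here is a minimal sketch that emulates them in plain Python. It does not call Elasticsearch or use its API. Events are grouped into hourly buckets, then status codes are counted inside each bucket. The `bucket_by_hour` function and the sample events are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime


def bucket_by_hour(events):
    """Group events into hourly buckets, then count status codes in each."""
    buckets = defaultdict(Counter)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate to the hour so every event in that hour shares a key.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour.isoformat()][event["status"]] += 1
    return {hour: dict(counts) for hour, counts in sorted(buckets.items())}


# Made-up sample events, for illustration only.
events = [
    {"timestamp": "2025-01-01T10:05:00", "status": 200},
    {"timestamp": "2025-01-01T10:40:00", "status": 500},
    {"timestamp": "2025-01-01T11:15:00", "status": 500},
]
print(bucket_by_hour(events))
```

The outer key plays the role of a time-based bucket, and the inner counts play the role of a nested aggregation over a shared field.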

Synthetic Monitoring for Vibe Coded Apps: Why You Need It

Not all software is the product of rigid planning, extensive documentation, and carefully designed test pipelines. Some of it emerges in bursts of intuition, created by small teams or individuals who prioritize momentum over process. This is what many engineers call vibe coding: development driven by flow and creativity, where the goal is to get something working quickly rather than ensuring every edge case is accounted for.

How to Transform Telemetry Data with the OpenTelemetry Transformation Language

This demonstration shows how to use the OpenTelemetry Transformation Language (OTTL) to transform, filter, and enrich telemetry in the OpenTelemetry Collector without changing application code. We walk through a sample Python application and OpenTelemetry configuration file, generate real traffic, and then analyze the results in Splunk Observability Cloud.

Kubernetes Monitoring Metrics That Improve Cluster Reliability

A Kubernetes cluster can generate more than 1,400 metrics out of the box. That’s a lot of numbers to sift through, especially when you’re troubleshooting a production slowdown in the middle of the night. The key is knowing which metrics tell you the most, with the least noise. These are the signals worth paying attention to when you need answers fast.

Understanding dbt: basics and best practices

Data Build Tool (dbt) is an open source analytics engineering framework that enables teams to transform raw data that has been loaded into a warehouse like Snowflake, BigQuery, Redshift, or Databricks using SQL-based workflows. dbt is available in two main forms: dbt Core, the free and open source CLI tool, and dbt Cloud, a managed platform that adds scheduling, UI support, collaboration tools, and native integrations.

How to Improve MariaDB Performance: Track Slow Queries with Logs and Metrics

Database latency rarely starts in your app layer because it’s almost always a query doing more work than it should. Metrics tell you when that happens, but slow-query logging tells you which statement did it and how. That’s gold for tracking down missing indexes, inefficient filters, or accidental full scans. Pair the logging with some lightweight counter metrics, and you get both an early warning and a clear path to a fix.

Empowering an MCP server with a telemetry pipeline

This blog was authored by Jason Bloomberg, Managing Director, Intellyx BV. Observability depends upon telemetry – the data streaming from various applications, services, and systems that indicate their internal state in real-time. Various tools consume such telemetry to enable both operational and cybersecurity tasks.

Tiger teams: How we tackle urgent, cross-functional challenges at Grafana Labs

A year ago, we hit a wall. Our Grafana OSS releases were excruciating to execute. The process was confusing and hard to follow, security patches were non-trivial, and many engineering hours were lost to an overly manual process. We needed to move fast, cut through ambiguity, and pull in just the right people without waiting on roadmaps or org charts.

Landing Page Monitoring: Why, When and How to Do It Right

Landing pages are the lifeblood of modern marketing campaigns. They’re not the homepage, not the product catalog, not the blog—they’re the sharp end of the funnel where traffic from ads, emails, and social clicks is supposed to turn into revenue. A landing page is where a $50,000 media buy either pays off or evaporates.

How Teams Are Using AI to Tackle Observability Challenges (2025 Survey Insights) | Grafana Labs

In Grafana’s 3rd annual Observability Survey, over 1,000 engineers and leaders shared their challenges — tool sprawl, complexity, rising costs, and nonstop alerts — and their hopes for how AI can help.

Introducing the StatusPage.io Import Tool: Migrate Your Incident History to Hyperping in Minutes

Switching status page providers shouldn't mean losing years of valuable incident history. Your service timeline tells the story of your reliability journey—outages you've overcome, maintenance windows you've scheduled, and the trust you've built with transparent communication. Yet most migrations force you to choose: start fresh with a clean slate or manually recreate years of historical data.

What If You Could Roll Back Any Network Change in Seconds?

If you’ve worked in network operations, this scenario is all too familiar. Even the most seasoned teams and robust processes can’t escape reality: changes fail, misconfigurations happen, and the fallout is real–lost productivity, unhappy customers, compliance headaches, and hours (or days) of cleanup. But what if it didn’t have to be that way?

Database Performance Analyzer Overview

The cross-platform solution for performance monitoring for both cloud and on-premises databases. Anomaly detection powered by machine learning combined with forensic-level wait-time analysis gives you the power to diagnose performance issues in a matter of minutes, not days. Both real-time and historical data give you down-to-the-second answers to resolve critical problems, while expert advice via query and table tuning advisors allows you to proactively optimize your enterprise.

Getting Started with SolarWinds Network Topology Mapper

The video provides a quick walkthrough of the SolarWinds Network Topology Mapper, starting with the welcome screen that prompts users to run a new scan. It guides viewers through a wizard for adding SNMP credentials, emphasizing the importance of specifying IP addresses for accurate mapping. The tool can identify unmanaged switches and allows for both one-time scans and scheduled updates, maintaining a clean map by archiving previous scans. Users can hover over connections to view port numbers, although serial numbers remain unavailable. The presenter encourages viewers to reach out with any questions.

The Next Evolution of AI: Forget Smarter Models - It's All About the Data

It’s been a noisy summer in the AI world. Headlines have been filled with doom and gloom: For example, OpenAI’s ChatGPT-5 landing with a thud, and an MIT report claiming 95% of AI pilots are failing. For the sceptics, this is “proof” that AI is just hype. I don’t buy it. The MIT study looked at just 50 projects, a sample size so small you’d fail a basic stats exam for using it. And as someone who uses AI every single day, I can tell you the benefits are real.

SvelteKit observability just got 10x better, and we're here for it

The Svelte Team recently announced full observability and tracing support for SvelteKit! This is great news for SvelteKit and Sentry users, since Sentry is already compatible with the new feature! In addition, this is even greater news for the JavaScript ecosystem as a whole because SvelteKit just became the first ESM-based meta-framework to support instrumentation and tracing out of the box.

The Public Internet Is Not Your WAN

Within many organizations, there’s been a strategic imperative to abandon MPLS in favor of SD-WAN and direct internet access, particularly when it comes to branch office connectivity. The benefits of this move are undeniable and compelling. Organizations can establish direct cloud connectivity and realize cost savings and improved agility.

What Are Vector Embeddings? (Explained in 2 Minutes)

In under 2 minutes, we explain what vector embeddings are, how they work, and how to use them in real-world applications like text expansion. We'll also show how Elasticsearch supports vector search with two powerful models: E5, open-source text embedding models designed for multilingual search, and ELSER, a sparse embeddings model from Elastic.

Full Session Simulation - Simulate Anything, Everything, Anywhere

Full Session Simulation is a powerful troubleshooting strategy. Have you ever been in a situation where everything on your dashboards looks green, but users are still encountering issues and raising support tickets? The clichéd “everything is fine on our side” moment is not just frustrating for everyone. It’s risky! Because when you can’t replicate what the user is experiencing, you’re flying blind.

What is Infrastructure Monitoring? How it Works, Key Metrics & Use Cases

Infrastructure monitoring is the process of continuously collecting, analyzing, and visualizing data from an organization’s IT infrastructure. With infrastructure monitoring, DevOps teams can maintain system health, meet SLAs, reduce downtime, and detect and resolve issues proactively. This ensures optimal performance, availability, and reliability. Key network components infrastructure monitoring typically covers.

Sharpening My React Hooks Knowledge With ChatGPT

I’m a product engineer at Honeycomb. While my work spans the stack, I’m currently focused on deepening my frontend expertise. To support this, I’ve been using ChatGPT as a study assistant. It’s helped me break down complex topics with clear explanations, real-world examples, and—critically—interactive practice. The most effective formats I’ve found.

What's new in the Infinity data source for Grafana: support for JQ parser, additional HTTP methods, and more

Since its launch in 2020, the Infinity data source for Grafana has become the go-to solution to seamlessly query and visualize data from JSON, CSV, XML, and GraphQL endpoints within Grafana. Allowing users to integrate diverse data formats via HTTP-based APIs, the Infinity data source has enabled a wide range of use cases within our community over the years — from visualizing cloud computing costs to popular Pokémon games.

Visually identify and prioritize security risks using Cloudcraft

As cloud infrastructure becomes more dynamic and distributed, DevOps and security teams need to quickly detect risks and understand their context: where those risks live, how critical they are, and how to respond effectively. By surfacing misconfigurations, vulnerabilities, sensitive data risks, and identity threats directly on a real-time diagram of your infrastructure, Cloudcraft helps teams identify, prioritize, and remediate security issues at scale.

Transform your public sector organization with embedded GenAI from Elastic on AWS

Elastic is proud to be featured in the new AWS Generative AI Content Hub for public sector, a destination showcasing the most impactful ways agencies can securely adopt and scale generative AI (GenAI).

When metrics mislead: Inside the 2025 Retail Web Performance Benchmark

Over the past few years at Catchpoint, we’ve benchmarked the digital performance of banks, airlines, hotels, travel aggregators, GenAI platforms, athletic footwear brands, and even ad hoc events like the Super Bowl, Olympics, and Election Day. Each time, our approach focused on the technical metrics performance professionals live and breathe: DNS resolution times, Time to First Byte, page load speeds, and six other core measurements that we'd dissect, analyze, and use to rank companies.

The Role of Service Maps in Optimizing PHP Application Performance

Modern PHP applications rarely exist in isolation. They run across distributed environments, connect to MySQL or PostgreSQL databases, interact with Redis or Memcached, rely on APIs, and communicate with microservices. This interconnected web brings power but also enormous complexity. When performance issues arise, finding the root cause can feel like searching for a needle in a haystack. Is it the database? A caching layer? A failing third-party API?

How to Reduce Serverless Costs with Smart Monitoring

Serverless architecture has changed how applications are built and run. It removes the need to manage servers, letting developers focus on writing code while automatically scaling with demand. But even with its pay-as-you-go model, serverless apps can get expensive if not monitored and optimized. In this blog, let's see how smart serverless monitoring helps developers and DevOps engineers lower serverless costs, boost performance, and keep operations running smoothly.

The Fourth Pillar of Observability

Your application is only as reliable as the infrastructure it runs on. Most commonly, that means Kubernetes is doing the job by managing fleets of containers, scaling services on demand, and keeping workloads distributed across nodes. Traditional dashboards weren’t built to scale with this reality. They give you snapshots of raw metrics. They don’t scale to multi-cluster environments. They don’t map relationships between resources.

Introducing Kentik Traffic Costs: Real-Time Network Cost Intelligence

Introducing Kentik Traffic Costs, an industry-first automated workflow delivering instant cost estimates for network traffic slices. Learn how this exciting new feature gives network, financial, and sales teams actionable insights to optimize spend, improve margins, and drive revenue.

Bringing Observability to Claude Code: OpenTelemetry in Action

AI coding assistants like Claude Code are becoming core parts of modern development workflows. But as with any powerful tool, the question quickly arises: how do we measure and monitor its usage? Without proper visibility, it’s hard to understand adoption, performance, and the real value Claude brings to engineering teams. For leaders and platform engineers, that lack of observability can mean flying blind when it comes to understanding ROI, productivity gains, or system reliability.

Top 3 Jira reporting tools: SquaredUp vs Power BI vs Jira

A recent survey revealed that developers and engineering teams waste 8+ hours a week on inefficiencies in their role. Poor reporting tools are a main contributor, with Jira being regarded as a frequent source of friction. But since Jira is so deeply embedded in most organizations' infrastructure and processes, replacing it is not really an option. Rather, the solution lies in optimizing how users interact with it rather than abandoning it altogether.

Cutting through Kubernetes Complexity with Lumigo

Effectively monitoring Kubernetes environments remains one of the most challenging aspects of modern application management. As applications grow more complex and distributed, the need for comprehensive visibility becomes paramount. We have continued to deliver major advancements in our Kubernetes monitoring, providing you with deeper insights and more powerful tools to tackle these challenges head-on.

The New Physics of IT: Service-Centric Observability, AI-Driven Operations, and Intelligent Automation

Why the traditional model of monitoring and manual operations is collapsing, and what enterprises must do to survive. The digital universe is expanding at a pace no enterprise can keep up with through traditional methods. Dependencies pull at each other in ways even experts can’t predict. What once could be managed with dashboards and siloed monitoring tools has become too vast, too interdependent, and too fast-moving; a new operating model is needed to master such complexity.

Why database monitoring is critical for application performance

When an application slows down, users rarely think about the database, but in many cases that’s where the bottleneck lies. Databases sit at the core of nearly every application, storing, retrieving, and processing the information that powers business transactions, analytics, and user interactions. A minor inefficiency in query execution or a spike in resource usage can cascade into multiple issues, from degraded application performance to service interruptions or even downtime.

This Month in Datadog - August 2025

In the August episode of This Month in Datadog, Jeremy shares how you can make more informed cloud cost decisions, gain insights into your LiteLLM-powered applications, and secure Kubernetes infrastructure with Datadog Workload Protection. Later in the episode, Danny puts the spotlight on Datadog Kubernetes Autoscaling, which helps you deliver cost savings without sacrificing performance.

Weaving AppNeta Experience Insights into DX NetOps: A Step-by-Step Guide

Today’s enterprise networks aren’t constrained to a single location—they span continents, clouds, and providers, and they’re relied upon by users who can work from anywhere. For network operations teams, that means every issue is a potential scavenger hunt. Is it the app? The WAN? The cloud provider? The ISP? The stakes are high and your tools need to evolve. That’s why the integration of DX NetOps and AppNeta is such a game-changer.

AWS metric ingestion for less: Save money and get near real-time stream into Grafana Cloud

There’s a new way to ingest AWS metrics into Grafana Cloud that makes observing your AWS resources more cost-effective, easier to operate, and more accurate. You can now stream metrics into the AWS Observability app in Grafana Cloud in near real-time thanks to our new integration with Amazon CloudWatch and Amazon Data Firehose. We’re already using it internally, and we’re finding that it’s not only easier to operate—it’s at least five times more cost-effective.

Visualize Logs Alongside Metrics: Complete Observability for Slow MongoDB Operations

MongoDB’s strength of flexible schema and fast iteration can also hide costly queries until they surface as user-facing latency, replica lag, or spiky CPU. A handful of slow operations can impact the cache, starve other workloads, and cascade into timeouts across services. Monitoring slow queries gives you an early warning system for index gaps and query-plan regressions introduced by code deploys, schema changes, or shifting data shapes.

The hidden costs of shadow AI: CPU drain, data risk, and network bottlenecks

The risk of headline-grabbing incidents, like Samsung’s ChatGPT data leak, related to AI usage outside of the authorization and control of IT (a.k.a. shadow AI) is clear. Most IT teams recognize that a high-profile incident can have serious repercussions. However, the risk of shadow AI goes well beyond the risk of a single incident. In fact, the recent Komprise IT Survey indicates that 79% of organizations have experienced negative outcomes from sending corporate data to AI.

Dashboards say green. Users say It's broken.

Your infrastructure metrics are all green. The code is clean. But support tickets are rolling in. What’s going on? The problem: traditional monitoring tools stop at your infrastructure. They don’t tell you if the user can actually complete their task. As @gerardo explains, the objective of a car is not to have the correct tire pressure or gas levels... it’s to get from point A to point B. User experience works the same way. What’s the point of having green metrics when your users are not experiencing the same thing?

Kentik Traffic Costs Workflow Demo

Learn how Kentik's automated traffic cost workflow provides instant visibility into network traffic costs, enabling you to optimize spend, improve margins, and make smarter business decisions. In this demo, you'll see practical examples like evaluating costs by AS group and downstream customer, helping network, finance, and commercial teams take immediate, actionable steps to reduce costs and boost efficiency.

Bridging the Gap: Legacy Systems and Modern Observability

Technology moves quickly and while the spotlight has shifted to dynamic, cloud-based systems, many organizations have legacy applications and infrastructure that they must maintain. In this fireside chat, Datadog’s Matt Moore (Principal Observability Strategist) will host James Flores (Enterprise Systems Engineer) at Australian Community Media to discuss their journey of modernization and bridging legacy systems with the cloud using a bit of ingenuity and observability.

Logs are Generally Available (Still logs, just finally useful)

When we started building Logs in Sentry we had one goal: make them useful for real debugging, not just another high-volume text store. This meant making them "trace connected" from day one, so they are tightly connected to the actions and performance happening in your application, right where developers already go to investigate errors, performance, and latency issues. Now, Logs is out of beta and generally available to everyone.

Cost Controls and so Much More: Issue Detection Through Usage Analysis

Keeping tabs on cloud spending across multiple organizations and vendors, including Datadog, can be tough and costly. If you're not tracking expenses, you're also missing other critical insights. The Flight Centre Travel Group (FCTG) faced this when moving to Datadog, needing to monitor costs across numerous organizations and over 180 Azure subscriptions. After a rapid migration, new cost reports quickly revealed more than just financial benefits. Unusual spending patterns often highlighted incidents, bugs, or security issues, offering early warnings about internal system problems.

What is APM Tracing?

APM tracing records the complete execution path of a request as it travels through your system, including database queries, external API calls, cache lookups, message queue events, and inter-service requests. Each step is captured with precise start and end timestamps, duration, and context such as service name, operation name, and relevant attributes. This lets you pinpoint where latency or errors originate without piecing together metrics and logs manually.
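As a rough illustration of the idea above, a trace can be modeled as a list of timed spans, and the slowest step found by comparing durations. This is a minimal sketch, not any vendor's data model: the `Span` class, its field names, and the sample services and timings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step in a request's execution path (hypothetical model)."""
    service: str
    operation: str
    start_ms: float
    end_ms: float
    parent: Optional[str] = None          # operation name of the parent span, None for the root
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def slowest_step(spans):
    """Return the longest-running non-root span in a trace."""
    children = [s for s in spans if s.parent is not None]
    return max(children, key=lambda s: s.duration_ms)


# Illustrative trace: one request fanning out to a cache and a database.
trace = [
    Span("checkout-api", "POST /orders", 0.0, 180.0),
    Span("cache", "GET cart", 2.0, 6.0, parent="POST /orders"),
    Span("orders-db", "INSERT orders", 10.0, 150.0, parent="POST /orders",
         attributes={"db.system": "postgresql"}),
]

worst = slowest_step(trace)
print(f"{worst.service} / {worst.operation}: {worst.duration_ms} ms")
```

With the sample data, the database insert accounts for 140 ms of the 180 ms request, which is the kind of answer a trace view surfaces without manual correlation of metrics and logs.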

August Early Warning Signals: detected before providers

In August, StatusGator’s Early Warning Signals detected hundreds of global service outages before official provider acknowledgments were published. Our alerts notified users early on—often minutes before providers confirmed issues—giving IT teams the critical lead time to respond. Below, we highlight three of the most significant outages we tracked in August, followed by a curated selection of other notable disruptions.

Vendor consolidation-the key to IT cost optimization in 2026

IT departments are no strangers to complexity. With businesses navigating a range of cloud services, cybersecurity tools, and automation technologies, the modern IT ecosystem can resemble a medley of vendors, software, and services sewn together. While these solutions aim to improve performance, the sheer volume of suppliers can muddy the waters when it comes to efficiency and cost management.

Database monitoring for beginners

Understand what's happening inside your database before your users do. Modern applications live and breathe through their databases. But when slow queries, connection spikes, or failed transactions start to pile up, the impact isn't just technical; it's customer-facing. That's why database monitoring matters: it gives you visibility into how your databases are performing under the hood.

Actionable insights into the end-user experience: an overview of Grafana Cloud Frontend Observability dashboards

One of the biggest challenges in frontend development is identifying when and why users encounter performance issues, whether it’s slow page loads, JavaScript errors, or failed HTTP requests. With Grafana Cloud Frontend Observability — a hosted service for real user monitoring (RUM) — you get immediate, clear, and actionable insights into the end-user experience of your web applications.

Serverless Monitoring: Essential Metrics Every Developer Should Track

Serverless applications have become one of the most efficient ways to build and deploy software. With platforms like AWS Lambda, Azure Functions, and Google Cloud Functions, teams can focus on writing code while the provider handles infrastructure, scaling, and availability. But going serverless doesn’t mean monitoring stops being important. In fact, monitoring becomes even more critical because you don’t have direct control over the servers, containers, or VMs.

Azure Data Factory Monitoring Integration

Microsoft Azure Data Factory is a cloud-based data integration service provided by Microsoft Azure. It enables you to create, manage, and automate data workflows that move and transform data from different sources to various destinations. Essentially, ADF allows you to design, orchestrate, and manage data pipelines, making it easier to work with large volumes of data across on-premises and cloud environments.

Netdata AI Troubleshooting is Now Generally Available with On-Demand Credits

Since launching our AI investigations and insights in a research preview, one thing has become clear: automated root cause analysis delivers a significant return on investment. Teams have confirmed that instant insights don’t just save a few minutes; they fundamentally shorten incident response cycles, free up valuable engineering hours, and reduce the business impact of downtime.

The Debugging Bottleneck: A Manual Log-Sifting Expedition

Imagine a developer at a fast-growing company. A customer support agent reports a critical issue: a user's recent order is stuck in a "pending" state. The agent provides a customer ID and a request ID. The developer's typical process is a familiar, painful dance: This process is slow, tedious, and prone to human error. The Mean Time to Resolution (MTTR) is measured in hours, not minutes, and it's a huge drain on engineering resources.

Your Network Disaster Recovery Plan is Only as Good as its Execution

A disaster recovery plan (DRP) is the strategic backbone of your organization’s resilience. It defines your objectives, outlines responsibilities, and sets the critical promise you make to the business: your recovery time objective (RTO). This plan is indispensable. However, a strategy is worthless without the tactical ability to implement it.

Everything You Ever Wanted to Know About DEXOps (But Were Afraid to Ask)

Reality Bytes is back! In this episode, Tim, Tom, Megan, and Sean dive deep into DEXOps—the practice of operationalizing Digital Employee Experience. Building on insights from the show's recent webinar series, the team explores how IT leaders can shift from reactive firefighting to a proactive, structured approach that drives measurable business value. They cover the key pillars of DEXOps—from people development and process rigor to technology selection, communication strategies, and leadership alignment. You’ll hear why DEXOps isn’t a side project or “hobby,” but a mission-critical discipline, as essential as security or uptime.

What is Single Pane of Glass Monitoring and How Can Enterprises Leverage It for Enhanced Visibility?

Large enterprises today grapple with increasingly complex IT environments - spanning multiple cloud services, hybrid infrastructures and countless applications. Exacerbated by technology silos, the sheer volume of data generated in such environments can quickly overwhelm IT teams, impairing their ability to identify and respond to customer-impacting issues before outages strike.

Top 10 Serverless Monitoring Tools in 2025

Monitoring serverless applications is critical to ensure optimal performance, reduce errors, and maintain end-to-end observability. Choosing the right serverless monitoring tools can help track serverless performance metrics, cold starts, and distributed traces across cloud functions. Below, we explore the top 10 cloud-native and third-party serverless monitoring solutions, highlighting their features, pros, cons, and best use cases.

Technical Blog: Remote Debugging for RTOS Firmware: How Continuous Observability Changes the Game

Debugging embedded software has never been easy, but today’s systems are more complex and interconnected than ever. Real-time operating systems (RTOS) and continuous integration pipelines can make development faster—but certain classes of bugs are hard to reproduce and diagnose. These elusive issues often appear only under rare conditions, such as timing-sensitive race conditions or field-only failures. This is where Continuous Observability, powered by Percepio Detect, changes the game.

A Single Hub for Telemetry: OpenTelemetry Gateway

The OpenTelemetry Gateway (OTel Gateway) is a centralized service that collects, processes, and routes telemetry data—metrics, traces, and logs—across your infrastructure. In a typical setup, each service pushes telemetry directly to an observability backend. While this approach works well for small environments, it becomes increasingly difficult to manage as systems grow.

kubectl logs: How to View & Tail Kubernetes Pod Logs

When debugging containerized applications in Kubernetes, kubectl logs serves as your primary command-line tool for accessing container logs directly. Understanding how to effectively retrieve, filter, and analyze logs becomes essential for maintaining application health and resolving issues quickly, especially in multi-container environments where correlation across services can make or break your troubleshooting efforts.
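For readers who script their log retrieval, the sketch below assembles a `kubectl logs` command line from a few common options. The helper name `kubectl_logs_args` and the pod, namespace, and container names are made up for illustration; the flags themselves (`--namespace`, `--container`, `--follow`, `--tail`, `--since`, `--previous`) are standard `kubectl logs` options.

```python
def kubectl_logs_args(pod, namespace=None, container=None,
                      follow=False, tail=None, since=None, previous=False):
    """Build the argument list for a `kubectl logs` call (hypothetical helper)."""
    args = ["kubectl", "logs", pod]
    if namespace:
        args += ["--namespace", namespace]
    if container:
        # Needed when a pod runs more than one container.
        args += ["--container", container]
    if follow:
        args.append("--follow")           # stream new log lines as they arrive
    if tail is not None:
        args.append(f"--tail={tail}")     # only the last N lines
    if since:
        args.append(f"--since={since}")   # relative duration such as 15m or 1h
    if previous:
        args.append("--previous")         # logs from the previous container instance
    return args


# Hypothetical pod and container names.
cmd = kubectl_logs_args("web-7d9f", namespace="prod", container="app",
                        follow=True, tail=100)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is; building it as a list rather than a string avoids shell quoting issues.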

The Essential Guide to Azure Infrastructure, Monitoring, and Management Tools

Master Azure infrastructure management with this comprehensive guide. Learn the four critical pillars—governance, cost control, security, and operations—and discover the essential native and third-party tools needed to scale your cloud strategy effectively.

Ecommerce Security Incidents: Stripe, Pandora, and OpenCart

Cyberattacks against ecommerce businesses are accelerating, and recent incidents show just how many different angles attackers are exploiting. Whether it’s phishing campaigns, third-party data breaches, or malware injections, ecommerce stores are a prime target. Here are three recent incidents making headlines, and what they mean for ecommerce operators.

AIOps Is Consolidating Fast, Here's Where HEAL Delivers Results

As of September 2025, the Artificial Intelligence for IT Operations (AIOps) market is a rapidly expanding and dynamic sector, projected to surpass $20 billion. The landscape is defined by a major consolidation trend, with large enterprise technology vendors acquiring key AIOps capabilities to integrate into their broader portfolios.

How to Reduce Errors and Improve Reliability in High-Traffic Node.js Applications with APM?

Node.js has become the go-to runtime for building modern, high-performance applications. Its event-driven, non-blocking I/O model makes it particularly well-suited for apps that demand speed and scalability, such as real-time chats, gaming backends, streaming platforms, fintech dashboards, and e-commerce systems. It’s no surprise that some of the world’s largest companies, such as Netflix, PayPal, LinkedIn, and Walmart, rely on Node.js to deliver services at scale.