Operations | Monitoring | ITSM | DevOps | Cloud

Complete Guide to Redis Monitoring: Essential Metrics, Tools & Best Practices 2025

Redis is a powerful tool, but its position in the critical path of applications means that performance issues can have a widespread impact. Whether you use Redis as a cache, session store, or primary database, effective monitoring is essential to prevent slowdowns and ensure a responsive user experience. This guide provides a comprehensive walkthrough of Redis monitoring, covering the essential metrics you need to track, the tools available to you, and the best practices to adopt in 2025.

Cloudflare's DNS Downtime: Why BGP Hijacks Were Never to Blame

On July 14, Cloudflare’s popular public DNS service (known as 1.1.1.1) suffered an outage lasting over two hours. As rumors swirled about the cause, we were the first to push back on the theory that a BGP hijack had caused the outage. In fact, the hijack was actually a consequence. How did we know this so early when other internet watchers did not? We’ll discuss in this post.

This Month in Datadog - July 2025

In July’s episode of This Month in Datadog, we’re doing things differently by spotlighting the people behind the products you rely on. Jeremy is joined by Tristan Ratchford to discuss saving time and effort when you’re on call with Bits AI SRE, and by Kevin Hu to explore gaining visibility into datasets across the entire data lifecycle with Data Observability.

Streamlining the Complexity of SD-WAN Deployments With DX NetOps Topology

If you're feeling like your network operations just keep getting more complicated, you're not wrong. One of the core promises of cloud models was improved simplicity. However, the ensuing reality for your network operations teams has been anything but simple. Suddenly, users and applications are everywhere. Traditional, on-premises equipment now coexists with software-defined wide area networks (SD-WANs), cloud-hosted resources, and hybrid connections that hop across public and private networks.

With AI, You're Gonna Have to Manage Your (Massive) Energy Use in SPM

Forget boring spreadsheets. Strategic portfolio management (SPM) isn't just about ticking boxes. It’s the big boss plan that makes sure every penny spent and every project your company starts points towards the main goal. It's your company's smart GPS, guiding you through the AI energy maze. When it comes to AI's power hunger, SPM is a knight in shining armor. It helps leaders get smart, making sure they grab all the fancy tech without trashing the world.

Smarter debugging with Sentry MCP and Cursor

Debugging a production issue with Cursor? Your workflow probably looks like this: Alt-Tab to Sentry, copy error details, switch back to your IDE, paste into Cursor. By the time you’ve context-switched three times, you’ve lost your flow and you’re looking at generic suggestions that don’t show any understanding of your actual production environment or codebase.

Preparing for Infoblox NetMRI End-of-Life: Why Restorepoint is the Ideal Replacement

When a trusted tool like NetMRI reaches its sunset date, it opens the door to modern alternatives that offer more automation, broader integration, and a lower total cost of ownership. You’ve invested time, training, and trust into this solution, and while it may feel like the rug is being pulled out, this is an opportunity to improve how your organization handles network configuration and change management.

OpenTelemetry Distributed Tracing Implementation Guide

Distributed tracing has become essential for understanding the performance and behavior of modern microservices architectures. As applications become more complex with multiple services communicating across different environments, traditional logging and metrics alone are insufficient for debugging performance issues and understanding request flows.

Site24x7 partners with BigPanda agentic IT operations platform to further streamline IT operations

In modern IT management, downtime, performance issues, and alert overload cripple teams, delay resolutions, and frustrate users—a problem solvable with automation and deep integrations that create smoother flow across systems.

Semantic Caching: What We Measured, Why It Matters

Semantic caching promises to make AI systems faster and cheaper by reducing duplicate calls to large language models (LLMs). But what happens when it doesn’t work as expected? We built a test environment to find out. Through a caching system, we evaluated how semantically similar queries would behave. When the cache worked, response times were fast. When it didn’t, things got expensive. In fact, a single semantic cache miss increased latency by more than 2.5x.

Out-of-the-box Alerting for Frontend Observability in Grafana Cloud

Get alerted on frontend issues the moment they happen, with no setup headaches required. In this short demo, Elliot Kirk from Grafana Labs introduces out-of-the-box alerting for frontend observability. Whether you're tracking error counts or web vitals, this new feature makes it easy to stay ahead of performance issues. With just a few clicks, you can: enable prebuilt alerts for your apps; visualize and edit alerts directly in the UI; customize thresholds and durations; set up notifications and stay in the loop; and launch alerting with every new app setup.

Kentik Cause Analysis in 60 Seconds

In a world where network traffic can suddenly spike, manually sifting through flow data is often a daunting task. Kentik AI's new Cause Analysis simplifies troubleshooting by quickly identifying changes in traffic by application, IP, ASN, or service. With just a few clicks, Cause Analysis helps you compare time periods, understand traffic shifts, and detect changes in your network. Kentik: Take the hard work out of running your network.

SentinelOne outage: July 10 incident went unacknowledged

On July 10, 2025, SentinelOne, a leading cybersecurity platform, experienced a widespread outage that disrupted access to its admin consoles across multiple regions. The incident impacted users in Europe, North America, and beyond, preventing security teams from accessing critical management features. Despite the scale of the disruption, no official public acknowledgment or status update was issued by SentinelOne.

Google Workspace outage: July 18, 2025

Google Workspace went down again in July 2025—but if you had asked AI tools like Google’s own AI Overviews, ChatGPT, or Claude, you would have been told everything was fine. Every one of these tools incorrectly claimed that services were up and running while users across the globe were unable to connect, send messages, or even log in.

Vector Databases Explained: What they are & Why they Matter [Quick Question Ep. 2]

Ever wondered what a vector database is and why it’s becoming so important in AI search? In this quick video, I’ll break down what a vector database is, how it works, and what you should consider when choosing one. About Elastic: Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform, the development platform used by thousands of companies, including more than 50% of the Fortune 500.

Confessions of a CTO: How we Tamed our Cloud Costs

If you’ve ever found yourself staring at a cloud bill that could buy a small island or at least a very nice car, you're not alone. Believe me, at Cribl, we've had our share of those "molotov cocktail" bills that make our CFO, Zach, look like he's about to spontaneously combust. And yeah, a few F-bombs might have dropped from various senior leaders (myself included, I won't lie).

Building a bulletproof network disaster recovery plan

Imagine it’s 2am. A core switch fries because of a sudden power surge. Most of your users wake up to a blank screen. Your team scrambles: Where’s the backup configuration? Who knows the last working state? Hours pass, productivity tanks, support calls flood in, and costs stack up by the minute. This isn’t a theoretical horror story. According to Gartner, the average cost of network downtime still hovers around $5,600 per minute, or over $300,000 per hour.

From Anomaly to Action: ScienceLogic's Role in Accelerating Zero Trust Response

In today’s threat landscape, cyber incidents unfold in seconds, not days. Federal agencies and critical infrastructure operators no longer have the luxury of slow detection or manual triage. As Zero Trust Architecture (ZTA) becomes the new security standard, one principle stands above all: time is risk. The faster an organization can detect, diagnose, and respond to anomalous activity, the greater its resilience. ScienceLogic plays a critical role in making that speed possible.

The Network Impact on Job Completion Time in AI Model Training

In large-scale AI model training, network performance is no longer a supporting actor — it’s center stage. Job Completion Time (JCT), the key metric for measuring training efficiency, is heavily influenced by the network interconnecting thousands of GPUs. In this post, learn why JCT matters, how microbursts and GPU synchronization delays inflate it, and how platforms like Kentik give network engineers the visibility and intelligence they need to keep training jobs on schedule.

Datadog Disaster Recovery mitigates cloud provider outages

A loss in infrastructure and applications observability can leave SRE and DevOps teams without insight into the real-time state of their production systems, causing them to temporarily pause code deployments and limit their ability to troubleshoot issues or respond to critical alerts. In modern cloud environments, where services are distributed and deeply interconnected, this lack of visibility can escalate quickly.

Bring high-performance observability to secure Kubernetes environments with Datadog's new CSI driver

In Kubernetes environments, applications often communicate with the Datadog Agent to send telemetry data such as custom metrics via DogStatsD or traces through Datadog APM. How this communication takes place depends on the communication mode set on the Datadog Cluster Agent's Admission Controller. With the sockets option, communication takes place through local inter-process communication via Unix domain sockets (UDS), whereas the service and default hostip options rely on network communication.

Azure native integration elevates Elastic Cloud Serverless experience

We're thrilled to announce a significant leap forward in making Elastic Cloud Serverless even more accessible and powerful for Azure users. With the general availability (GA) of Elastic Cloud Serverless on Azure, we've just released the Azure native integration for Elastic Cloud Serverless. This builds upon our existing Azure native integration for Elastic Cloud Hosted, allowing users to seamlessly discover and manage Elastic Cloud in a way that feels inherently part of the Azure ecosystem.

Building an Incident Response Playbook: Templates and Examples

An incident response playbook is your team's emergency manual when things go wrong. It's a documented set of procedures that guides your team through detecting, responding to, and resolving incidents efficiently. Without one, teams often scramble during outages, make inconsistent decisions, and take longer to restore service.

What's New in InfluxDB 3.3: Managed Plugins, Explorer Updates, and More

InfluxDB 3.3 is now available for both Core and Enterprise. It introduces new managed plugins for the Processing Engine, making it easier to address common time series tasks with just a plugin. On top of that, 3.3 includes a wide range of performance improvements, feature updates, and bug fixes. InfluxDB 3 Core is free and open source, optimized for recent data, and licensed under MIT and Apache 2.

Is Your "Single Pane of Glass" Leaving You Blind to the Real Problem?

In the push to simplify IT management, the idea of a single, all-encompassing AIOps platform is certainly appealing. The promise of one dashboard to monitor the entire IT stack—from applications and infrastructure to the network—suggests a world of streamlined operations. This generalist approach aims to provide a broad overview, correlating data from across the business to spot trends and potential issues.

Introducing new issue detectors: Spot latency, overfetching, and unsafe queries early

Not everything in production is on fire. Sometimes it’s just... a little warm. A page that loads a second too slow. An API that returns way more than anyone asked for. A query that feels totally fine until someone sends something unexpected and suddenly you’ve got an incident.

5 Notable Examples of Network Maps and Diagrams

A network map is a visual representation of the devices and connections that make up an IT network. For IT professionals, network maps are essential tools for monitoring performance, troubleshooting issues, enhancing security and planning infrastructure upgrades. There are multiple types of network maps, each serving a specific purpose, ranging from physical layout diagrams to cloud-based and security-oriented architectures.

RUM Versions: one click deployment tracking

Deployments should drive your product forward, not slow you down. Yet too often, teams spend hours digging through logs, dashboards, and error reports just to answer a simple question: did the release go smoothly? Coralogix’s new Versions feature answers this in a single click, letting teams spend more time building and less time investigating.

Integrating CI/CD Pipelines with Observability Tools

CI/CD pipelines are automated workflows that take code from development to production. The term covers two key practices: continuous integration (CI) and continuous delivery or deployment (CD). A typical CI/CD pipeline includes stages like code compilation, testing, security scanning, artifact creation, and deployment across multiple environments.

Why Observability Isn't Just for SREs (and How Devs Can Get Started)

Almost every other day, when I scroll past r/devops or r/sre, I see a post asking how a dev can get started with DevOps, observability, and so on. This blog is an attempt to help anyone who is lost find their way into observability, and a wake-up call for devs that they should think about observability more actively today than ever before.

AIOps Tools: Key Features and Top 8 Solutions in 2025

AIOps tools use machine learning, big data, and automation to enhance IT operations. These tools analyze IT data, detect anomalies, and automate tasks, improving efficiency and reducing manual effort. Popular AIOps tools include Selector, Splunk, Dynatrace, Datadog, BigPanda, Dell AIOps, IBM Cloud Pak for AIOps, and LogicMonitor.

This Month in Datadog: Bits AI SRE, Datadog Data Observability, and more

Datadog is constantly elevating the approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. To learn more about Datadog and start a free 14-day trial, visit Cloud Monitoring as a Service | Datadog. This month, we chat with two guests about Bits AI SRE and Datadog Data Observability.

Netdata Overview: All You Need to Know in Under 3 Minutes

In just a few minutes, this walkthrough will show you how to unlock the full power of Netdata during your trial period. From real-time metrics to AI-powered insights, learn how to get immediate value without any guesswork. Whether you're running a Homelab or managing production systems at scale, this video will help you hit the ground running and make every minute of your trial count. Let’s turn your trial into insight, clarity, and control.
Sponsored Post

AIOps for SAP: From Ground to Cloud

Anyone working in the SAP market in 2025 is aware of two big topics: migration to cloud-based ERP and the end of many long-used tools for managing SAP operations including Focused Run, Landscape Manager and Solution Manager. Both are impossible to ignore. Cloud-based ERP presents a new era of business software possibilities, and with it the opportunities and complexities of migration, transformation, and leveraging the elastic capacity and scalability of cloud-based designs. But right behind it, the question becomes "how are we going to run and manage this?"

Smarter Insights and Pipeline Control - New in DataStream

We’re constantly improving DataStream to make security data management simpler, smarter, and more efficient for modern SOCs. This latest update introduces new capabilities that bring even more visibility and flexibility to your telemetry pipelines. Let’s take a closer look at what’s new.

Real-Time Flight Telemetry Monitoring with InfluxDB 3 Enterprise

When Microsoft Flight Simulator 2024 generates telemetry data at 30-60 FPS, capturing and processing that stream in real-time becomes a fascinating engineering challenge. We built a complete telemetry pipeline that reads over 90 flight parameters through FSUIPC, streams them to InfluxDB 3 Enterprise, and displays them in real-time dashboards that respond in under 5 milliseconds.

Diagnosing Wi-Fi failures that traditional tools miss: a case study

A global airline experienced persistent Google Meet connectivity issues with no apparent network infrastructure faults. While their APM tool offered visibility into network paths, it didn’t surface any local anomalies. Catchpoint’s endpoint monitoring, however, revealed performance degradation specifically on Wi-Fi Channel 44 (5GHz band), where signal strength dropped to -80 dBm compared to optimal ranges of -30 to -50 dBm.

How we're killing YAML fatigue with our new K8s integration process

Kubernetes has rapidly grown in adoption, with more than 84% of surveyed users evaluating or actively using Kubernetes in some way. It has become the go-to container orchestration platform. As we grow the Coralogix platform, we continuously go back and improve flows that we believe will have a high impact on our user base.

Splunk Expands Data Management Capabilities To Include Ingest Monitoring

Managing data ingestion at scale is no easy task. As organizations onboard hundreds or even thousands of data sources into the Splunk platform for security, observability, and other business-critical use cases, it becomes increasingly complex to ensure data is consistently available and onboarded efficiently.

Multi Factor Authentication for Synthetic Monitoring for AVD

Today, I’ll cover some of the basics of monitoring Multi-Factor Authentication and why ensuring MFA is implemented is essential, particularly in environments where remote access is possible. I’ll cover some recent, specific case studies where a lack of MFA has led to security breaches and the mechanisms the bad actors used.

AI Agents Console: Monitor the behavior and interactions of any AI agent in your stack

With Datadog's AI Agents Console, you can monitor the behavior and interactions of any AI agent that’s a part of your enterprise stack, whether that’s a computer use agent like OpenAI’s Operator, an IDE agent like Cursor, a DevOps agent like GitHub Copilot, an enterprise business agent like Agentforce, or your internally built agents. You'll have full visibility into every agent's actions, insights into the security and performance of your agents, analytics on user engagement, and measurable business value from every agent, all in a centralized location.

New in APM

Datadog’s Latency Investigator for APM, now in Preview, automatically investigates hypotheses in the background, comparing historical traces and correlating change tracking, DBM, and profiling signals. This helps teams quickly isolate root causes and understand impact without combing through raw telemetry data. You can go from detection to resolution in a single workflow, and generate a pull request to apply a recommended fix, all without leaving Datadog.

Evals are just tests, so why aren't engineers writing them?

You’ve shipped an AI feature. Prompts are tuned, models wired up, everything looks solid in local testing. But in production, things fall apart—responses are inconsistent, quality drops, weird edge cases appear out of nowhere. You set up evals to improve quality and consistency. You use Langfuse, Braintrust, Promptfoo—whatever fits. You start running your evals, tracking regressions, fixing issues, and confidence goes up as a result. Things improve.

Don't fly blind... monitor from your users' perspective.

Most monitoring strategies focus only on what happens inside their applications... but that’s not what your users experience. From your backend to the cloud, through third-party APIs, DNS, CDNs, ISPs, and finally to the user’s device, every link in the chain matters. Without that visibility, you're flying blind when something breaks in your Internet Stack. Catchpoint’s 3,000+ intelligent agents across 100+ countries deliver true end-to-end visibility, capturing every hop, every variable, and every moment of user impact.

Incident IQ integration is here!

We’re excited to launch one of our most highly requested integrations: StatusGator now connects directly with Incident IQ. This powerful new integration bridges the gap between real-time service monitoring and your internal support workflow. Now, whenever someone reports an outage on your public StatusGator page, a ticket is automatically created in Incident IQ—ensuring your IT team can respond quickly and efficiently.

From Alert to Answer in Seconds: Accelerating Incident Response in Dynatrace

It is 12 PM and you have just started eating lunch when your phone starts buzzing. A storm of monitoring and system-level alerts starts stacking up on your phone and Slack. The incident response "war room" opens and downtime communications are being drafted to customers. Your team is under pressure to find the root cause, but you are immediately hit with roadblocks.

Building an Effective Post-Mortem Culture: A Step-by-Step Guide

Post-mortems are the cornerstone of continuous improvement in incident management. When done right, they transform failures into learning opportunities and prevent future outages. Yet many teams struggle to build a culture where post-mortems are valued rather than feared.

What is Grafana Cloud? Fully Managed Observability Built on Open Standards | Grafana Labs

Grafana Cloud helps teams detect, investigate, and resolve incidents faster—thanks to AI, open standards, and seamless integrations with OpenTelemetry, Prometheus, Salesforce, and more. See how it all works in this live demo of a simulated e-commerce outage.

New in OTel: Auto-Instrument Your Apps with the OTel Injector

As distributed systems scale, maintaining manual instrumentation across services quickly becomes unsustainable. The OTel Injector addresses this by automatically attaching OpenTelemetry instrumentation to applications, no code changes needed. This blog covers how the OTel Injector works, how it integrates with Linux environments, and how to set it up for consistent telemetry across your stack.

Why Your Loki Metrics Are Disappearing (And How to Fix It)

Grafana Loki is up and running, log ingestion looks healthy, and dashboards are rendering without issues. But when you query logs from a few weeks ago, the data's missing. This is a recurring problem for many teams using Loki in production: while the system handles short-term log visibility well, it often lacks the retention guarantees developers expect for historical analysis and incident review.

Disposable Code Is Here to Stay, but Durable Code Is What Runs the World

Every day I seem to run into yet another post with someone solemnly opining that “writing code has never been the hardest part of software engineering.” And hey, that’s smashing. As an engineer from the ops/infra/SRE side of the house, I feel like I’ve been saying this my whole career. (Is there anything more satisfying than being proven right in public? Not in my book.) So, which is it?

Data Observability: Build confidence in the data life cycle

Datadog Data Observability provides a complete solution with quality checks (e.g., volume, row changes, freshness), custom SQL-based monitors, anomaly detection, column-level lineage across systems like Snowflake and Tableau, full pipeline visibility, and targeted alerts when data issues arise.

Scaling Online Game Infrastructure for High-Engagement PvM Content

The explosive popularity of player-versus-monster (PvM) content in online games brings significant backend challenges, particularly as titles scale globally. Instanced boss fights, real-time combat logic, and mass player concurrency demand robust, responsive server infrastructure that can scale both horizontally and vertically - without degrading the player experience.

How to Measure VoIP Quality & MOS Score (Mean Opinion Score)

Are you tired of constantly dropping calls or struggling to hear your loved ones on the other end of the line? Fear not, because we're here to talk about the one thing that can make or break your VoIP experience: MOS score. No, we're not talking about the fuzzy creature from Star Wars - we're talking about the Mean Opinion Score, the nifty little metric that can help you measure and improve the quality of your VoIP calls.

Proven escalation policy framework (w/ templates & checklists)

I bet every support team lead has had that moment — a critical incident spiraling out of control because nobody knew exactly when or how to escalate it. Been there, done that. But here's the thing — most organizations treat escalation policies as an afterthought, usually cobbling together makeshift procedures only after a major incident has already caused havoc. There's nothing wrong with learning from experience, of course. It's just not the best approach. So what's better?

Coralogix secures 188 badges in G2 Summer 2025 Reports

As we cruise through 2025 with momentum from our recent $115M Series E raise, the launch of Olly (our AI agent for observability), and our recognition as a Visionary in Gartner’s Magic Quadrant for Observability Platforms, we’re excited to celebrate another major milestone – earning 188 badges in the G2 Summer 2025 reports! At the heart of every G2 badge we earn is the voice of our customers, and their continued trust is what drives us forward.

Here's how you can monitor your site's SEO performance

SEO is in a weird place right now. About one in five LinkedIn posts in my feed currently claims that SEO is dead, or has been assimilated by LLMs. Do not be misled, dearest reader, because even an LLM still relies on search engines like Google and Bing for web crawling. In other words, SEO still matters, a lot. Additionally, it's never a bad idea to keep tabs on your website's SEO performance.

How MSPs Can Offer DNS Monitoring as an Add-On Service

Most MSPs don’t advertise DNS monitoring as a service—but they should. Why? Because when DNS goes wrong, your client won’t blame their registrar or email provider. They’ll blame you. And the worst part? You probably didn’t know anything had changed until the problem reached your inbox.

How to Create a Runbook Template That Actually Gets Used

A runbook template is only valuable if your team actually uses it during incidents. Yet many organizations create elaborate documentation that sits untouched in wikis, gathering digital dust while engineers scramble through incidents without guidance. The difference between a runbook that gets used and one that doesn't comes down to practicality, accessibility, and continuous improvement. Let's explore how to create runbook templates that become essential tools rather than checkbox exercises.

How Prometheus 3.0 Fixes Resource Attributes for OTel Metrics

When you export OpenTelemetry metrics to Prometheus, resource fields like service.name or deployment.environment don’t show up as metric labels. Prometheus drops them. To use them in queries, you’d have to join with target_info, which makes filtering and grouping more difficult than necessary. Prometheus 3.0 changes that. It supports resource attribute promotion, automatically converting OpenTelemetry resource fields into Prometheus labels.
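To make the difference concrete, here is a minimal Python sketch, not Prometheus code. It shows the kind of target_info join a query would need without promotion, then mimics what promotion does to a metric's label set. The metric name, attribute values, helper names, and the dot-to-underscore naming rule are all assumptions made for illustration.

```python
# Illustrative sketch only: this mimics the effect of resource attribute
# promotion, it is not how Prometheus implements it.

# Without promotion, a query has to join against target_info to reach
# resource attributes. "http_requests_total" is a hypothetical metric name.
JOIN_QUERY = (
    "rate(http_requests_total[5m]) "
    "* on (job, instance) group_left (service_name) target_info"
)


def to_label_name(attribute: str) -> str:
    """Assumed naming rule: any character outside [A-Za-z0-9_] becomes '_'."""
    return "".join(
        ch if (ch.isascii() and ch.isalnum()) or ch == "_" else "_"
        for ch in attribute
    )


def promote(resource: dict, labels: dict, promoted: list) -> dict:
    """Return a copy of `labels` with the selected resource attributes added."""
    out = dict(labels)
    for attr in promoted:
        if attr in resource:
            out[to_label_name(attr)] = resource[attr]
    return out


# Hypothetical resource attributes attached to an OpenTelemetry metric.
resource = {
    "service.name": "checkout",
    "deployment.environment": "prod",
    "host.arch": "amd64",
}
labels = {"job": "checkout", "instance": "10.0.0.1:8080"}

promoted_labels = promote(
    resource, labels, ["service.name", "deployment.environment"]
)
print(promoted_labels)
```

With the promoted labels on the series itself, a filter such as service_name="checkout" can be applied directly instead of going through the join. Attributes that are not in the promoted list, like host.arch here, stay off the metric.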

OTel Weaver: Consistent Observability with Semantic Conventions

Deploying a new service shouldn’t break dashboards. But it happens, usually because metric names or labels aren’t consistent across teams. You end up with traces that don’t link, metrics that don’t align, and queries that take hours to debug, not because the system is complex, but because the telemetry is fragmented. OTel Weaver addresses this by enforcing OpenTelemetry semantic conventions at the source.

Grafana 12.1 release: automated health checks for your Grafana instance, streamlined views in Grafana Alerting, visualization updates, and more

It’s official: Grafana 12.1 is here! The latest release delivers new features that simplify the management of Grafana instances, streamline how you manage alert rules (so you can find the alerts you need, when you need them), and more. Below are just some of the highlights from the latest Grafana release. If you are looking for more details about all the changes in this release, refer to the changelog or the What’s New documentation.

Top 10 Status Page Examples: What We Like and What's Missing

A great status page does more than show uptime—it builds trust, communicates clearly during incidents, and empowers users to stay informed. Here are 10 standout examples of public status pages, with a quick breakdown of what they do well—and where there’s room for improvement.

Unifying Observability: Intelligence, Automation, and Insights in Action

As enterprise IT environments evolve into ever-greater complexity and scale, demands on operations teams are accelerating. In the traditional model, observability tools collect data, engineers manually correlate events, and remediation follows a ticketing trail. However, that approach no longer matches the speed and scale of today’s digital businesses. Even the most storied dashboards can’t address today’s operational needs.

5 Assumptions CIOs Need to Rethink: Monitoring in the Age of Complexity

Today’s digital delivery models have fundamentally changed, yet many CIOs are still using monitoring strategies built for a world that no longer exists. With Internet dependencies, external APIs, SaaS platforms, CI/CD pipelines, and microservices dominating modern architectures, performance and reliability now hinge on systems IT teams don’t fully control. Traditional, reactive monitoring tools fail to provide visibility into the end-to-end experience. They alert you after the customer has already felt the pain.

Observing Vercel AI SDK with OpenTelemetry + SigNoz

LLM-powered apps are growing fast, and frameworks like the Vercel AI SDK make it easy to build them. But with AI comes complexity. Latency issues, unpredictable outputs, and opaque failures can impact user experience. That’s why monitoring is essential. By using OpenTelemetry for standard instrumentation and SigNoz for observability, you can track performance, detect errors, and gain insights into your AI app’s behavior with minimal setup.

Updated MPLAB X IDE Plugin

We’re happy to announce that our Trace Export Plugin for MPLAB X IDE has been updated to version 2.3.1 and now supports the latest versions of Microchip’s IDE, including MPLAB X v6.20 and v6.25. This plugin enables saving trace files from Percepio’s TraceRecorder library via the MPLAB X IDE debugger, making it easy to open the trace in Percepio Tracealyzer and related tools.

How I Use GenAI as a Thought Partner, Not a Shortcut

You don’t need to be a power user to get powerful results. I’m not training models or prompting GPTs into poetry—I’m just using them to do what great managers already try to do: communicate clearly, prioritize outcomes, and lead with intention. Over the last few quarters, I’ve built a handful of custom GPTs to support my weekly, monthly, and quarterly workflows.

How Synthetic Monitoring Can Warm Up Your CDN (and Why It Matters)

In the high-stakes world of web performance, every millisecond counts. A single second of delay can result in a 7% reduction in conversions, while 10% of users will abandon a site for every additional second it takes to load. For organizations operating at global scale, Content Delivery Networks (CDNs) have become indispensable infrastructure for delivering fast, reliable user experiences.

13 Best Log Analysis Tools of 2025. Top Paid, Free & Open-Source Log Analyzers Reviewed

Log analysis and management tools have become essential in troubleshooting. With log analyzers you can extract meaningful data from logs to pinpoint the root cause of any app or system error, and find trends and patterns to help guide your business decisions, investigations, and security. If you’re not already using such a tool, now is the time to start looking for one.

Grafana Campfire - Using the Grafana MCP Server (Grafana Community Call - July 2025)

In this month's Campfire Community call, we will be exploring the Grafana MCP (Model Context Protocol) server - an open-source tool that enables AI assistants to directly interact with your Grafana instance. Join me (Usman), Matt Ryer, and David Kaltschmidt for this exciting session. Expert guests: Ioanna Armouti and Luccas Quadros. Feel free to use the YouTube live chat feature to start submitting questions, and we will add them to the agenda.

SD-WAN, SASE, SSE, and the Coffee Shop Network: From Distraction to AI Superpower

Back in 2018, I wondered (perhaps loudly) if SD-WAN was just IT’s hype-of-the-year, destined for the same eye-rolls as signature-based antivirus and GDPR compliance drives. Even then, I knew we couldn’t let messaging fatigue blind us to real technology shifts. Fast-forward to 2025: SD-WAN (Software-Defined Wide Area Network) not only stuck around, but became the springboard to something far bigger – SASE (Secure Access Service Edge).

How AI Agents Reason, Act, and Automate at Scale

In our previous post, we explored the urgent need for intelligent automation in network automation, specifically how the Model Context Protocol (MCP) enables AI agents to dynamically discover and interact with the necessary tools. But access to tools is only part of the equation. To truly operate autonomously in complex environments, agents need not only connectivity but also intelligence.

7 Clear Signs Your Team Needs Centralized Monitoring

Managing multiple systems without centralized monitoring is like trying to watch security footage from 20 different screens simultaneously. You might catch some issues, but you'll inevitably miss critical problems until they explode into major incidents. If your team is struggling with scattered monitoring tools, delayed incident responses, or constant firefighting mode, it's time to evaluate whether you need a centralized monitoring solution. Here are the key warning signs to watch for.

How to Build Resilient Telemetry Pipelines with the OpenTelemetry Collector: High Availability and Gateway Architecture

Let’s bring that back. Today you’ll learn how to configure high availability for the OpenTelemetry Collector so you don’t lose telemetry during node failures, rolling upgrades, or traffic spikes. The guide covers both Docker and Kubernetes samples with hands-on demos of configs. But first, let’s lay some groundwork.

Why continuous profiling is the fourth pillar of observability

Developers have long used profilers to diagnose performance bottlenecks and improve the efficiency of their code. But a modern version of profiling, continuous profiling, is quietly redefining what profiling is and what it can do. By running nonstop in production with very low overhead, continuous profilers give teams always-on visibility into how their code behaves in the real world.

How Sentry could stop npm from breaking the Internet

Caching is great! When it works… When it fails, it puts a big load on your backend, resulting in either a self-inflicted DoS, increased server bills, or both. This article is inspired by a real-world incident that happened to npm back in 2016. In the next part, Ben recounts his personal experience responding to the incident while working at npm.

StatusGator now supports Microsoft Teams Workflows

We’ve updated our Microsoft Teams integration to support workflows — Microsoft’s new and recommended approach to incoming webhooks. As Microsoft evolves its platform, it is phasing out the legacy Connectors feature in favor of Workflows. At StatusGator, we’re committed to keeping up with these changes so your integrations remain reliable and future-proof.

Observability Data: Ingestion Pipeline Best Practices

Great data is a prerequisite to all things AIOps and observability. Great observability data results in fewer observability gaps, better analysis and insights, and more confidence within teams that rely on the power of modern AIOps and observability technologies. Goals for improved automation, IT efficiencies, intelligent triage and remediation all become more achievable with better data.

Tracking planes with Grafana in real time: How to visualize the aircraft overhead with your own dashboard

Ever since I was little, I’ve been fascinated by airplanes. Whether it was the excitement of boarding a flight for a holiday or the wonder of admiring them from the ground, there’s always been something magical about these incredible machines. Fast forward a few years, and now we have the ability to track aircraft in real-time from the palm of our hands using a variety of apps.

How sum_over_time Works in Prometheus

The sum_over_time() function in Prometheus gives you a way to aggregate counter resets, gauge fluctuations, and histogram samples across specific time windows. Instead of seeing point-in-time values, you get the cumulative total of all data points within your chosen range—useful for calculating totals from rate data, tracking accumulated errors, or understanding resource consumption patterns over custom intervals.
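To make the idea of a cumulative total over a range concrete, here is a minimal Python sketch of the arithmetic only. It is not Prometheus code: the `sum_over_window` helper, the sample data, and the choice of a window that excludes its left edge and includes its right edge are all assumptions made for illustration.

```python
from typing import Iterable, Tuple

# Hypothetical sample shape: (unix timestamp in seconds, sample value).
Sample = Tuple[float, float]


def sum_over_window(samples: Iterable[Sample], eval_time: float, window_seconds: float) -> float:
    """Add up the values of every sample that falls inside the lookback window.

    Assumption: the window covers eval_time - window_seconds < ts <= eval_time.
    """
    start = eval_time - window_seconds
    return sum(value for ts, value in samples if start < ts <= eval_time)


# Illustrative data: one sample every 15 seconds.
samples = [(0, 2.0), (15, 3.0), (30, 1.0), (45, 4.0), (60, 5.0)]

# A 30-second window evaluated at t=60 picks up the samples at t=45 and t=60.
last_30s = sum_over_window(samples, eval_time=60, window_seconds=30)

# A 60-second window evaluated at t=60 picks up everything after t=0.
last_60s = sum_over_window(samples, eval_time=60, window_seconds=60)
```

The point of the sketch is that the result depends on how many samples land in the window as well as on their values, so changing the scrape interval changes the total even when the underlying signal does not.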

Getting started with the Grafana plugin

The idea of having a SquaredUp plugin for Grafana might seem a little bit unnecessary at first. They are both dashboarding products, so why would you want to create a dashboard about dashboards? The answer to this conundrum is that the SquaredUp Grafana plugin isn’t quite a matter of taking Grafana dashboards and recreating them on the SquaredUp canvas.

How Secure and Healthy Are Your Custom SCOM Management Packs?

Thanks for using the NiCE Log File Management Pack. We know it’s a favorite among experts building custom SCOM Management Packs. But here’s a quick question: When was the last time someone checked your custom Management Packs for security vulnerabilities, performance bottlenecks, or health risks?

OpenTelemetry NestJS Implementation Guide: Complete Setup for Production [2025]

NestJS applications require comprehensive monitoring to ensure optimal performance and rapid issue resolution. As your application grows—spanning multiple services, databases, and external APIs—understanding what's happening under the hood becomes critical. That's where OpenTelemetry comes in. OpenTelemetry provides vendor-agnostic observability for your NestJS applications through distributed tracing, metrics, and logs.

Use Telegraf Without the Prometheus Complexity

Every system needs observability. You need to know what your CPU, memory, disk, and network are doing, and maybe keep an eye on database query latency or Redis connection counts. But setting that up isn’t always simple. You start with a couple of shell scripts. Then come exporters. Then Prometheus. Before long, you’re managing scrape configs, tuning retention, and watching dashboards fail under load after two days of data.

SMS alerts enabled for Early Warning Signals

When service disruptions happen, every second counts. That’s why we’re excited to announce a major update to StatusGator: Early Warning Signals are now available via SMS. Early Warning Signals have already been helping teams stay ahead of outages via email and Slack alerts — and now, with SMS support, you can get real-time notifications directly on your phone, even before incidents are publicly acknowledged.

Securely query data sources on your Tailscale network using Private Data Source Connect in Grafana Cloud

Balancing security with your observability needs can be a difficult task. We know our users want to leverage platforms like Grafana Cloud to visualize and gain valuable insights into their data, while also keeping their data sources private and secure.

Advanced Proactive SSL Certificate Monitoring

eG Enterprise version 7.5 introduces advanced capabilities for detailed SSL Certificate Monitoring including monitoring for web servers and apps using SSL. Monitoring SSL certificates is essential to ensure secure connections, prevent service outages, and maintain user trust. Here are a few things you need to monitor and questions you should ask to keep your services and apps running reliably and securely.

What is Java Performance Monitoring? [A Guide to DevOps Engineers]

You rolled out a Java application that worked fine in development. Fast, clean, no errors. However, once it went into production, things began to change. Suddenly, the app feels slow. CPU usage climbs without warning. Some users start getting timeouts. You check the dashboards, but nothing jumps out. You look through the logs, but it's mostly noise. And then the questions start coming in - "Is the JVM the problem?" If you've been in that situation, you're not alone.

The Benefits of Visibility in Higher Education Networks

Higher education institutions face unique cybersecurity challenges due to their complex networks, diverse user base and open academic environments. With thousands of students, staff and faculty members accessing resources from various locations and devices, universities must have visibility of what’s happening on their networks and robust and responsive cybersecurity protection to help safeguard them.

Throughput Upgrade (With Train Illustrations!)

URLs) receive spiky traffic:. The Healthchecks open-source project includes a fully functional, tested and type-annotated ping handler written in Python. On self-hosted Healthchecks instances, when you send an HTTP request to a ping URL, a Django view collects and validates information from the request, then uses Django ORM to update a Check object in the database and insert a Ping object in the database. This approach is good for tens to low hundreds of requests per second, depending on hardware.

Zero Trust Starts with Zero Blind Spots

Zero Trust is more than a buzzword in today’s cybersecurity playbook; it’s a strategic imperative. Federal agencies, defense operations, and civilian infrastructure providers are all under mounting pressure to deploy Zero Trust Architecture (ZTA) frameworks that are not only compliant but truly effective. But there’s a problem: Zero Trust can only succeed if it’s built on real-time, actionable insight. That means eliminating blind spots.

Taming Your Dynatrace Bill: How to Cut Observability Costs, Not Visibility

Dynatrace is a powerhouse for application performance monitoring and business analytics. But for many organizations, its power comes with a significant challenge: as applications scale across complex hybrid environments and diverse tech stacks, the sheer volume and variety of logs, metrics, and traces sent to the platform can explode, leading to staggering and unpredictable costs.

Debug live production issues with the Datadog Cursor extension

The Datadog Cursor Extension uses the Datadog remote MCP Server to give developers access to Datadog tools and observability data directly from within the Cursor IDE. The Cursor Extension enables you to view live variable values that your logpoints capture during execution, and you can use the Cursor Agent to identify the lines of code responsible for the issue at hand. The Datadog Cursor Extension is now available in Preview.

AI-Driven Alert Correlation with EventiQ in Splunk ITSI

In this video, we introduce EventiQ in Splunk ITSI, a powerful AI-driven solution designed to cut through the noise and help you find the root cause of issues faster. We’ll show you how EventiQ automatically analyzes and groups related alerts into actionable episodes, significantly reducing alert volume. We’ll cover how to enable EventiQ for a Notable Event Aggregation Policy and review the resulting episodes that it creates.

Seeing the Bigger Picture: Why Security Needs Depth, Not Just Products

A recent BBC article, “Weak password allowed hackers to sink a 158-year-old company,” outlined a serious security lapse. This case reinforces the message that we, at Teneo, advocate every day: true resilience comes from defense in depth, i.e. policy, product and process, not just tools at the edge. In a recent customer engagement, we discussed a transition from VPN to ZTNA. While ZTNA offers enhanced security including continual checking, improved segmentation and a minimized attack surface.

AWS Summit NYC 2025: Laser-Focused on AI

If you’re unfamiliar with AWS Summits, these are conferences that occur on a yearly basis in different cities. The events are mostly used to announce new products and technologies. This year, the theme was AI, as evidenced by the keynote, a large majority of the talks, and a walk around the vendor floor. The keynote talk was hosted by Swami Sivasubramanian, VP of Agentic AI at AWS.

Monitoring Ruby on Rails applications with Applications Manager

Ruby on Rails is the go-to framework for organizations to build flexible, database-driven web applications with high speed and efficiency. Enterprises of all sizes rely on it to build user-friendly applications. But like any other modern web stack, optimizing the performance, availability, and reliability of Rails applications, especially in production environments, requires more than just reactive bug fixes.

Why Your Business Needs APM: 10 Key Benefits You Shouldn't Ignore

In today’s digital world, how well your applications perform has a big impact on how people see your business, and how well it runs. Whether you are in finance, e-commerce, SaaS, healthcare, or media, your users expect everything to work smoothly, all the time. Even a few seconds of slow performance can lead to lost sales, lower productivity, and unhappy customers. That’s why Application Performance Monitoring (APM) is so important.

How Datadog Cloud Network Monitoring helps you move to a deny-by-default network egress policy at scale

When organizations first begin deploying workloads on Kubernetes, it's common for them to start with a permissive egress traffic policy that allows any workload to reach the internet. This approach can make it easier for teams to stay agile and to get services up and running in fast-moving environments. But as your Kubernetes footprint grows, it's important to minimize public internet access on a per-workload basis to improve your organization's security posture.

How SAP achieved world-class uptime through modern observability

SAP Customer Experience (CX) has undergone a remarkable transformation over recent years, evolving from fragmented monitoring to a scalable, automated observability powerhouse. In a recent fireside chat, Martin Norato Auer, SAP CX’s VP of Observability, shed light on the strategies, practices, and measurable impacts behind SAP’s SLA, uptime, and responsiveness achievements.

Taking AI Apps From Prototype to Production

At this year’s AWS Summit in New York, agentic AI took center stage with Amazon’s launch of Bedrock AgentCore — a powerful step toward turning AI prototypes into scalable, production-ready applications. From low-code workflows to turnkey infrastructure, a new generation of tools is enabling teams of all skill levels to build, deploy, and monitor AI agents faster than ever.

Zero instrumentation distributed tracing is here: Meet OBI on Open Telemetry

Modern systems generate enormous amounts of telemetry. The hurdle is collecting clean, connected traces without rewriting code or babysitting a fleet of language agents. That’s why Coralogix backed eBPF from the start. eBPF (extended Berkeley Packet Filter) executes sandboxed programs inside the Linux kernel, without modifying kernel source code. This method allows probes to see every request, at runtime with no instrumentation, and with near zero per‑request overhead.

Introducing Sentry's Godot SDK 1.0 Alpha, with support for Godot 4.5 Beta

Debugging during development is easy. You've got a debugger, stack traces, and logs right in front of you. But once your Godot game is in the hands of players, things get trickier. Most won’t report bugs, and if they do, you’re lucky if they include anything more than “it crashed”.

How to Create Playwright Scripts for Website Monitoring with Chrome, ChatGPT & Sematext

Let’s say you want to make sure your website works as expected. You do not want to check if it just loads. You also want to check if important buttons or features are there and working. Oh, and you don’t want to just do it once. You want to keep an eye on this pretty much all the time. And, of course, you don’t want to keep checking manually if anything broke – you want to be notified, alerted when (not if) things break. You can do this by creating a Browser Monitor.

Introducing the new search box on StatusGator

Recently we hit an exciting milestone at StatusGator: 6,000+ services now tracked! To mark the occasion, we’ve made it even easier to find the apps you care about and report outages: A brand-new, lightning-fast search box is now live on the StatusGator website. It’s built right into the top navigation, accessible from any page — and works beautifully on mobile, too.

How the StatusGator name was born

More than 10 years ago, StatusGator pioneered the concept of a status page aggregator. How was the StatusGator name created? In a group chat, of course! The incubator of crazy ideas and nerdy discussions, our friendly group chat was a place where I originally discussed the product idea, validated its use cases, and solicited feedback on name concepts. I recently unearthed screenshots from the original group chat among my friends.

Grafana Cloud updates: deeper insights in Kubernetes Monitoring, Adaptive Metrics updates, and more

We consistently roll out helpful updates and fun features in Grafana Cloud, our fully managed observability platform powered by the open source Grafana LGTM Stack: Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics. In case you missed them, here’s our monthly round-up of the latest and greatest updates in Grafana Cloud. You can also check out our What’s new in Grafana Cloud documentation to explore all the latest features. Not a Grafana Cloud user yet?

Is Your Network Ready for the Perfect Storm?

For decades, the corporate network has been the central nervous system of the enterprise. It’s the invisible, indispensable fabric that connects everything. And for just as long, the conversation has been about its growing complexity. But today, something feels different. You are no longer dealing with a predictable, manageable evolution. Instead, three immense, converging forces are creating a perfect storm, pushing traditional network management approaches to their breaking point.

Bringing GitLab Logs into Focus with Graylog

GitLab’s audit logs offer a goldmine of insights into user activity, project changes, and security events. Getting that data into Graylog for centralized analysis is easier than you might think—especially with the flexibility of our Raw HTTP input and Illuminate’s GitLab Spotlight Pack. In this two-part guide, we’ll walk you through how to get it done, from wiring up GitLab’s Audit Event Streaming to visualizing enriched events in a purpose-built dashboard.

Common Network Switch Issues & How to Fix Them

As a network admin, you're probably all too familiar with the importance of your network switches. These devices keep the heart of your network beating by connecting various devices, from computers to printers, and ensuring data flows smoothly. However, switches, like any hardware, come with their own set of issues that can disrupt productivity and cause headaches if not addressed promptly.

Bits AI Dev Agent: Automatically identify issues and generate code fixes

The Bits Dev Agent is an AI-powered coding assistant in Datadog designed to reclaim developer productivity by autonomously monitoring telemetry data, identifying key issues, and generating production-ready pull requests. Developers receive asynchronous, context-rich PRs with clear explanations, allowing them to shift their focus from troubleshooting to reviewing solutions and building better code.

Introducing Bits AI SRE, your AI on-call teammate

Bits AI SRE is your AI on-call teammate, built to autonomously investigate alerts and coordinate incident response. Integrated with Datadog, Slack, GitHub, Confluence, and more, Bits analyzes telemetry, reads documentation, and reviews recent deployments to determine the root cause of alerts—often before you’ve even opened your laptop. In fact, if you're using Datadog On-Call, you can view Bits’s findings right from your phone—so you’re always one step ahead, no matter where you are.

Datadog Incident Response: Unify remediation and communication

With Datadog's new AI voice agent in Incident Response, you can quickly get up to speed on the issue and start taking action directly from your phone. Handoff notifications make it easy to jump straight to the relevant context and quickly communicate with other responders. Finally, our status pages enable you to automatically update users on your remediation progress.

Performance Attribute widgets | Site24x7 Custom Dashboards

Learn how to visualize, analyze, and optimize real-time performance data across your infrastructure using flexible widgets—time series, text, numerical, and more. This video walks you through creating dashboards to track key metrics, compare attributes, and gain instant insights for faster troubleshooting. Perfect for network admins, IT teams, and anyone looking to boost monitoring efficiency.

You have 200 milliseconds. That's all the time you get to prove your app or website is alive.

200ms is about the speed of a blink of the eye, but it’s the difference between “this site works” and “this site’s broken.” Today’s users expect instant feedback, and that’s why it’s critical to measure from their perspective.

Mistakes To Avoid With Your Public Status Page

A public status page forms the public face of your organization's service availability. It is the first point of contact for your customers to check the status of your services during times of crisis. Hence, ensuring the credibility and uptime of your public status page is crucial to your organization's reputation. In this article we will look at the key mistakes to avoid while hosting and managing a public status page.

Scout Gives Cookpad Actionable, Rails-Specific Performance Insights

For more than a decade, Cookpad, a global platform for recipe sharing and search, has relied on APM tools to monitor critical application performance metrics, like server response times and resource usage. When their previous APM tool became too expensive after price increases, they needed to find a new solution that could check all of their boxes.

Architecting for Value: A Playbook for Sustainable Observability

You’ve built something amazing. Your services are scaling, your users are happy, and your team is shipping code like never before. Then the cloud bill arrives, and one line item makes your eyes water: observability. That Datadog invoice feels less like a utility bill and more like a ransom note. It’s a modern engineering paradox. The tools that give you sight into your complex systems are the same ones that can blind you with runaway costs.

AIOps in 2025: 4 Components and 4 Key Capabilities

AIOps, or Artificial Intelligence for IT Operations, is the application of artificial intelligence and machine learning to automate and improve IT operations. It combines big data analytics, AI, and machine learning to monitor, manage, and optimize IT environments, enabling organizations to proactively detect, diagnose, and resolve issues more efficiently than traditional methods.

Payment Orchestration: Leveraging AI for Smarter Payment Routing and Fraud Prevention

The digital payment landscape has undergone a remarkable transformation with the integration of artificial intelligence technologies. Modern businesses face the challenge of managing complex payment ecosystems while maintaining security and customer satisfaction. Payment orchestration emerges as the solution that bridges this gap, creating unified systems from fragmented payment infrastructures.

PDF Redaction for Compliance: Best Practices in IT Monitoring

If you're working in IT, especially in security, audit, or monitoring roles within this industry, you're familiar with the term 'compliance' and understand that it holds significant importance. Because compliance is not just a legal term, it's a set of rules that helps to protect sensitive data, avoid penalties, and save a company's reputation from potential PR problems.

66% of us use AI every day, but do we actually know how it works?

There’s a new kind of thinking happening in the world. It doesn’t come with memories, emotions, or doubt. It doesn’t hesitate. It doesn’t wonder. However, it appears to be thinking. Ask it a question, and it responds in perfect grammar. Ask for help, and it gives you options. You could almost believe it understands. Almost. This is what happens when machines are trained to speak like us, but without ever needing to understand us.

An Introduction to Oban for Elixir Monitoring Using AppSignal

Background task processing is something that many developers may encounter when building Elixir applications. This might include sending emails asynchronously, posting and fetching data from an API, and more. Oban, a powerful and persistent job processing library, offers a reliable way to handle background tasks, scheduled operations, and more. However, like any complex system, Oban requires careful monitoring to ensure its smooth operation, identify bottlenecks, and prevent unexpected failures.

400 Million Reasons Hackers Will Target Microsoft Again...

Yesterday, like many others in the tech community, I found myself pausing to fully grasp the implications of the Microsoft SharePoint hack. As one of the most widely adopted document management and collaboration platforms globally, SharePoint’s compromise inevitably sends ripples of concern through businesses everywhere. This news reminded me of a conversation I had just last week with an enterprise customer. We were discussing how one might approach cybersecurity from a hacker’s perspective.

What is Python Application Performance Monitoring? - [A Complete Guide]

A recent study looked at real-world Python programs and found something important: Python isn’t the main reason apps slow down. The real problems come from inside the code like poor logic, memory issues, and slow database queries. The problem is, these issues often go unnoticed. Your app may seem fine until users start complaining about slowness or things start breaking under pressure.

Ingest, Explore, Validate: A Quickstart with InfluxDB 3 Enterprise and Explorer UI

Great observability doesn’t just collect metrics—it tells you exactly what’s broken, why it’s broken, and what to do about it. InfluxDB 3 Enterprise delivers this through real-time ingestion, fast queries, and scalable storage. InfluxDB 3 Explorer provides the intuitive interface your team needs for database management, data ingestion, querying, and visualization without the usual complexity.

From Reactive to Resilient: Why CIOs Must Lead the Automation Shift to Achieve True Business Agility

For decades, CIOs have fought to keep pace with rising digital complexity. As IT environments have grown more fragmented and dynamic, operational stability has often come at the cost of strategic agility. But the game is changing. What once required heroic effort to maintain is now table stakes—and the new expectation is that IT won’t just support the business, it will help steer it.

How to Monitor JavaScript Memory Leaks in Production

Remember when JavaScript was just for making snowflakes fall on your GeoCities page? Those were simpler times. Now we’re building entire applications in the browser, and surprise! JavaScript wasn’t exactly designed with memory management in mind. While other languages have garbage collectors that actually, you know, collect garbage, JavaScript’s garbage collector is more like that roommate who promises to clean but just shoves everything under the bed. The real kicker?

Ship Confluent Cloud Observability in Minutes

You're running Kafka on Confluent Cloud. You care about lag, throughput, retries, and replication. But where do you see those metrics? Confluent gives you metrics, sure, but not all in one place. Some live behind a metrics API, others behind Connect clusters or Schema Registries. You either wire them manually or give up. What if you could stream those metrics to a platform built for high-frequency, high-cardinality time series, and do it in minutes?

How to Cut Observability Costs with Synthetic Monitoring and Responsive Pipelines

Platform teams are struggling with observability noise, bloated storage costs, and lack of clarity during incidents. Most teams capture everything all the time, leading to expensive, overwhelming, and often unnecessary data volumes. In Telemetry for Modern Apps, Mezmo teamed up with Checkly to demonstrate how synthetic monitoring triggers and responsive telemetry pipelines can help reduce costs while maintaining the context needed during incidents.

Six platform updates giving you time back in your day

Ever look at your to-do list at the end of the day and realize it’s grown longer, not shorter? We get it—there’s always more to do and never enough time. But if you’re a Sumo Logic user, reading this blog will be a win for your day because we’re giving you six ways to slash the time you spend on tasks in your platform.

Silent Support Systems and The Infrastructure That Keeps Factories Running

What keeps a factory running when no one's watching? Behind every smooth production line is a network of support systems (compressed air, steam, HVAC, fuel delivery, hydraulic circuits) that operate quietly but are vital to performance. These systems don't grab attention like robotics or automation, yet they prevent downtime, protect equipment, and ensure safety. Ignoring them can lead to unexpected failures and costly interruptions. Understanding how they function and why they matter is essential for maintaining efficient, reliable operations in any industrial setting.

NiCE Active 365 Management Pack 4.4 for Microsoft SCOM

We’re thrilled to release NiCE Active 365 Management Pack 4.4 for Microsoft SCOM. The new 4.4 release is packed with powerful new enhancements driven by customer input and evolving needs. It especially focuses on improving monitoring capabilities for Azure-based services and ensuring compatibility with Microsoft’s evolving ecosystem.
Sponsored Post

Atlassian Jira Monitoring on Microsoft SCOM

As part of a customer project, we developed a custom Jira Management Pack for Microsoft System Center Operations Manager (SCOM). This tailored solution enables IT operations teams to monitor key performance and health metrics of Jira environments, ensuring planning and bug-tracking platforms remain available and performant. With this Use Case paper, we want to share our knowledge with the SCOM Community to highlight the possibilities of advanced monitoring on Microsoft SCOM, helping teams get better in their day-to-day tasks.
Sponsored Post

Streamlining multi-cloud complexity with unified observability

A wave of businesses are embracing multi-cloud strategies to gain flexibility and scalability. By combining on-premises infrastructure, private clouds, and public platforms like AWS, Azure, and Google Cloud Platform (GCP), IT teams can experiment, deploy, transform, and improve their IT applications significantly. On the downside, this modern IT approach of employing multiple clouds (in both public and private forms) also brings significant complexity, making it challenging to monitor systems, control costs, and secure environments. There are just too many threads to track and tie together to ensure a taut IT fabric.

The Ultimate Network Assessment Template for Your Business

In the fast-paced realm of IT businesses, it's easy to overlook the intricate web that powers your operations – your network infrastructure. Let's face it, most enterprises only give it the attention it deserves when something goes wrong. And by then, the issue has often snowballed into a full-blown crisis.

MTTR, MTBF, MTTA & MTTF - Metrics, examples, challenges, and tips

When your system crashes at 3 AM and customers start flooding your support channels, every minute feels like an eternity. Mean Time to Repair (MTTR) measures exactly how long these painful moments last and, more importantly, shows how you can make them shorter. MTTR tracks the average time between when a failure occurs and when your system is fully operational again. This metric directly impacts customer satisfaction, revenue, and your team's sanity during incident response.

OpenTelemetry at Grafana Labs: the latest on how we're investing in the emerging industry standard

Here at Grafana Labs, open source has always been core to what we do. So it should come as no surprise that we’re going all in on OpenTelemetry—an open source project that’s quickly becoming an industry standard for vendor-neutral telemetry.

Monitor Nginx with OpenTelemetry Tracing

At 3:47 AM, your NGINX logs show a 500 error. Around the same time, your APM flags a spike in API latency. But what's the root cause, and why is it so hard to correlate logs, traces, and metrics? When API response times cross 3 seconds, identifying whether the slowdown is at the NGINX layer, the application, or the database shouldn't require guesswork. That's where OpenTelemetry instrumentation for NGINX becomes essential.

How to Set Up Real User Monitoring

Synthetic monitoring provides consistent, repeatable results: 2.1s load times, passing Lighthouse scores, and minimal variability. But those numbers reflect lab conditions. On slower networks, like 3G in Southeast Asia, real users may see much higher load times of 5.8s or more. This isn’t a fault of the tools. It’s a difference in testing context. Synthetic tests run on fast machines, stable connections, and clean environments.

VirtualMetric Achieves SOC 2 Certification: A Milestone in Trust and Security

We’re excited to announce that VirtualMetric has achieved SOC 2 Type 2 certification. This is a key step in our mission to deliver secure, resilient, and efficient telemetry solutions. This certification confirms that our controls for security, availability, confidentiality, and data integrity don’t just look good on paper — they work in practice, over time.

VirtualMetric in the 2025 Comprehensive Market Guide: Rising Data Pipeline Security

Over the past year, much of cybersecurity’s attention has centered on the promise of AI-powered SOCs. But as the Market Guide 2025 by Francis Odum reveals, the true foundation of modern security success lies in the data layer. “Without clean, well-routed telemetry, even the smartest AI is starved of context,” points out the researcher. And that’s where Security Data Pipeline Platforms (SDPPs) have become essential.

VirtualMetric Earns ISO 27001:2022 Certification: Security at Every Level

We’re excited to share that VirtualMetric has officially achieved ISO 27001:2022 certification, a globally recognized standard for building and managing an effective Information Security Management System (ISMS). This confirms that we’ve implemented robust controls to protect data, manage risks, and ensure the resilience of our infrastructure in today’s security landscape.

From Sequential Bottlenecks to Concurrent Performance: Optimizing Log Processing at Scale

We optimized our log processing pipeline by moving from sequential to concurrent processing at the entry level, achieving 30% higher throughput and better resource utilization without increasing infrastructure costs. When customers start sending millions of logs per minute, you quickly discover whether your processing pipeline can actually scale with vertical scaling.

Will AI Speed Development in Your Legacy App?

Some people can get an AI assistant to write a day’s worth of useful code in ten minutes. Others among us can only watch it crank out hundreds of lines of crap that never works. What’s the difference? There are some skills specific to AI development. There are also properties of the codebase we’re working in that make it amenable to AI assistance. Most AI demos use projects created from scratch with AI in mind—cute.

The Hidden Cost of Not Using APM in Production

Many organizations don’t realize how important it is to monitor how their applications run in production. Without Application Performance Monitoring (APM), it becomes difficult to detect and resolve issues quickly, leading to increased downtime, wasted developer effort, and poor user experience. These hidden costs, though not always visible at first, can impact customer satisfaction, reduce team efficiency, and result in lost revenue.

IT Service Performance Monitoring: Key Metrics, Best Practices, and Future Trends

As organizations rely more on complex IT systems and cloud-based services, keeping everything running smoothly — and reliably — has become a top priority. That’s where IT service performance monitoring comes in, giving teams the visibility they need to make sure systems stay healthy and responsive. By tracking a range of technical and user-focused metrics, businesses can quickly identify and address issues before they impact operations or end users.

Autonomous Operations Are Here

ScienceLogic’s vision for IT operations isn’t just about improving tools—it’s about changing the entire paradigm, flipping your day-to-day upside down. We’re moving beyond dashboards and alerts, beyond human-only workflows and rules-based systems. The future is autonomous. It’s intelligent. It’s agentic. And it’s already being realized through the power of Skylar AI.

The AI Monitoring crisis that no one's talking about

When I spoke at AWS London earlier this year, I had the chance to discuss something that more and more teams are starting to feel: traditional observability doesn’t cut it for AI systems. In AI, “Is it running?” is no longer enough. We have to ask, “Is it right?” When I delivered that line, I saw the heads nodding. Everyone’s excited to build with LLMs, but when it comes to actually monitoring them in production? That’s where things fall apart.

Get started with Grafana Alerting: Multi-dimensional alerts and how to route them

In this tutorial, we dig into more complex yet equally fundamental elements of Grafana Alerting: alert instances and notification policies. Don't miss the rest of the "Get started with Grafana Alerting" series! Each part dives into a different feature to help you get the most out of alerting in Grafana.

10 Essential Tips for Setting Up Monitoring for Your SaaS

Setting up monitoring for your SaaS application is crucial for maintaining reliability and keeping customers happy. Without proper monitoring, you're essentially flying blind – unable to detect issues before they impact users or understand how your system performs under different conditions. Here are 10 essential tips to help you build a comprehensive monitoring strategy for your SaaS application.

Top 3 Intune reporting tools: SquaredUp, Microsoft admin center, and Power BI

As the unsung hero of modern endpoint management, Microsoft Intune quietly ensures security, compliance, and seamless user experiences across a range of devices and platforms. Where many organisations go wrong, however, is not having the right tool to monitor and leverage Intune’s full potential. But for an organization relying on Intune, what tool should you use?

Why Use a Status Page Aggregator?

Managing multiple vendor dependencies has become a critical challenge for modern businesses. When your operations rely on dozens of third-party services, tracking their status individually becomes inefficient and risky. A status page aggregator solves this problem by consolidating all vendor status information into a single dashboard.

How to Choose the Best Vendor Monitoring Platform for Your Team

Modern businesses rely on dozens of third-party services to operate effectively. When AWS goes down, your application might crash. When Stripe has issues, payments fail. When Slack experiences an outage, team communication grinds to a halt. Vendor monitoring platforms help you track the health of these critical dependencies before they impact your operations. But with numerous options available, selecting the right platform requires careful evaluation of your team's specific needs and workflows.

Golang Application Performance Monitoring: A Comprehensive Guide

Application Performance Monitoring (APM) refers to the practice of tracking, analyzing, and optimizing the performance and availability of software applications. When it comes to Go (Golang), a language known for its concurrency, speed, and efficiency, APM becomes crucial to ensure that your applications stay fast, reliable, and scalable under real-world loads. APM in Go involves monitoring the runtime behavior, request response times, system resource usage, and error patterns across your application.

Risk Register for SREs: A Practical Guide to Proactive Incident Prevention

A risk register is one of the most powerful tools in an SRE's arsenal for maintaining system reliability. By systematically documenting potential threats to your infrastructure and services, you can shift from reactive firefighting to proactive risk management.

Set Up ClickHouse with Docker Compose

ClickHouse is built for high-performance OLAP workloads, capable of scanning billions of rows in seconds. If your analytical queries are bottlenecked on PostgreSQL or MySQL, or you're burning too much on Elasticsearch infrastructure, ClickHouse offers a faster and more cost-efficient alternative. This blog walks through setting up ClickHouse locally with Docker Compose and scaling toward a production-grade cluster with monitoring in place.

Stream AWS Metrics to Grafana with Last9 in 10 minutes

It’s 2:47 AM and your Lambda functions are timing out. API response times are spiking. You’re flipping between the CloudWatch console, your APM tool, and your logs, trying to figure out what’s going wrong. CloudWatch has the metrics you need: CPU usage, memory pressure, and request rates — but connecting that data to what your app is doing takes time. The delay in stitching it all together slows down your incident response.

I built an MCP Server for Observability. This is my Unhyped Take

Recently, I read a blog titled “It’s The End Of Observability As We Know It (And I Feel Fine)”, which discussed MCP servers in observability and how these systems would potentially be the “end of observability”. As someone who has spun up an MCP server for an observability backend and as someone who has been in the space for a while, I certainly do not think so.

Cloud or Self-Hosted - Which Deployment Model is Right For You?

Choosing the right observability platform is a critical decision. But how you deploy it is just as important. The right deployment strategy can accelerate your team, simplify operations, and ensure you meet compliance and security requirements. The wrong one can lead to operational headaches and slow you down. At SigNoz, we believe in flexibility. There is no single "best" way to deploy an observability platform; there's only the way that's best for you.

What's New with Progress WhatsUp Gold 2025.0

Efficient network monitoring starts with visibility and control. The latest release of Progress WhatsUp Gold 2025.0 will help you stay ahead of issues and maintain a healthy, secure network. Join our upcoming session to explore how the newest enhancements simplify monitoring, improve workflows, and provide deeper insights into your infrastructure.

The Dashboard That Lets You Track the ISS in Real Time | Golden Grot Awards | Grafana Everywhere

Ruben Fernandez turned his love for space into a stunning ISS dashboard that won the Golden Grot—twice. Watch how he brings data and dreams together. Congratulations to Ruben Fernandez, our 2025 Golden Grot Award winner, recognized for this unique use case and dashboard! Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more.

How to monitor your Laravel app for critical vulnerabilities using Oh Dear

A critical security vulnerability was recently discovered in Livewire v3 that allows remote code execution, as Stephen Rees-Carter reported on Securing Laravel. While patches are released quickly, many applications remain vulnerable because developers simply don't know about the issue yet. Oh Dear's Application Health monitoring solves this by continuously checking your production environment for security vulnerabilities and immediately notifying you when issues are detected.

Release v2.6: MCP Server, AI Insights Enhancement, Okta SCIM Integration, SNMP Monitoring and more.

Netdata 2.6.0 is here and it’s our most intelligent release yet! This version brings AI-powered monitoring, easier network visibility, and smoother enterprise integrations, all designed to help you troubleshoot faster and scale smarter. What's new: the Netdata Referral Program. Every referred user will get a 10% discount when they subscribe to Netdata Business or Homelab - and you will receive 10% of their subscription value (up to a maximum of $1,000 per space). You can refer an unlimited number of users, so there's no real limit to how much you can earn with the referral program.

Overview of Alerts, Real-Time Analysis, & Traceroute

Learn how Uptime.com alerts you the moment a check goes Up or Down, complete with technical details and root cause analysis for API and Transaction checks. Dive into Real-Time Analysis to track outage timelines and get detailed insight into every alert. Plus, see how Traceroute from global or private probe servers helps identify connection issues quickly and accurately. Stay informed. Respond faster. Resolve smarter.

The Case for Intelligent Automation in Network Operations

In the last decade or so, network infrastructure has undergone a massive transformation. With the rise of hybrid cloud, distributed applications, and software-defined everything, managing networks has become exponentially more complex. What used to be a stable, predictable environment is now a constantly evolving system of interconnected services, protocols, and devices, each with its own telemetry, APIs, and failure models.

Top tips: How to be a beginner again

Top tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we're talking about what it really means to start fresh, stay curious, and make space to be a beginner again—even when your calendar’s packed. If your calendar is crammed with back-to-back meetings, messages that never stop, and deadlines breathing down your neck, you're not alone.

Automate the removal of agents with Opslogix Lifecycle Management Pack

Lifecycle Management is something that falls behind in many SCOM environments. It is common for organizations to reach out to us for help with manually removing agents when servers are no longer in use. To decrease the manual tasks and automate the removal of agents, we created the Opslogix Lifecycle Management Pack.

How APM Can Improve Your Digital Customer Experience?

When a customer taps a button, submits a form or waits for a page to load, they’re not thinking about your backend architecture, microservices, or CDN; they want it to work instantly. But when it doesn’t, the frustration is immediate. Maybe the app freezes. Maybe a checkout fails. Maybe the entire experience just feels laggy. And the worst part? They don't complain, they just leave the application.

Query and Analyze Logs Visually, Without Writing LogQL

It’s 2 AM. An incident’s in progress. Error rates are climbing. You jump into the logs, filter by service, adjust the time window… and now you need a LogQL query. You write one. It errors out. You fix the syntax, try again, only to realize you need a different filter or a new aggregation. Back to rewriting. By the time you’ve got the query right, you’ve already lost 10–15 minutes. The system is still broken, and you still don’t know why.

Trace Go Apps Using Runtime Tracing and OpenTelemetry

When your Go service hits 500ms latencies but CPU usage is flat, tracing gives you visibility into what the profiler misses. With 1–2% runtime overhead, Go’s built-in tracing tools help you: This makes it easier to debug performance regressions that don’t leave a clear footprint.

Honeycomb Named a Visionary in the 2025 Gartner Magic Quadrant for Observability Platforms

In the era of AI, software development is at an inflection point, and observability has never been more critical. Teams are dealing with more code, more data, and more pressure than ever before. To navigate these new challenges, you need a partner with a strong vision for the future and a knack for looking around corners. Honeycomb is proud to be named a Visionary in the 2025 Gartner Magic Quadrant for Observability Platforms.

Elasticsearch is a recommended vector database in the NVIDIA Enterprise AI Factory validated design

Elastic now integrates with the NVIDIA Enterprise AI Factory validated design to provide users with a recommended vector database for their on-premises AI Factories. The validated design provides enterprises with a framework for building and deploying AI Factories on-premises.

Getting started with Dynatrace dashboards

Dynatrace gives you incredibly deep observability data. But all that depth can bury the insights needed. In this blog, we show how to turn Dynatrace's complex telemetry into visual dashboards that actually make sense. Dynatrace is a leading observability and application performance monitoring (APM) platform, known for its deep insight into complex, modern cloud environments. With capabilities spanning infrastructure monitoring, real user monitoring, and security, Dynatrace offers powerful telemetry.

ScienceLogic Wins AI Breakthrough Award for Predictive Analytics Platform of the Year

We’re excited to announce that ScienceLogic has been recognized in the 2025 AI Breakthrough Awards as the winner of “Predictive Analytics Platform of the Year.” This marks our second consecutive win in the program—and highlights our leadership in shaping the future of intelligent automation and observability. As organizations move from traditional monitoring toward autonomous operations, the need for real-time insight, automation, and predictive intelligence has never been greater.

MCP Server on Splunk Cloud Platform Demo

Discover the future of data interaction! This video introduces the Model Context Protocol (MCP) server on Splunk Cloud Platform, a groundbreaking capability that seamlessly connects your Splunk data with advanced AI models (LLMs). Learn how to leverage natural language to query, analyze, and manage your Splunk environment without complex SPL. In this comprehensive setup and configuration guide, we'll walk you through.

Router Monitoring for Network Admins: A How-To Guide

As network admins, we know that routers are the lifeblood of any network. They’re the unsung heroes, routing data from your internal systems to external destinations like the Internet. When routers are performing at their best, everything flows smoothly. But when they’re overloaded, misconfigured, or simply not up to snuff, your network’s performance and security are at risk.

Checkly Is Now Available in the AWS Marketplace

If your team runs on AWS, getting new tools into your workflow isn’t just about functionality. It’s about how quickly you can procure, integrate, and see value. With Checkly now available on AWS Marketplace, monitoring doesn’t have to be an exception. This launch means Checkly fits into your procurement flow the same way it fits into your CI/CD: seamlessly. No vendor approval bottlenecks, no procurement delays, just faster access to the tools your developers already want to use.

DNS Misconfigurations MSPs Can't Ignore

When something goes wrong in a client’s infrastructure, MSPs are expected to fix it—fast. But there’s one area most teams still overlook, and it’s often the first point of failure: DNS. Misconfigured DNS doesn’t always break things immediately. It’s subtle. It lingers. And when it finally causes an outage, broken email, or a security issue, it’s often too late. Here are the DNS misconfigurations MSPs can’t afford to ignore—and what to do about them.

Monitor Lambda-hosted web apps with the Lambda Web Adapter integration

As organizations migrate their legacy web applications from containerized or server-based deployments to serverless environments, they often run into a critical compatibility challenge. Traditional web frameworks like Flask, Express, or Spring Boot are designed to run on persistent HTTP servers, not event-driven, stateless environments like AWS Lambda. The AWS Lambda Web Adapter bridges this gap by allowing teams to run web server-based applications inside Lambda with minimal changes.

How Payconiq Centralized Monitoring and Enabled Real-Time Insights with Elastic

Yannick Boulleys, Head of Platform at Payconiq, shares how Elastic helped the company consolidate fragmented monitoring tools into a single platform. With real-time user monitoring, built-in anomaly detection, and GenAI-powered root cause analysis, Elastic has transformed how Payconiq manages system visibility, consumer behavior, and cost efficiency, without requiring deep technical expertise.

Kubernetes Observability with OpenTelemetry | A Complete Setup Guide

Kubernetes provides a wealth of telemetry data, from container metrics and application traces to cluster events and logs. OpenTelemetry offers a vendor-neutral, end-to-end solution for collecting and exporting this telemetry in a standardised format.

Unlock Deeper Insights: Introducing GitLab Event Integration with Mezmo

Following the popularity of our existing GitHub integration, we’ve extended similar capabilities to GitLab users. You can now ingest GitLab events directly into Mezmo Telemetry Pipelines and route them to any destination. This provides a powerful new way to monitor, alert, and react to activity within your GitLab repositories.

Modern Redux Debugging: Common Bugs and Solutions in 2024-2025

Redux remains a cornerstone of React state management, but developers continue to encounter persistent bugs and new challenges. State mutation errors remain the most common Redux bug, affecting over 70% of Redux applications, while new issues emerge with Redux Toolkit 2.0, TypeScript integration, and React 18/19 compatibility. This comprehensive guide explores the most prevalent Redux debugging challenges and provides practical solutions for modern development.

SLA vs SLO vs SLI - Examples, tips, challenges, and key differences

Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) form the backbone of reliable service delivery. Understanding how these three elements work together helps you build trust with users, maintain service quality, and create accountability across your organization.

Here's how you can build site templates for Oh Dear

When you're managing a handful of client sites, setting things up manually is fine. Though if you're managing dozens of them, you're going to think twice about your approach. For agencies, development teams and platforms who are responsible for loads of websites, having to repeat the same configuration over and over is not only inefficient but also more prone to errors. That’s where this blog post comes in handy.

Kibana Logs: Advanced Query Patterns and Visualization Techniques

Kibana gives you a structured way to explore log data indexed in Elasticsearch. With the right queries and visualizations, you can identify anomalies, debug issues more quickly, and track trends across services. This blog covers practical ways to query logs using Kibana’s Lucene and KQL syntax, build visualizations that surface meaningful signals, and set up dashboards for ongoing log-based monitoring.

Enable Kong Gateway Tracing in 5 Minutes

Kong Gateway is a popular API gateway that sits at the edge of your infrastructure, routing and shaping traffic across microservices. It’s fast, pluggable, and battle-tested, but for many teams, it remains a black box. You might have OpenTelemetry set up across your application stack. Traces flow from your app servers, databases, and third-party APIs. But the moment a request enters through Kong, observability drops off.

Build Log Automation with Last9's Query API

Manual log investigation is one of those engineering tasks that quietly drains hours without offering much real value. You're debugging an incident. Monitoring shows elevated error rates. Now begins the familiar drill: It’s a tedious cycle, and it doesn’t scale. The whole process breaks down when you’re trying to automate incident response, run continuous security monitoring, or generate compliance reports.

Choosing the right OpenTelemetry Collector distribution

The OpenTelemetry (OTel) Collector plays a central role in collecting, processing, and exporting telemetry data. If you’re deploying the Collector in production, chances are you’ve reached for the otelcol-contrib distribution. It’s the easiest, most flexible, and most documented distribution, used in nearly every demo and getting-started guide. But here’s the catch: It’s not actually recommended for production use.

Challenges in AIOps and how to sail through them

AIOps (Artificial Intelligence for IT Operations) is not only a game changer, but the need of the hour as modern IT grows and becomes increasingly complex. The promises of AIOps are both overwhelming and tantalizing. AI-powered monitoring and observability can help predict issues, automatically resolve incidents, and optimize performance across the IT infrastructure. However, onboarding an AIOps monitoring tool can be more complicated than it sounds on paper.

Honeycomb In Your IDE? Yes, With Hosted MCP Now Available in AWS Marketplace AI Agents and Tools Category

I’m pleased to announce the public beta of Honeycomb Hosted MCP, along with our first wave of one-click integrations for Cursor, Visual Studio Code, and Claude Desktop. We’re also very excited to announce that Hosted MCP is available on AWS AI Agents marketplace and for all Honeycomb plans (including our free plan!) at no charge. Honeycomb was built with a singular focus: how do we help teams become better at the art and craft of software development, delivery, and operations?

Key Early Considerations Before Big Architecture or Technology Decisions

In this final part, the Scout team continues our talk with Freedom Dumalo, former CTO at Flexcar and current CTO at Vestmark. We discuss some essential questions about architecture, touch on Rails, Turbo, and Stimulus, and the key considerations for those starting off before they lock in an architecture or tech decision. By the way, before we jump in, Scout Error Monitoring is coming!

Snowflake data visualization: all the latest features to monitor metrics, enhance security, and more

In 2020, we introduced the Snowflake Enterprise data source for Grafana, allowing users to seamlessly pull data from the Snowflake cloud-based data storage and analytics service into Grafana dashboards. Available for Grafana Enterprise and Grafana Cloud users, it’s a powerful way to not only query and visualize Snowflake data, but to do so alongside other data sources, so you can discover correlations and other meaningful insights within minutes.

Your AI Strategy Is Failing in the Seams

There’s a certain comfort in the glow of your network operations center (NOC) dashboards. For some time, the sign of a well-run NOC was that sprawling bank of screens, each dedicated to a different domain. One for the WAN, showing link status. Another for the data center, tracking backbone health. A third for cloud consumption, pulling metrics from your provider. Each screen is a neatly bordered kingdom, diligently monitored by its own set of tools. As long as the lights are green, all is well.

Announcing SystemEDGE 6.5

We are pleased to announce the general availability of SystemEDGE 6.5. For customers using DX NetOps, SystemEDGE is a key component for gaining a comprehensive view of server infrastructure health. It functions as an agent that resides on systems like physical servers or virtual machines. SystemEDGE collects fundamental performance and status information and delivers reports via SNMP.

LogicMonitor in Hybrid Environments: Observability with Edwin AI powered by AWS

As enterprises scale in complexity, the infrastructure landscape is no longer just cloud or on-premises, it’s both. Hybrid is the new normal and it’s here to stay. And with that shift comes a new demand: a unified, scalable observability solution that works across the entire tech stack, from legacy hardware to cloud-native workloads. That’s where LogicMonitor comes in.

Missing container-layer metadata: Why it happens and what you can do

Container image layers provide valuable insight into what goes into a container, including which packages were installed, what commands were run, and where vulnerabilities might live. The metadata associated with these image layers is essential for debugging, optimizing image size, and managing security risks. However, key container-layer metadata fields such as digest, size, and created_by are sometimes missing, which can disrupt important tasks.

ITRS named in Gartner Magic Quadrant for Observability Platforms

When Uptrends became part of ITRS, we knew we were joining a team deeply committed to innovation, precision, and people — whether those people were troubleshooting transaction journeys from their laptops at 8am or keeping enterprise-scale operations online 24x7. We’ve come far since then.

ScienceLogic Named a Visionary in the 2025 Gartner Magic Quadrant for Observability Platforms

It’s official: ScienceLogic has entered the observability arena. Named a Visionary in the 2025 Gartner Magic Quadrant for Observability Platforms, we believe we’re helping define where observability is heading, not just where it’s been. This marks our first inclusion in this Magic Quadrant and, in our opinion, validates our mission to redefine intelligent, actionable observability in the era of AI and automation.

IT Task Automation: Best Practices and Use Cases for IT Management with Pandora FMS

IT teams must handle a large number of tasks on a daily basis. Many of these tasks, while essential, are repetitive: resetting passwords, rebooting servers, monitoring logs for errors, applying patches… When performed manually, they can overwhelm technical staff and compromise operational efficiency. IT automation has emerged as the answer to this challenge. It involves using scripts and specialized tools to automatically execute these and other tasks that previously required human intervention.

Kubernetes Monitoring backend 2.2: better cluster observability through new alert and recording rules

We’re excited to announce version 2.2.0 of the backend for our Kubernetes Monitoring solution in Grafana Cloud is now available. The app’s backend is supported by kubernetes-mixin, an open source Prometheus Monitoring Mixin, and this latest version features significant improvements to alert rules and recording rules that will enhance your cluster observability and monitoring experience. There’s a lot to tell you about, so let’s dive in.

A look back at DASH 2025

DASH 2025 brought the Datadog community together like never before. During our biggest event yet, thousands of attendees gathered at the North Javits Center in New York City for two and a half days of content, learning, and community, where they deepened their knowledge and connected with peers. Here's a quick look back at some of the highlights from this year's DASH.

Proactively troubleshoot with synthetic testing and distributed tracing

As your application grows in complexity, identifying the root cause of issues becomes increasingly difficult. Many monitoring strategies make this even harder by siloing frontend and backend data. To effectively troubleshoot problems that spread across your app, you need visibility not just into each part of your stack, but also into how these parts interact.

Monitor agents built on Amazon Bedrock with Datadog LLM Observability

As large language models (LLMs) grow more powerful, organizations are deploying agentic AI applications to tackle complex, multi-step tasks. With Amazon Bedrock Agents, developers can orchestrate these agents to manage tasks such as triggering serverless functions, calling APIs, accessing knowledge bases, and maintaining contextual conversations—all while breaking down complex user requests or tasks into manageable steps.

Smarter Workflows, Faster Insights: How InfluxDB 3 Unlocks the Power of Python at the Source

Businesses across industries rely on time-stamped data to track system health, monitor performance, and improve operations. Whether it’s sensors on a factory floor or usage logs from a SaaS platform, time series data reveals how things change. As businesses digitize operations and add connected devices, sensors produce growing streams of time-based data. This opens the door to faster analytics and smarter automation. But legacy approaches can’t keep up.

If your site is slow, it might as well be down.

It’s no longer enough for a site to just be available; it has to be fast. If the experience lags, your customers will bounce within seconds. The consequences scale fast: business stops and revenue disappears. You need to monitor performance across the full delivery chain because speed is what keeps users engaged.

From Reactive to Proactive: A User-Centric Digital Strategy for Banks

In today's digital-centric banking environment, financial institutions must be able to provide seamless and reliable application performance across all digital channels - from a branch to a mobile device. Failure to do so results in real impact to customer satisfaction, trust, and loyalty. Modern banking applications are increasingly complex, running off of internet-centric distributed architectures involving many different parties and services. For these modern tech frameworks, traditional APM tools are no longer sufficient to ensure service reliability and optimal customer experience.

Cloudflare's Resolver Outage: More Than Just DNS

“It’s always DNS.” That’s the running joke in IT. When websites won’t load and apps grind to a halt, DNS—the internet’s address book—is often the first to get blamed. That’s because DNS translates human-friendly names like google.com into IP addresses that computers use to route traffic.

How to Troubleshoot Outages Faster Using Elastic Observability [2 Min Live Demo]

In this video, I’ll show you how Elastic Observability helps you reduce downtime, accelerate root cause analysis, and unify logs, metrics, and traces in one powerful dashboard. With native OpenTelemetry support, AI-powered troubleshooting, and built-in anomaly detection, you can streamline your workflows and boost service reliability.

Atatus APM: Full-Stack Visibility for Modern Engineering Teams 2025

APM stands for Application Performance Monitoring or Application Performance Management. It helps engineering teams track key metrics, detect slowdowns, and improve the overall performance of their applications. With Atatus APM, you get complete visibility into your application, from backend code and databases to external services and frontend performance.

Real-Time Alerting for AI-Optimized Data Centers

Kentik transforms real-time network telemetry into actionable alerts for AI-optimized data centers. By converting database queries into custom alerts, engineers can detect issues like elephant flows, idle links, and packet loss before performance suffers, and trigger alerts in systems like ServiceNow or PagerDuty.

Identifying Idle Paths in a Data Center Leaf-Spine Fabric

In a perfect leaf-spine network, traffic evenly spreads across all links. But reality is often different, leaving costly, idle paths hidden in your data center fabric. Kentik's Phil Gervasi demonstrates how Kentik's network intelligence platform helps engineers quickly identify and address these underutilized paths. With powerful visualizations, detailed telemetry analysis, and customizable alerts integrated into your ticketing systems, Kentik makes it easy to spot persistent traffic imbalances, troubleshoot ECMP issues, and optimize your infrastructure.

Arie's Adventures with Coroot

Arie van den Heuvel is an engineer, a System and Application Management Specialist, and a valued member of our community. Below he has shared his journey using Coroot, and how it has helped improve observability for his team. You can read more of Arie’s writing and support the resource articles he has created for open source on his blog.

Jaeger Metrics: Internal Operations and Service Performance Monitoring

You're monitoring a microservices-based system. Alerts trigger when response times exceed 2 seconds. But when you open Jaeger, you're faced with thousands of traces. Identifying which service or operation is responsible becomes time-consuming. Jaeger metrics help reduce this friction by exposing aggregated telemetry. Instead of scanning individual traces, you get service-level and operation-level performance metrics, latency, throughput, and error rates that highlight where the issue lies.
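To make the idea of aggregated telemetry concrete, here is a minimal sketch of rolling individual spans up into per-operation call counts, error rates, and a p95 latency. It is a toy illustration, not how Jaeger computes its metrics; the span fields (`service`, `operation`, `duration_ms`, `error`) and the sample data are assumptions made for this example.

```python
import math
from collections import defaultdict

def aggregate_spans(spans):
    """Group spans by (service, operation) and summarize each group."""
    groups = defaultdict(list)
    for span in spans:
        groups[(span["service"], span["operation"])].append(span)

    metrics = {}
    for key, group in groups.items():
        durations = sorted(s["duration_ms"] for s in group)
        # Nearest-rank percentile: the smallest value with at least 95% of
        # the samples at or below it.
        rank = max(1, math.ceil(0.95 * len(durations)))
        metrics[key] = {
            "calls": len(group),
            "error_rate": sum(1 for s in group if s["error"]) / len(group),
            "p95_ms": durations[rank - 1],
        }
    return metrics

# Hypothetical spans; in practice these would come from your tracing backend.
spans = [
    {"service": "checkout", "operation": "pay", "duration_ms": 120, "error": False},
    {"service": "checkout", "operation": "pay", "duration_ms": 2400, "error": True},
    {"service": "checkout", "operation": "pay", "duration_ms": 150, "error": False},
    {"service": "checkout", "operation": "pay", "duration_ms": 130, "error": False},
    {"service": "cart", "operation": "get", "duration_ms": 20, "error": False},
    {"service": "cart", "operation": "get", "duration_ms": 30, "error": False},
]

summary = aggregate_spans(spans)
for (service, operation), m in sorted(summary.items()):
    print(service, operation, m)
```

A summary like this points at the slow operation directly, so the individual traces only need to be opened once you know where to look.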

Splunk Named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms

We are proud to announce that Splunk has been named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms for the third year in a row. In our opinion, our recognition in the Observability category comes on the heels of Splunk being recognized for a tenth consecutive time as a Leader in the 2024 Gartner Magic Quadrant for Security Information and Event Management (SIEM). Splunk was the only vendor named a Leader in both SIEM and Observability for the Gartner Magic Quadrant three times.

Quantifying the True Cost of Healthcare IT Downtime

In today’s hospitals, technology is woven into every touchpoint of patient care. Nurses check vitals through digital monitors. Physicians review test results in the EHR. Medications get ordered, verified, and delivered through a network of connected systems. But when even one link in that chain fails, the impact isn’t just inconvenient—it’s dangerous. Downtime doesn’t just slow operations.

Introducing Coralogix's MCP Server: Helping customers build smarter AI agents

Now available: Secure, real-time access to your observability data via Coralogix’s Model Context Protocol (MCP) Server. AI agents are only as powerful as the context they’re given. Today, we’re excited to announce the launch of the Coralogix MCP Server, which enables third-party AI agents to connect directly to your observability data across production, staging, and other environments.

Getting Started Guide with Netdata

New to Netdata? Start here. In this quick and practical guide, we’ll help you get set up and confident with Netdata in just a few minutes. You’ll learn how to: access your Netdata Space; connect your nodes (servers, VMs, containers, network devices, and more); organize your infrastructure with Spaces and Rooms; collaborate with your team in real time; explore alerting and integrations; and customize notifications so you’re only alerted when it truly matters.

Grok 4 Sets Records - But I'm Focused on Microsoft's 9% Sales Growth

The recent launch of Grok 4 has set the AI community buzzing. With an impressive score of 73 on TLDR’s AI benchmark, Grok 4 edges ahead of OpenAI’s o3 and Google’s Gemini 2.5 Pro, both scoring 70. Elon and the xAI team deserve praise for this breakthrough, reinforcing Grok 4’s status as potentially the most powerful LLM yet.

Why MSPs Can't Afford to Ignore DNS Monitoring

Most MSPs don’t think much about DNS—until something breaks. A record is deleted, an MX entry is misconfigured, or a zone is out of sync. Suddenly, your client’s email is bouncing, their site is down, and your phone is ringing. The problem? DNS issues are easy to miss. They don’t always trigger alerts, logs, or tickets. But when they surface, you’re the one your client calls first.

How to analyze Core Web Vitals in Grafana Cloud Frontend Observability

One of the biggest challenges in frontend development is understanding how users actually experience your application. Slow load times, layout shifts, and a slow response to user interactions can quietly degrade the user experience if they go unnoticed. This is where Grafana Cloud Frontend Observability comes in. Frontend Observability is a hosted service for real user monitoring (RUM) that provides immediate, clear, and actionable insights into the end user experience of web applications.

LDI Connect Steps Up Microsoft 365 and Teams Managed Services with Martello's Vantage DX

LDI Connect is taking its managed IT services to the next level by adding Martello’s Vantage DX platform to its toolkit. This move is all about giving clients a smoother, more reliable Microsoft 365 and Teams experience … and it couldn’t come at a better time. By integrating Vantage DX, LDI Connect can now proactively monitor clients’ Microsoft environments.

How to Get Grafana Iframe Embedding Right

Adding Grafana dashboards directly into your app lets users see monitoring data without switching tabs or tools. Using an iframe to embed Grafana does work, but it brings along some tricky authentication and security issues that aren’t always obvious at first. In this blog, we’ll go over the practical ways to embed Grafana dashboards from easy public snapshots to secure, private dashboards that need authentication.

Optimize LangChain Performance with Trace Analytics

You’ve instrumented your LangChain app, and traces are now flowing into Last9. Now the issues are visible: API costs are crossing $200/day, average response times exceed 3 seconds, and performance degrades under 100 concurrent users. A single tool call adds over 2 seconds. Bloated context windows are pushing up token usage, wasting $50/day. Here’s how to use trace data to identify and fix these inefficiencies, systematically and at scale.

Generating end-to-end tests with AI and Playwright MCP

When I started using Playwright, there was a single command that blew me away. I immediately became (and still am) a huge Playwright Codegen fanboy. Playwright's codegen command opens up a browser window, and whatever you do in this window will be recorded. Navigating URLs, clicking links, and filling out form elements—the Playwright inspector records all your actions and generates a Playwright test for you. Magic!

The Fast Path to More Useful Telemetry

Over and over, we’ve seen that teams who invest in adding rich, relevant context to their telemetry end up debugging faster and collaborating more effectively during incidents. Getting meaningful context added can feel like a big cross-team project, but some of the highest-leverage improvements don’t require app code changes or coordination across services.

Observability as Code: Why You Should Use OaC

In the fast-moving world of CI/CD pipelines, microservice architectures, and container orchestration, software changes rapidly. What exists in a codebase today might be gone next week. At this scale and speed, it’s impossible for development teams to manually track every line of code and every new piece of functionality.

Uptrace v2.0: The Future of Observability is Here

The Uptrace team is thrilled to announce the release of v2.0—our biggest update yet! This release represents a complete reimagining of how observability data should be stored, queried, and managed. With multi-project support, revolutionary JSON-based storage, powerful data transformations, and a host of developer-friendly features, Uptrace v2.0 is designed to scale with your growing infrastructure needs.

Microsoft SCOM Management Pack Housekeeping in Secured, Offline, or Air-Gapped Environments

MP Catalog Offline Toolkit by NiCE: 20-minute walkthrough. Struggling with Management Pack updates in restricted environments? Discover how the MP Catalog Offline Toolkit by NiCE simplifies SCOM MP management without the need for an internet connection. Watch the 20-minute walkthrough now and see how this free tool helps your SCOM team stay compliant, efficient, and secure. Download it now on GitHub, absolutely free, from the experts at NiCE.

Best on-call scheduling tools in 2025 [10 reviewed]

Managing developer on-call rotations and escalations isn't just about who gets woken up at 2 a.m. — it's about ensuring reliability, minimizing downtime, and scaling operational excellence. With so many tools out there, choosing the right on-call solution can be tough. We've analyzed 10 of the most trusted on-call scheduling platforms in 2025 — comparing usability, pricing, integrations, automation, and support — to help you choose the best tool for your engineering or DevOps team.

ManageEngine is recognized as a Strong Performer in 2025 Gartner Peer Insights Voice of the Customer for Digital Experience Monitoring

We're thrilled to announce that we have been recognized as a Strong Performer in the 2025 Gartner Peer Insights Voice of the Customer for Digital Experience Monitoring (DEM). We think that this recognition results from direct customer feedback on their experience with our solutions, underscoring the trust and value users associate with them.

Enhancing authentication security: Inside Microsoft's open source contribution to Grafana

When Microsoft engineers went looking for a modern visualization platform to help track critical signals and make quicker decisions, Grafana emerged as the clear favorite. But there was just one hitch: the available authentication methods didn’t quite meet their needs.

Is Your Network Automation Strategy Already Obsolete?

You know the feeling. It’s that familiar rhythm of playing defense, racing from one network fire to the next. The alerts pile up, users report slowdowns, and your team of brilliant engineers spends its days tracing packets instead of focusing on the future. For years, automation has been the answer. You’ve built scripts and workflows to handle repetitive tasks, which has certainly helped.

Introducing DX NetOps Topology: What It Provides, How It Works

Networks aren’t what they used to be. While your network operations teams still have legacy equipment to manage, they’re also contending with the expanded reliance on software-defined networking (SDN), hybrid and multi-cloud architectures, private clouds, and more. These environments are anything but static. They’re sprawling, dynamic, and evolving faster than ever—which means that establishing and retaining visibility and control is more challenging than ever.

Here's the proof: What the fastest sites on the web have in common

60% of Gen Z won’t engage with a slow-loading website. In today’s digital economy, that’s a deal-breaker. Whether it’s a banking portal, a travel app, or an AI-powered SaaS platform, users expect performance. Instant loading, global reliability, and smooth interactivity aren’t just nice to have—they define the winners.

Elasticsearch with Python: A Detailed Guide to Search and Analytics

If you’re using Python for search, log aggregation, or analytics, you’ve probably worked with Elasticsearch. It’s fast, scalable, and fairly complex once you go beyond the basics. The official Python client gives you raw access to Elasticsearch’s REST API. But getting it to work the way you want, especially under load, can be tricky. This blog walks through practical ways to index, query, and monitor Elasticsearch from Python code, without getting lost in the docs.

Get started with Grafana Alerting: Create and receive your first alert

In this tutorial, we walk you through the process of setting up your first alert in just a few minutes. Don't miss the rest of the "Get started with Grafana Alerting" series! Each part dives into a different feature to help you get the most out of alerting in Grafana.

Datadog vs Jaeger - Features, Pricing & Use Cases [Updated for 2025]

Datadog and Jaeger are both leading tools in the observability space, but they represent two fundamentally different philosophies. Datadog is a commercial, all-in-one SaaS platform that unifies metrics, traces, and logs. Jaeger is a popular, open-source project focused specifically on distributed tracing. Choosing between them isn't just a technical decision; it's about balancing the convenience of a fully managed, integrated platform against the power and control of a self-hosted, specialized tool.

Why APM Is Essential for Microservices Architecture?

According to Statista, over 85% of large enterprises and nearly 50% of small to midsize businesses will have adopted microservices as part of their software architecture. The shift is clear: organizations of all sizes are moving away from monolithic applications toward microservices to accelerate development cycles, improve scalability, and support continuous delivery. But this architectural freedom comes with a hidden cost: increased operational complexity.

Show your work: Prove your MSP value

It’s one of those unspoken facts of business that the better something is managed, the less people notice. Everything just glides along. That’s a double-edged sword for managed service providers. On the one hand, you want to give clients a frictionless experience. On the other, you want them to know that’s what they’re getting — and how much you put in to deliver it.

When Do You Really Need SNMP Device Monitoring?

In the world of network monitoring, SNMP is the tried-and-true protocol that’s been helping IT teams monitor device health for decades. Whether you're managing switches, routers, firewalls, or access points, SNMP Device Monitoring remains one of the most widely used methods for tracking device performance and status. At Obkio, one of the most common questions we hear from prospects is: “Do you offer SNMP device monitoring?” The short answer: Yes, absolutely.

Want to hear your users' complaints? There's a widget for that (now available on mobile)

A disappearing “Submit” button. A modal stuck half-offscreen. It's not a crash or a performance regression. Just broken UX. Frustrating enough to make users rage-quit or leave a 1-star review. Error and performance monitoring catch the technical stuff: crashes, bottlenecks, slow APIs. But they won’t tell you when a layout breaks, or a UI flow subtly unravels after a redesign.

What Is Hybrid Observability? A Healthcare IT Explainer

Healthcare IT environments have become incredibly complex. Think about everything running simultaneously in your organization: physical medical devices, cloud platforms, clinical applications like Epic, and patient-facing applications. Each component needs to work together seamlessly, much like how ICU monitors track multiple vital signs at once. Many healthcare organizations still use monitoring solutions designed for simpler times, when systems were more isolated.

Grafana Labs named a Leader again in the 2025 Gartner Magic Quadrant for Observability Platforms

We’re thrilled to share that Grafana Labs has been recognized as a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms—for the second year in a row. This year’s report placed Grafana Labs furthest in “Completeness of Vision,” which we believe reflects our deep commitment to building a truly open, composable observability stack that gives users flexibility, control, and the tools to own their observability strategy.

Elastic named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms

Observability has an investigation problem, and dashboards and alerts aren’t enough for solving problems in today’s complex systems. AI-driven capabilities, powerful analytics, and the ability to scale are essential to drive real-time investigations while keeping costs low. We think this is why Elastic has been named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms for the second time.

Beyond Metrics: How We Reimagined Incident Response with RUM

When your monitoring tools and logs tell you everything's fine, but users can't access critical healthcare services, where do you look? Our team discovered that Real User Monitoring (RUM) isn't just for tracking page load times and user journeys – it's a powerful incident response tool that can uncover issues traditional monitoring misses entirely.

Introducing the Hyperping Intercom Integration: Reduce Support Tickets with Proactive Status Communication

"Is our API down?" "Why can't I access the dashboard?" "Are you having server problems?" When incidents happen, support teams face a familiar nightmare: tickets flood in faster than you can respond. Your team scrambles to check system status and respond to dozens of identical questions while engineering focuses on fixing the actual problem.

Top 3 reporting tools for Microsoft Teams: SquaredUp, Power BI & M365 Admin Center

Microsoft Teams is a ubiquitous presence in workplaces all over the world. Prior to 2020, its usage was relatively moderate, with around 20 million users. However, global restrictions during the pandemic led to a 3,500% growth. Teams is now so central to business operations that Microsoft retired Skype in its favor. But this massive scale created a new problem – businesses needed better ways to monitor and report on their Teams usage.

Getting started with ElasticSearch dashboards

ElasticSearch is one of the IT and software industry’s most established platforms for storing and analyzing log data. As its name suggests, it also has a powerful search and analytics engine based on the ElasticSearch Query language. ElasticSearch itself is essentially a backend store, so if you want to explore and analyze your data, you will need a visualization layer such as SquaredUp and our ElasticSearch PlugIn.

Visibility Is the First Line of Defense: Operational Readiness in a Zero Trust World

As global cyber threats continue to evolve at unprecedented speed, the United States public sector faces growing pressure to enhance operational readiness. Agencies must now contend with adversaries who are not only well-funded but also increasingly sophisticated in their ability to exploit visibility gaps. In the face of this dynamic threat landscape, the Zero Trust Architecture (ZTA) model has become an essential security framework.

How We Made Our Queries 99.5% Faster

We cut log-query scanning from ~100% of data blocks to < 1% by reorganizing how logs are stored in ClickHouse. Instead of relying on bloom-filter skip indexes, we generate a deterministic “resource fingerprint” (a hash of cluster + namespace + pod, etc.) for every log source and sort the table by this fingerprint in the primary-key ORDER BY clause. This packs logs from the same pod/service contiguously, letting ClickHouse’s sparse primary-key index skip irrelevant blocks.
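The fingerprint idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation described in the post: the choice of SHA-256, the 16-character truncation, the separator, and the function name `resource_fingerprint` are all hypothetical, and the in-memory sort only stands in for what the table's ORDER BY does on disk.

```python
import hashlib

def resource_fingerprint(cluster: str, namespace: str, pod: str) -> str:
    """Return a deterministic hex fingerprint for one log source."""
    # Join with a control character that cannot appear in the names, so
    # ("a", "bc", ...) and ("ab", "c", ...) hash to different values.
    key = "\x1f".join((cluster, namespace, pod))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

# Hypothetical log rows arriving interleaved from two sources.
logs = [
    {"cluster": "prod", "namespace": "web", "pod": "api-1", "msg": "started"},
    {"cluster": "prod", "namespace": "db", "pod": "pg-0", "msg": "checkpoint"},
    {"cluster": "prod", "namespace": "web", "pod": "api-1", "msg": "ready"},
]

def row_key(row):
    return resource_fingerprint(row["cluster"], row["namespace"], row["pod"])

# Sorting by fingerprint places rows from the same source next to each other,
# which is the property a sparse index needs in order to skip whole blocks.
rows = sorted(logs, key=row_key)
for row in rows:
    print(row_key(row), row["pod"], row["msg"])
```

Because the hash depends only on the resource attributes, every log line from the same pod gets the same fingerprint no matter when or where it is ingested.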

How to improve your observability

Coroot was designed to solve the problem of time-consuming root cause analysis. It handles the full observability journey - from collecting telemetry automatically with zero code setup (thanks, eBPF!) to simplifying the role of SREs and DevOps everywhere with instant root cause analysis powered by AI. We also strongly believe that simple observability should be an innovation everyone can afford to benefit from: which is why our software is open source!

Cloud Log Management: A Developer's Guide to Scalable Observability

As systems move to microservices, serverless, and multi-cloud setups, debugging gets harder. You’re no longer dealing with a single log file; you’re looking at logs from dozens of services, running across different environments. Traditional debugging methods like SSH-ing into servers or adding print statements don’t scale in these environments. Cloud log management tools help by collecting logs from all your services into one place.

What is Log Loss and Cross-Entropy

You're building a classification model, and your framework throws around terms like "log loss" and "cross-entropy loss." Are they the same thing? When should you use binary cross-entropy versus categorical cross-entropy? What about focal loss? This blog breaks down these loss functions with practical examples and real-world implementations.
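As a small worked example of how the terms relate: for binary labels, log loss is the binary cross-entropy, and it matches categorical cross-entropy evaluated on the equivalent two-class one-hot encoding. The sketch below uses only the standard library; the function names, the `eps` clipping value, and the sample labels and probabilities are assumptions chosen for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    """Mean log loss for 0/1 labels and predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-15):
    """Mean cross-entropy for one-hot labels and per-class probabilities."""
    total = 0.0
    for target, probs in zip(y_onehot, y_prob):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))
    return total / len(y_onehot)

# Two samples: the first is class 1, the second is class 0.
y_true = [1, 0]
p_class1 = [0.9, 0.2]

# The same data written as two-class one-hot targets and [P(0), P(1)] rows.
y_onehot = [[0, 1], [1, 0]]
p_both = [[0.1, 0.9], [0.8, 0.2]]

bce = binary_cross_entropy(y_true, p_class1)
cce = categorical_cross_entropy(y_onehot, p_both)
print(bce, cce)
```

The two values agree up to floating-point rounding, which is why frameworks often use the names interchangeably in the two-class case.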

Datadog named Leader in 2025 Gartner Magic Quadrant for Observability Platforms

We are thrilled to announce that, for the fifth consecutive year, Datadog has been named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms. We believe that this recognition reflects our continued focus on helping customers observe, secure, and act on everything that matters across their technology stack.

What Are Traces? A Developer's Guide to Distributed Tracing

One of the most common challenges in modern software engineering today is understanding how requests flow through applications. As system architectures shift to favor widely distributed, cloud-native designs, keeping track of how an application processes user actions is more difficult than ever. A single user action may trigger events processed in dozens of backend services. Traces are helping software developers today with this challenge.

The Inconvenient Truth About AI Ethics in Observability

Let's be honest: most conversations about AI ethics sound like they're happening in a boardroom, not an ops room. But here's the thing: when you're using AI to make sense of your telemetry data, ethics isn't some abstract concept. It's the difference between insights you can trust and algorithmic noise that leads you down the wrong path. The uncomfortable reality? Your AI is only as ethical as the messiest, most biased piece of telemetry data you feed it. And if you think your data is clean, well...

Grafana Labs is a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms

For the second year in a row, Grafana Labs has been named a Leader in the Gartner Magic Quadrant for Observability Platforms — and this year, we’re proud to be recognized as the furthest in Completeness of Vision. In this video, Grafana Labs CTO Tom Wilkie shares what this recognition means, why our scores for execution and vision both improved, and how it reflects years of building a truly open, composable observability stack.

Coralogix | Magic Quadrant 2025

Today marks an exciting moment for all of us at Coralogix. We’re proud to share that Gartner has named us a Visionary in the 2025 Magic Quadrant for Observability Platforms. This recognition, we believe, reflects what we’ve been building toward for years: an observability platform that delivers scale, cost-efficiency, AI-powered insights, and tangible customer success.

Here's how to add business data to logs from retail endpoints | Datadog Tips & Tricks

Some sources simply do not generate data-rich logs. Retail endpoints that are older or run on proprietary services, for example, very often produce logs without the kinds of data that are needed to perform useful business analytics. So, what can you do?

Practical guide to implement and succeed in configuration and change management

In an era where networks are the arteries of every enterprise, ensuring they run smoothly is non-negotiable. From small branch offices to sprawling data centers, a single misstep in configuration can trigger costly downtime, security breaches, and compliance headaches.

OpenTelemetry Collector: A Complete Guide [2025]

The OpenTelemetry Collector is a stand-alone service that acts as a powerful, vendor-neutral pipeline for your telemetry data. It can receive, process, and export logs, metrics, and traces, giving you full control over your observability data before it reaches a backend. This guide will provide a comprehensive overview of the OpenTelemetry Collector, its architecture, deployment patterns, and how to configure it for production use.

Kubernetes Monitoring 101: 25 Tools And Must-Know Tips

The Kubernetes platform is the standard for orchestrating containerized applications. It’s ideal for large applications running on distributed instances. However, monitoring Kubernetes infrastructure can be notoriously challenging. This guide will cover Kubernetes monitoring in more detail, including what metrics to track to improve visibility and control over your K8s containers, apps, microservices, etc.

Not all monitoring sees what your users are seeing.

APM tools are great, but they have blind spots; they do not monitor from where your users actually are. There’s a gap between lab-perfect APM tests and real-world experience. There’s a lot that can degrade performance between your cloud environment and your users. If you’re not monitoring that path, you’re missing critical context.

Observability's Moneyball Moment: How AI Is Changing the Game (Not Ending It)

We're not witnessing the end of observability, we're witnessing its evolution into something far more powerful. The observability industry is having its Moneyball moment. Just like Billy Beane revolutionized baseball by using data analytics to compete with teams that had vastly larger budgets, observability is undergoing a fundamental transformation.

Honeycomb Users Are Living in the Future, Part 1: Sampling

When we talk to new Honeycomb users, a few things stand out as sounding downright magical. Sometimes we’ll hear, “Wow, is that a new feature?” and we’ll say that no, it’s been like that for years. Clearly we need to get the word out! This is the first installment of a blog series I’ll be writing, covering areas of Honeycomb that elicit reactions of awe and disbelief from new users.

Lumigo Launches AI Agent Observability

LLM-powered agents are reshaping software, but when they fail, troubleshooting is guesswork. Lumigo’s new AI Agent Observability, now in beta, gives you visibility into the entire lifecycle of your agents, from prompt to response to internal decision logic. Built for modern AI workloads, this feature is designed to help engineers monitor, debug, and optimize agents running on platforms like OpenAI, Anthropic, and open-source models.

Top 5 MSP takeaways from the 2025 IT Trends Report

Earlier this year, Auvik released our annual IT Trends Report, spotlighting some of the key changes for network management, MSP, and IT practitioners. We know the market and its ups and downs can have a huge impact on the success of MSPs, so we’re bringing you a roll-up of key statistics and findings related to MSPs specifically. Read on to see what we found.

How to Get Logs from Docker Containers

When a container misbehaves, logs are the first place to look. Whether you're debugging a crash, tracking API errors, or verifying app behavior—docker logs gives you direct access to what's happening inside. This blog covers the full workflow: how to retrieve logs, filter them by time or service, and set up logging for production environments.
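As a rough sketch of retrieving and time-filtering logs from a script, the helper below builds a `docker logs` command using the CLI's `--since`, `--tail`, and `--timestamps` flags and runs it with `subprocess`. The function names and the container name `web` are hypothetical, and actually fetching logs assumes the Docker CLI is installed and the container exists.

```python
import subprocess
from typing import List, Optional

def docker_logs_cmd(container: str, since: Optional[str] = None,
                    tail: Optional[int] = None,
                    timestamps: bool = False) -> List[str]:
    """Build the argument list for a `docker logs` invocation."""
    cmd = ["docker", "logs"]
    if since is not None:
        cmd += ["--since", since]    # e.g. "10m" or an RFC 3339 timestamp
    if tail is not None:
        cmd += ["--tail", str(tail)]  # only the last N lines
    if timestamps:
        cmd.append("--timestamps")
    cmd.append(container)
    return cmd

def fetch_logs(container: str, **options) -> str:
    """Run `docker logs` and return the container's combined output."""
    result = subprocess.run(docker_logs_cmd(container, **options),
                            capture_output=True, text=True, check=True)
    # `docker logs` replays the container's stdout and stderr on the
    # matching streams, so both are needed for the full picture.
    return result.stdout + result.stderr

# "web" is a placeholder container name for this example.
cmd = docker_logs_cmd("web", since="10m", tail=100)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when container names or timestamps contain unusual characters.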

Troubleshooting LangChain/LangGraph Traces: Common Issues and Fixes

We’ve covered how to get LangChain traces up and running. But even when everything’s instrumented, traces can still go missing, show up half-broken, or look nothing like what you expected. This guide is about what happens after setup, when traces exist, but something’s off.

Troubleshoot root causes with GitHub commit and ownership data in Error Tracking

When an error occurs, developers need to act quickly. But too often, they’re left searching through stack traces without enough context to understand what happened, who owns the code, or what change may have introduced the issue. This slows down triage, creates inefficient handoffs, and takes time away from building new features.

Monitor your LiteLLM AI proxy with Datadog

As organizations rapidly scale their use of large language models (LLMs), many teams are adopting LiteLLM to simplify access to a diverse set of LLM providers and models. LiteLLM provides a unified interface through both an SDK and proxy to speed up development, centralize control, and optimize LLM-powered workflows. But introducing a proxy layer adds abstraction, making it harder to understand how requests are processed.

Reduce your mean time to repair with the Datadog mobile app

For on-call engineers responding to alerts, every minute counts. Faster incident response means faster mitigation, reduced downtime, and better customer experience. But even the most finely tuned, meticulously detailed alerts can leave responders scrambling for more information. In order to effectively triage and investigate incidents and set remediation in motion, responders need data to help them contextualize alerts.

How to turn logs into metrics with Grafana Loki (Loki Community Call July 2025)

Cyril Tovena shows us how to turn logs into metrics with Grafana Loki using metric queries in LogQL. What do you do when all you have are logs, but you want to count them, aggregate them, or parse them for numbers you want to graph? Well, there's a query for that! Cyril is joined by Jay Clifford and Nicole van der Hoeven to discuss everything you need to know about metric queries and how to use them to get numbers out of Loki.

Dashboard Sharing - The Hard Way

Unlike menu items, dashboards in Icinga Web 2 currently can’t be shared across users. This is something we will implement in future versions, but for now users can only create dashboards for themselves. We don’t have an exact timeline for the dashboard sharing feature yet and our roadmap is already pretty packed for this year, so we won’t be tackling this until later next year.

Introducing the InfluxDB 3 MCP Server: Natural Language for Time Series

Time series data underpins all real-time systems. From high-resolution telemetry to long-range trends, it’s essential for monitoring, automation, predictive maintenance, and operational insight. But it’s also hard to work with: high cardinality, shifting schemas, and time-based queries make even basic tasks feel heavy.

What You Actually Need to Monitor AI Systems in Production

You did it. You added the latest AI agent into your product. Shipped it. Went to sleep. Woke up to find it returning a blank string, taking five seconds longer than yesterday, or confidently outputting lies in perfect JSON. Naturally, you check your logs. You see a prompt. You see a response. And you see nothing helpful. Surprise. Prompt in and response out is not observability. It is vibes.

Observability for containerized workloads: How to run Grafana Beyla as a sidecar in Amazon ECS

Note: Grafana Beyla has been donated to OpenTelemetry under the new project name OpenTelemetry eBPF Instrumentation. Beyla will continue to exist as Grafana Labs’ distribution of the upstream project. Grafana Beyla is an open source eBPF-based auto-instrumentation tool that helps you easily get started with application observability, allowing you to monitor and visualize traces without modifying the application code.

How OutboundSync Improved Transparency with StatusGator

OutboundSync, a powerful platform that helps marketers sync outbound sales data to CRMs like HubSpot and Salesforce, knows that transparency is a key component of delivering a service that hundreds of teams rely on. And as an integration platform, OutboundSync is deeply reliant on other providers, making vendor reliability a key part of their own transparency.

Getting started with VMware dashboards

VMware is a leading platform for virtualization and cloud infrastructure, widely used to manage compute, storage, and networking resources across on-premises and hybrid environments. While it offers powerful capabilities and extensive telemetry through tools like vCenter, navigating this data can be overwhelming – especially when trying to spot performance issues, capacity trends, or VM sprawl in real time. That’s where a solution like SquaredUp can make a significant difference.

Customizing your Azure DevOps DORA metrics dashboard

Looking to configure and customize a DORA metrics dashboard? Our Director of Engineering Services, Tim Wheeler, demonstrates how to customize the DORA Metrics dashboard in Azure DevOps for SquaredUp. He shows how to populate key metrics like deployment frequency and change failure rate by selecting a pipeline, specifically the SquaredUp multi-stage pipeline.

Notes from the Field: Seamless SSO 404s Impacting Citrix on Windows Server 2025

As a Citrix consultant, not every issue I troubleshoot is directly tied to Citrix, but many of them dramatically impact the end-user experience. This is one of those cases. A customer had begun testing Windows Server 2025 as Multi-Session hosts in their environment. The new servers were domain-joined and fully patched, and they expected a smooth experience with Office 365, Entra ID–backed apps, and cloud-based authentication. Everything had worked flawlessly on Server 2022.

How to Block an External Attack with FortiGate and Progress Flowmon ADS

It’s a question we hear often - how do we use the Progress Flowmon solution to block an attack? Flowmon is not an inline appliance that stands in the path of inbound traffic, so we partner with third-party vendors who supply equipment such as firewalls or unified security gateways. In this post, we’re going to show you how to instruct Fortinet’s firewall FortiGate via Flowmon ADS to block traffic in response to a detected anomaly or attack.

The Real Business Value of Time Series Database

Time series data powers nearly every modern system, from industrial equipment and energy grids to financial platforms and digital services. Devices and software continuously generate streams of time-stamped metrics that reflect how systems perform moment to moment. Most businesses collect this data, but far fewer utilize its full potential. Storing information and reviewing dashboards offers limited value.

Ensure the availability of critical services with the Opslogix Core Windows Service Management Pack

In a typical SCOM environment, a lot of the Management Packs are designed to monitor services tied to a specific technology, such as SQL Server, IIS, or the Windows operating system itself. But what about services that don’t belong to any particular application but are essential across all servers?

Enforce configuration standards with the Opslogix Compliance Management Pack

Maintaining compliance is not just a matter of policy, it is a matter of operational stability and security. But with so many moving parts, configuration drift is almost inevitable. The Opslogix Compliance Management Pack helps identify these deviations by continuously verifying key system configurations and alerting when they fall out of alignment.

From Weeks to Hours: How Technical Teams Are Driving Fast ROI

Speed is no longer a luxury in IT operations—it’s a requirement. When systems falter, alerts spike, or new services go live, time becomes the most valuable resource. And yet, many IT teams are still shackled to tools and processes that take weeks—or months—to show measurable value. The question technical leaders increasingly ask is: How fast can we get value? Not just dashboards. Not just data.

Improve Consistency Across Signals with OTel Semantic Conventions

It’s 2 AM. Your API is timing out. Logs show a slow query. Metrics flag a spike in DB connections. Traces reveal a 5-second delay on a database call. But then the questions start: Which database? Does the query match the delay? Why doesn’t this align with the connection pool metrics? Each tool uses different labels: db.name, database, sometimes nothing at all. Without a shared schema, connecting the dots is slow and frustrating.
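As a rough illustration of the shared-schema idea, the sketch below renames label variants to a single key before comparing signals. The alias table, the sample records, and the choice of `db.name` as the shared key are assumptions made for this example; they are not taken from the OpenTelemetry specification.

```python
# Hypothetical alias table: maps label variants onto one shared key.
# Choosing "db.name" as the shared key is an assumption for this sketch.
ALIASES = {"database": "db.name"}


def normalize(attributes):
    """Return a copy of attributes with known aliases renamed to the shared key."""
    return {ALIASES.get(key, key): value for key, value in attributes.items()}


# Made-up records standing in for a log line, a metric point, and a span.
log_record = {"database": "orders", "message": "slow query"}
metric_point = {"db.name": "orders", "connections": 48}
span = {"db.name": "orders", "duration_ms": 5000}

signals = [normalize(s) for s in (log_record, metric_point, span)]

# Once every signal uses the same key, checking that they refer to the
# same database is a one-line comparison.
same_database = len({s.get("db.name") for s in signals}) == 1
```

With a shared key in place, "which database?" becomes a lookup rather than a guess across three differently labeled tools.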

How Replicas Work in Kubernetes

Replicas in Kubernetes control how many copies of your pods run simultaneously. They're the foundation of scaling, availability, and recovery in your cluster. When you're running a stateless API or a background worker, understanding how replicas work directly impacts your application's reliability and performance. This blog walks through replica management, from basic concepts to production monitoring patterns that help you maintain healthy, scalable applications.

See System Logs Alongside your Metrics Using Loki, Grafana, and Graphite

In this quick demo, we show how you can transform logs collected by Grafana Loki into actionable Graphite metrics using MetricFire. Watch as we convert structured logs into performance insights. Perfect for teams looking to bridge the gap between logging and monitoring. This workflow helps you move beyond basic log storage and turn raw logs into meaningful metrics for alerts, dashboards, and capacity planning.

An open-source SDK for finding dead code

Writing code is easier than ever. We want to make deleting code just as easy – introducing Reaper for iOS and Android. Reaper was an Emerge Tools product that helped companies like Duolingo delete 1% of their iOS codebase. And just like with Emerge Tools’ Launch Booster, we’re making Reaper open-source for anyone to use. In this post, we’ll explain what Reaper is, why you should care about dead code, and how Reaper works on both platforms.

How a Fortune 500 Company Eliminated 93% of IT Incidents in 72 Hours

Sometimes the biggest transformations begin with what sounds like the worst possible news. One day, this Fortune 500 technology company’s observability platform was running smoothly. The next, they learned their critical monitoring solution would be discontinued as part of a corporate buyout. For a leading global IT vendor in data infrastructure serving customers across storage, cloud, and managed services, this was a potential catastrophe.

Observability in under 5 seconds: Reflecting on a year of grafana/otel-lgtm

With grafana/otel-lgtm, observability is just one Docker command away. Over the past year, grafana/otel-lgtm has simplified observability setups, helping developers get a complete OpenTelemetry stack running in under five seconds. With integrations for metrics, logs, traces, and now profiles via Grafana Pyroscope, it has become a go-to solution for demos, development, and testing, as evidenced by its growing community (1k stars on GitHub and growing!) and notable adopters.

How They Handle 44 Million Searches a Day...Without Breaking! | Rightmove and Elastic

Rightmove, the UK's number one property search and buying and selling platform, has trusted Elastic for more than 11 years. Hear Andrei Nicusan, Principal Engineer at Rightmove, on why Elastic has been Rightmove's number one Search and Observability solution for more than a decade. And now with the move to Elastic Cloud and Google Cloud Platform, you can find out how Rightmove are taking advantage of reductions in their infrastructure overheads too!

Introduction to Kafka Scaling Challenges

Apache Kafka has become the go-to platform for organizations handling high-throughput, real-time data streaming. Its ability to manage massive data volumes while ensuring reliability is second to none. However, as businesses grow and demand for data increases, scaling Kafka isn’t always a walk in the park. It often comes with its own set of challenges that can throw even the most seasoned teams for a loop.

We now support Google Chat

I'm pleased to share that we can now notify you via Google Chat. You can read more on how to set up Google Chat notifications in our docs. Of course, we also offer numerous other channels to notify you when something is wrong with your site.

Introducing MetricFire Logging: Visualize Logs Alongside Metrics

As modern infrastructure grows more dynamic and distributed, collecting logs alongside metrics becomes a critical part of any observability strategy. To make this easy and powerful, MetricFire now supports a direct logging pipeline using Grafana Loki. This allows you to forward system logs from your servers to Hosted Graphite's Loki backend and visualize them in your Hosted Grafana dashboards with full control over queries, filtering, and alerting.

The Defense-in-Depth Approach To Application Monitoring

In cybersecurity, defense-in-depth is a fundamental principle – you never rely on a single security measure to protect your systems. The same philosophy applies to application monitoring. No single monitoring approach, no matter how sophisticated, can capture every possible failure mode of your application. This is why layered monitoring isn't just a best practice – it's essential risk mitigation.

Announcing Checkly Uptime Monitors: Simple, Scalable, and Built for Developers

When Checkly launched, it was the first of its kind, enabling developers to monitor complex workflows more easily than ever using the automation tooling (Playwright, Terraform, etc.) they already knew and loved. We’ve helped detect and resolve issues for thousands of companies, ranging from monitoring crucial log-ins, to purchasing products, to setting up client instances for millions of monthly users. But what about the simpler stuff?

Global API downtime increases by 60% in 2025, new data shows

London, 8 July 2025: Global API downtime increased by 60% in Q1 2025 compared to Q1 2024, shows new data from web service monitoring provider Uptrends, part of ITRS’ comprehensive observability platform. The State of API Reliability 2025 report, based on over 2 billion API monitoring checks across 20 industries in Q1 2024 and Q1 2025, reveals a year-on-year drop in average API uptime from 99.66% to 99.46%, a decline of 0.2 percentage points.
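A small uptime drop and a large downtime increase can describe the same change. The sketch below reworks the arithmetic from the two rounded uptime figures quoted above; the function name is illustrative, and the report's own 60% figure was presumably computed from unrounded data.

```python
def downtime_pct(uptime_pct):
    """Downtime is whatever share of the period was not uptime."""
    return 100.0 - uptime_pct


uptime_q1_2024 = 99.66
uptime_q1_2025 = 99.46

before = downtime_pct(uptime_q1_2024)  # about 0.34% downtime
after = downtime_pct(uptime_q1_2025)   # about 0.54% downtime

# Absolute change in uptime, in percentage points.
point_drop = uptime_q1_2024 - uptime_q1_2025

# Relative change in downtime, in percent.
relative_increase = (after - before) / before * 100

print(f"uptime drop: {point_drop:.2f} percentage points")
print(f"downtime increase: {relative_increase:.0f}%")
```

From the rounded averages the relative increase comes out at roughly 59%, which is in line with the headline figure of 60%.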

Coralogix Expands AWS Partnership to Deliver AI-Driven Observability and Edge Threat Detection

Coralogix is proud to announce a new phase in its partnership with AWS through a Strategic Collaboration Agreement (SCA) focused on bringing AI-powered observability and security to the enterprise. At the heart of this collaboration is Amazon Bedrock, AWS’s managed service for foundation models.

Introduction to Apache Kafka Scaling Challenges

Apache Kafka has become the go-to platform for organizations handling high-throughput, real-time data streaming. Its ability to manage massive data volumes while ensuring reliability is second to none. However, as businesses grow and demand for data increases, scaling Apache Kafka isn’t always a walk in the park.

Bringing Intelligence and Automation Together to Change the Shape of Work

The aspirational target state for a cognitive system is to “take responsibility” for a domain (e.g., an autonomous car). To reach that level of sophistication, the system must achieve high levels of maturity simultaneously along two dimensions: Reasoning ability and Automation ability.

Comparing The Top 9 Datadog Alternatives and Competitors in 2025

The rising costs and complexities of monitoring cloud infrastructure are pushing many organizations to explore alternatives to Datadog. With monthly bills sometimes reaching thousands of dollars and feature sets that can be overwhelming, teams are looking for practical, cost-effective solutions that better fit their needs.

From chaos to clarity with Grafana dashboards: How video game company EA monitors 200+ metrics

To be a successful gamer, you have to think strategically and creatively. Working as a software engineer at Electronic Arts (EA), a top video game company, requires the same skills. That’s especially true when it comes to monitoring the EA app, which is the launcher for EA games and used by hundreds of millions of people around the world.

Instrument LangChain and LangGraph Apps with OpenTelemetry

In our previous blog, we talked about how LangChain and LangGraph help structure your agent’s behavior. But structure isn’t the same as visibility. This one’s about fixing that. Not with more logs. Not with generic dashboards. You need to see what your agent did, step by step, tool by tool, so you can understand how a simple query turned into a long, expensive run.

Prometheus Group By Label: Advanced Aggregation Techniques for Monitoring

Your Prometheus dashboard shows 847 CPU metrics. The alert fired—but is the problem in us-east or us-west? You're trying to rule out whether that new feature caused a latency spike, but the sheer number of time series isn’t helping. Grouping can make this manageable. By organizing metrics by shared label values, you can quickly spot which service or region is behaving differently, without digging through every metric.
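In PromQL this kind of grouping is written with an aggregation such as `sum by (region) (...)`. The Python sketch below mimics the same idea on in-memory samples so the effect is easy to see; the label names, values, and the `sum_by` helper are made up for illustration and are not part of any Prometheus client library.

```python
from collections import defaultdict


def sum_by(samples, label):
    """Sum sample values, grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Series missing the label fall into an empty-string group.
        totals[labels.get(label, "")] += value
    return dict(totals)


# Hypothetical CPU samples: (label set, value).
samples = [
    ({"region": "us-east", "service": "api"}, 0.75),
    ({"region": "us-east", "service": "worker"}, 0.25),
    ({"region": "us-west", "service": "api"}, 0.5),
]

by_region = sum_by(samples, "region")
by_service = sum_by(samples, "service")
```

Three series collapse into two rows per grouping, which is what makes it quicker to tell whether us-east or us-west is the outlier.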

Best Network Monitoring Tools of 2025

Keeping tabs on your network has never been more important. Whether you’re running a small business or managing infrastructure across cloud environments, visibility into what’s happening behind the scenes is essential. But visibility alone isn’t enough…when something breaks, the IT engineer needs to know immediately, so they can take action and resolve critical issues.

Getting Started with AI Agent Monitoring From Sentry

Sentry has released AI Agent monitoring, and in this video you can see the fast path to getting started with it using the Vercel AI SDK and Anthropic Claude. AI Agent Monitoring uses tracing to let you see details around how AI interactions are happening inside your application. You can see the back and forth conversation flow, token usage, model usage, durations, and much more. Agent Monitoring is out now, take it for a spin, let us know what you think in Discord!

How to Simplify AI Observability Across Hybrid and Cloud Environments

As companies adopt more artificial intelligence (AI) to stay competitive and simplify operations, they’re hitting a snag they’ve seen plenty of times before: complexity. Those user-friendly chatbots and impressive predictive models aren’t magic—they run on powerful GPUs like NVIDIA’s and rely on cloud services such as Azure OpenAI or Amazon SageMaker.

Get structured visibility across network devices with device templates

Manually mapping object identifiers (OIDs) for every network device? Struggling to make sense of hundreds of SNMP metrics? Site24x7’s device templates give you a smarter, more scalable way to monitor routers, switches, firewalls, and more—without manual guesswork. In this video, we’ll walk you through how to use device templates in Site24x7 to get actionable insights into your network performance.

How to troubleshoot Kubernetes issues using Events | Site24x7 Kubernetes Monitoring

Troubleshooting Kubernetes just got easier. In this video, we walk you through how to use Kubernetes Events in Site24x7 to quickly detect, analyze, and resolve issues like CrashLoopBackOff, ImagePullBackOff, Evicted pods, and more without the guesswork. With Site24x7 Kubernetes Monitoring, you get full observability, right down to every critical event in your cluster.

Running #playwright Tests in Multiple Environments with Checkly. #sdet #devops

Learn how to efficiently run Playwright tests across different environments without rewriting them. This tutorial covers managing environment variables in Checkly for API and browser checks, handling global and group-specific settings, and integrating with CI/CD processes. Discover the best practices for setting up environment variables, duplicating test groups, and customizing alerts to ensure your checks are environment-specific.
Sponsored Post

The Agentic Network: How AI Agents Are Transforming Infrastructure from Liability to Living Intelligence

Modern enterprises depend on networks that are increasingly complex, dynamic, and opaque. Yet, instead of confronting this complexity head-on, most organizations fall into the trap of superficial control, layering more monitoring tools atop their stack in hopes of achieving resilience. In reality, this only fragments visibility, deepens operational silos, and leaves a crucial layer of the digital enterprise, the network, under-managed and misunderstood.

What is UDP Packet Loss & How to Monitor It

Have you ever found yourself scratching your head over UDP packet loss? You're not alone. UDP (User Datagram Protocol) is a go-to for streaming, gaming, and VoIP, but when packets start going AWOL, it can spell trouble for your network's performance. Imagine you're in the middle of an important VoIP call, a critical online gaming session, or a live video stream, and suddenly things get choppy or laggy—that's UDP packet loss rearing its ugly head.

Best Practices for Planning for Upcoming Cloud Maintenance

Cloud maintenance is a common practice in the tech industry. Whether you manage your own infrastructure or use a cloud provider, you will need to plan for maintenance and include it as part of your operational readiness. This ensures that your team is prepared for potential downtime and can deal with any incidents in a timely manner. This article will cover some best practices for planning for upcoming cloud maintenance.

Silent Downtime: The Hidden Cost of Delayed Awareness in Banking

Ask banking leaders if their systems are healthy, and most respond confidently: “Yes, everything’s up.” But track a transaction closely, and reality shifts. A high-value payment retries repeatedly before settling. A KYC process silently times out, losing a verified customer. Compliance checks complete using stale data. No visible outages. Yet silent failures accumulate, becoming costly and increasingly damaging. This is downtime that dashboards never flag.

Docker Status Unhealthy: What It Means and How to Fix It

If your container shows Status: unhealthy, Docker's health check is failing. The container is still running, but something inside, usually your app, isn’t responding as expected. This doesn’t always mean a crash. It just means Docker can’t verify the app is working. Here’s how to debug the issue and restore the container to a healthy state.

Introducing UptimeRobot's official Terraform provider

We’re excited to announce the official release of the UptimeRobot Terraform provider, a feature that many of you have been requesting. Starting today, you can manage your UptimeRobot resources, including monitors, alerting integrations, maintenance windows, and public status pages, directly in your Terraform configuration. Let’s take a closer look.

How AI-driven Anomaly Detection Fortifies Compliance in Multi-Cloud Infrastructures

In a multi-cloud environment, each cloud platform brings its unique tech stack to record events, manage services, set up configurations, manage user access and permissions, etc. While this allows you to leverage the best-of-breed services from different cloud vendors, the complexity of this setup makes it challenging to detect and respond to anomalies across clouds in real time.

Want recurring revenue? Deliver value - and prove it.

Recurring revenue streams bring stability to any business. For MSPs, that stability can be an essential foundation for service innovation and pursuing growth. Offering subscription-based managed services is an obvious way to get recurring revenue in place. Collaboration platforms like Microsoft Teams and Zoom are prime candidates for subscriptions since businesses depend on them every day, at all levels of the organization, to be productive and profitable.

Application Performance Monitoring (APM) Use Cases Every DevOps Team Should Know

Modern applications are built using distributed architectures, microservices, and cloud-native technologies. As these systems grow in complexity, it becomes harder for DevOps teams to maintain performance, track issues, and ensure a consistent user experience across all environments. Application Performance Monitoring (APM) helps solve these challenges by providing real-time visibility into how applications behave, from user interactions to backend services and infrastructure.

Why SaaS Infrastructure Monitoring Is Critical for Modern IT Operations

In today’s cloud-driven world, keeping track of your software services is no longer optional—it’s essential. SaaS infrastructure monitoring helps IT teams keep an eye on the performance, uptime, and health of all cloud-based applications and systems in real time. With businesses relying heavily on remote tools and digital platforms, monitoring your SaaS stack ensures smooth operations and quick issue resolution.

How to Monitor MPLS Networks

If you manage an enterprise network, then you’ve definitely come across MPLS. Although many businesses rely on MPLS technology for large, high-performing networks, these networks can suffer from problems, like congestion, that impact user experience. Monitoring MPLS using a Network Monitoring tool is key to identifying and solving network issues that impact MPLS performance.

Observability isn't about the tool. It's about the truth

An enterprise client reports latency. Your dashboards say everything is fine. They blame you. You blame them. Nobody can prove it either way. This is where most monitoring efforts hit a wall. Too often, the conversation gets stuck on dashboards and tools instead of the one thing that really matters: truth. Observability isn’t about collecting metrics or building pretty dashboards.

LangChain Observability: From Zero to Production in 10 Minutes

LangChain apps are powerful, but they’re not easy to monitor. A single request might pass through an LLM, a vector store, external APIs, and a custom chain of tools. And when something slows down or silently fails, debugging is often guesswork. In one instance, a developer ended up with an unexpected $30,000 OpenAI bill, with no visibility into what triggered it. This blog shows how to avoid that using OpenTelemetry and LangSmith.

Introducing DNS Monitoring - Stay Ahead of DNS Issues Before They Impact You

We’re excited to announce a powerful new addition to your monitoring toolkit: DNS Monitoring is now available on UptimeRobot! DNS (Domain Name System) is a core component of internet functionality. When DNS records are misconfigured, hijacked, or simply expire, they can lead to serious outages, broken email services, or even security risks. That’s why we’ve introduced DNS Monitoring – to help you stay in control of your domain’s health at all times.

IT Event Console: Centralize Logs, Correlate Alerts, and Detect Incidents

When you’re just starting out, you might picture yourself managing your IT infrastructure like Tom Cruise in Minority Report—key information projected in front of you, predicting events before they happen, controlling everything at the speed of thought with cinematic gestures on some kind of holographic computer.

Top Automation Use Cases for IT (in End User Computing)

As digital transformation continues to reshape the business landscape, IT teams are under more pressure than ever. Organizations demand faster service, always-on support, and seamless user experiences – all while IT budgets remain stagnant or even shrink. Organizations urgently need solutions that help them keep up without burning out their teams or inflating costs. This is where IT automation becomes essential.

From Zero to Dashboard in 10 Minutes with Telegraf, InfluxDB 3, and Grafana

In this tutorial, let’s walk through setting up a modern TIG stack in 10 minutes. TIG stands for three popular open source tools that complement each other: Telegraf, InfluxDB 3, and Grafana. They are often used to collect, store, and visualize time series data from servers, containers, APIs, or even IoT devices. We will be using a ready-to-use GitHub repository.

The Business Case for Network Automation: Cost Savings and Efficiency

Let’s get real: the cost of not automating your network operations is probably already showing up on your P&L, and not in the column you like. Manual configuration changes, ad hoc backups, and frantic compliance prep aren’t just operational headaches, they’re quiet killers of budget flexibility and scale readiness. Network automation is no longer a “nice to have” for companies with massive IT budgets or unicorn-level engineering teams.

APM best practices: Dos and don'ts guide for practitioners

Application performance management (APM) is the practice of regularly tracking, measuring, and analyzing the performance and availability of software applications. APM helps you get visibility into complex microservices environments, which can overwhelm site reliability engineering (SRE) teams. The generated insights create an optimal user experience and achieve desired business outcomes.

Choosing the Right APM Software: 5 Key Factors to Consider

When applications slow down, users leave, and engineering teams scramble. Whether you're troubleshooting a spike in response times or chasing down intermittent backend failures, Application Performance Monitoring (APM) provides the visibility you need to detect, diagnose, and resolve performance issues before they impact your users or business goals. For engineers, APM isn’t just a convenience - it’s essential. But not all APM tools are created equal.

How we created a single app to automate repetitive tasks with Datadog Workflow Automation, Datastore, and App Builder

For many organizations, scaling up their systems means incorporating new tools to build out infrastructure, optimize code performance and security, improve communication, and track cost changes. While these changes are necessary to support an increased workload, they often result in a situation where even the most basic tasks involve switching between multiple platforms.

Maximizing Uptime: How to Monitor Network Ports

Keeping critical services running smoothly starts with visibility, and that begins at the port level. Whether you're managing a lean environment or a complex network infrastructure, knowing which ports are active, listening, or down can make or break your response time. In this video, we walk through how to fully configure port discovery and monitoring in SL1. You'll learn how to track availability, respond to port failures with automated alerts, and ensure your systems are always one step ahead of potential issues.

Opsgenie is shutting down: Complete guide to alternatives in 2025

Atlassian just pulled the plug on Opsgenie. On December 3, 2024, they announced that Opsgenie will reach end-of-life by April 2027. New sales stopped on June 4, 2025, and if you're using the JSM-bundled version, you'll lose access even sooner—October 2025. Here's the kicker: Atlassian wants you to migrate to their fragmented JSM + Compass combo, which splits your incident management across multiple tools. The reality? Teams are frustrated.

InfluxDB 3 Core: a complete rewrite designed for speed and simplicity

InfluxDB has been a popular time series database for the better part of a decade, and the latest release represents years of work behind the scenes to address several major feature requests users have been asking for since the earliest days of the time series database.

AI-Powered Monitoring with Checkly

Most monitoring tools weren't built for the AI-first world. By nature, traditional monitoring platforms force you out of your natural coding environment and trap you in clunky web interfaces, brittle configuration panels, and rigid APIs. And sadly, when monitoring providers do offer "AI features," it's usually a chatbot bolted onto their existing UI, being nothing more than a pale imitation of the AI tools you’re reading about every day on Hacker News. All this creates friction.

Why Healthcare IT Can't Keep Relying on Legacy Monitoring

Supporting every hospital chart, scan, and bedside alert is a web of digital systems—EHRs, lab interfaces, clinical apps, networks, and connected devices—all working in sync or struggling to. When something slips, say, an Epic interface queue backs up and lab results don’t reach the attending physician on time, the consequences aren’t theoretical. That delay might mean a sepsis alert gets missed. A treatment window closes. A patient’s outcome changes.
Sponsored Post

Introducing Raygun CLI: Level-up your error tracking workflow

Raygun CLI is a powerful command-line interface tool designed to enhance the developer experience when working with Raygun's error tracking and performance monitoring platform. With this tool, we bring Raygun's features directly to your terminal, making it easier to integrate some important elements of Raygun Crash Reporting and error tracking into your development and CI/CD workflow. We are excited to announce the release of version 1.0.0 of Raygun CLI.

LangChain & LangGraph: The Frameworks Powering Production AI Agents

Your AI agent worked flawlessly in development, with fast responses, clean tool use, and nothing out of place. Then it hit production. A simple "What's our pricing?" query triggered six API calls, took 8 seconds, and returned the wrong answer. No errors. No stack traces. Unlike traditional systems, AI agents don't crash, they drift. They make poor decisions quietly, and your monitoring says everything's fine.

How to Run Elasticsearch on Kubernetes

Elasticsearch stands as one of the most robust open-source search engines available today. Built on Apache Lucene, it handles complex search operations, real-time analytics, and large-scale data processing with impressive speed and accuracy. Kubernetes has transformed how we deploy and manage containerized applications. This orchestration platform automates deployment, scaling, and operations of application containers across clusters of hosts.

Close the gaps in your SCOM monitoring with the Opslogix Autonomous Windows Service Management Pack

SCOM offers strong monitoring capabilities, which are extended through its various Management Packs. However, a common challenge is that some Windows services go unmonitored simply because they don’t belong to a specific Microsoft technology like SQL Server or IIS.

A little love for two old fellas - Icinga Business Process Modeling and Icinga Web Graphite Integration

Today is the day we grant two products their long-overdue maintenance. Maintenance always sounds boring, I hear you. But let me remind you that this also means we do take care! Now let’s see what each release offers!

The Complete Guide to APM Best Practices for Developers, DevOps & SREs

Application Performance Monitoring (APM) is no longer optional; it is essential for delivering fast, reliable, and seamless digital experiences. But simply installing an APM tool isn’t enough. To truly realize its potential, IT teams need to follow APM best practices. Best practices for APM refer to the most effective ways to monitor, analyze, and optimize your application’s performance using APM tools.

Introducing Netdata Insights

Now in research preview: Netdata Insights. The problem: Incident? You're jumping between dashboards, piecing together timelines. Reporting? You're copy-pasting charts and correlating trends by hand. The data’s there, but turning it into a narrative doesn’t scale. The solution: Netdata Insights. Synthesizes high-fidelity telemetry using the latest LLMs into AI-powered reports with natural-language explanations, visuals, and clear recommendations.

Netdata: The Fastest Path to Full Stack Observability. AI Powered.

Netdata is a real-time, high-performance, on-premises observability platform designed to monitor metrics and logs with unparalleled efficiency. Netdata requires zero configuration to get started and provides alerts, anomaly detection, and AI-assisted troubleshooting out of the box, delivering a powerful and comprehensive infrastructure monitoring experience. Netdata is known for its distributed design. Instead of funneling all data into a few central databases like most traditional monitoring solutions, Netdata processes data at the edge, keeping it close to the source.

Debug smarter with Session Replay in Site24x7 real user monitoring (RUM)

Frontend errors can be tricky to trace without context. Site24x7's Session Replay gives developers, SREs, and DevOps teams complete visibility into the user journey by capturing every click, scroll, and interaction as it happened. With visual replays and correlated performance data, you can quickly identify what went wrong, why it happened, and how to fix it—without relying on user screenshots or log reports.

Effortless customer monitoring with Site24x7's MSP Customer Health View

As a Managed Service Provider, staying on top of your customers’ monitor statuses shouldn't be a hassle. With Site24x7's Customer Health View, you get a centralized, real-time summary of every customer account you manage. Access monitor statuses, alarm counts, and overall account health—all in one place. Switch between List View and Grid View, apply filters to prioritize issues, and let auto-refresh keep you up to date every five minutes.

Dynamic Status Pages on Demand

Clients expect transparency - especially when things go wrong. But manually updating a status page during an incident or maintenance window slows you down when speed matters most. Oh Dear’s status pages are more than just a pretty uptime dashboard. They’re fully API-driven and designed to scale with your workflow. Whether you manage five client sites or five hundred, you can create, update and sync status pages as needed. Here’s how to do it.

Robust Time Series Monitoring: Anomaly Detection Using Matrix Profile and Prophet

Monitoring production systems often feels like searching for a moving needle in a constantly shifting haystack. At Sentry, our goal was to empower customers to move beyond traditional threshold and percentage-based alerting. We aimed to help them detect subtle and complex anomalies in their systems in near real-time. This post will detail how our AI/ML team developed a time series anomaly detection system using Matrix Profile and Meta’s Prophet.

What is a Jitter Buffer and How It Works

If you've ever been on a choppy VoIP call or sat through a video meeting where people sounded like robots from the ‘90s, you’ve likely run into a little thing called jitter. It’s one of those sneaky network issues that doesn’t always get the attention it deserves, until it ruins your real-time traffic. As IT pros and network admins, you're probably used to dealing with packet loss and latency. But jitter? That one's a bit trickier.

Top Kubernetes Monitoring Tools in 2025, And Why Alerting Is Critical for DevOps and SRE Teams

What are the best Kubernetes monitoring tools in 2025? And how can you ensure alerts actually drive action when something goes wrong? Kubernetes monitoring is critical for keeping your containerized applications healthy, but alerting is often overlooked. This blog compares popular tools like Prometheus and Datadog and explains why intelligent alerting solutions like OnPage are essential for effective incident response.

You can't fix what you can't see, especially when the problem isn't in your infrastructure.

Most teams monitor from the inside, tracking internal metrics, logs, and uptime. But internal health doesn’t always reflect what your users experience. The internet is made up of many parts you don’t own (ISPs, CDNs, DNS, cloud providers), and any one of them can introduce friction. That’s why monitoring from the outside in matters. By testing from real user vantage points, you get a clearer picture of network reachability and performance as it’s actually experienced.

MCP Server Integration & Much More: What's New in VictoriaMetrics Cloud Q2 2025

Q2 2025 has brought another wave of improvements to VictoriaMetrics Cloud! If you tuned in to our latest Quarterly Virtual Meetup, you saw firsthand how we’re making observability even more accessible, powerful, and interactive.

MCP Observability with OpenTelemetry

2025 has truly been the year of Agentic AI, with MCP (Model Context Protocol) emerging as one of its flashiest and most talked-about innovations. While many products have seamlessly integrated MCP servers into their systems, these servers are increasingly being labelled as black boxes, opaque components that handle critical tasks but offer little visibility into what's happening under the hood. We prompt an agent, a tool gets invoked, and a response is generated. But what really happens in between?

Introducing the Coralogix Operator for Kubernetes

As organizations begin to scale their observability strategy, point and click methods of management become increasingly unworkable. This is why Coralogix has now fully released the Coralogix Operator for Kubernetes. Kubernetes operators are control loops that allow users to declare their desired state in their Kubernetes clusters, and the operator is responsible for resolving this state.
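
The control-loop idea can be sketched without any Kubernetes dependencies. The Go example below is a simplified model, not the Coralogix Operator's code and not the Kubernetes client API: the `State` type, the `reconcile` function, and the resource names are all hypothetical. It compares a declared state with an observed one and works out the actions that close the gap.

```go
package main

import (
	"fmt"
	"sort"
)

// State maps a resource name to its spec. It stands in for both the
// declared (desired) state and the observed (actual) state.
type State map[string]string

// reconcile returns the actions needed to make actual match desired,
// and applies them to actual in place.
func reconcile(desired, actual State) []string {
	var actions []string
	for name, spec := range desired {
		current, ok := actual[name]
		switch {
		case !ok:
			actions = append(actions, "create "+name)
			actual[name] = spec
		case current != spec:
			actions = append(actions, "update "+name)
			actual[name] = spec
		}
	}
	for name := range actual {
		if _, ok := desired[name]; !ok {
			actions = append(actions, "delete "+name)
			delete(actual, name)
		}
	}
	sort.Strings(actions) // map iteration order is random; sort for stable output
	return actions
}

func main() {
	desired := State{"cpu-alert": "threshold=90", "disk-alert": "threshold=80"}
	actual := State{"cpu-alert": "threshold=75", "old-alert": "threshold=50"}

	// A real operator would repeat this on every change event or resync.
	fmt.Println(reconcile(desired, actual)) // [create disk-alert delete old-alert update cpu-alert]
	fmt.Println(reconcile(desired, actual)) // [] (already converged)
}
```

The second call returning no actions is the property that matters: reconciliation is idempotent, so it can be rerun safely whenever the declared or observed state changes.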

Coralogix launches OpenAPI endpoints

Observability is about much more than dashboards and alerts. Extensible platforms that integrate into the user’s tech stack are fundamental parts of a great developer experience. This is why Coralogix has supported gRPC APIs for account management, data ingress & query, alert definition, dashboard creation, permissions management and more. Today, Coralogix adds a new integration, with the launch of OpenAPI endpoints for all existing functionality.

IT Monitoring News | July '25 Edition

Welcome to the July edition of the NiCE bi-monthly IT monitoring news! As we reach the height of summer, we’re thrilled to share the latest updates, insights, and resources to help you stay ahead in IT monitoring. With new developments and recent releases, there’s plenty to discover, enhance, and get excited about. Let’s jump in!

Logging in Docker Swarm: Visibility Across Distributed Services

Docker Swarm's logging model shifts from individual container logs to service-level aggregation. The docker service logs command batch-retrieves logs present at the time of execution, pulling data from all containers that belong to a service across your cluster. This approach gives you a unified view of distributed applications, but it comes with its own patterns and considerations for effective observability.

How to Write Logs to a File in Go

When your Go application moves beyond development, you need structured logging that persists. Writing logs to files gives you the control and reliability that stdout can't match, especially when you're debugging production issues or need to meet compliance requirements. This blog walks through the practical approaches, from Go's standard library to structured logging with popular packages.
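
As a taste of the standard-library approach, here is a minimal sketch that appends log lines to a file using only `os` and `log`. The helper name `newFileLogger`, the file name `app.log`, and the `app: ` prefix are assumptions made for this example and are not taken from the linked post.

```go
package main

import (
	"log"
	"os"
)

// newFileLogger opens (or creates) the file at path in append mode and
// returns a logger that writes to it. The caller must close the file.
func newFileLogger(path string) (*log.Logger, *os.File, error) {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, nil, err
	}
	// Prefix each line and include timestamp plus source file:line.
	logger := log.New(f, "app: ", log.LstdFlags|log.Lshortfile)
	return logger, f, nil
}

func main() {
	logger, f, err := newFileLogger("app.log") // hypothetical file name
	if err != nil {
		log.Fatalf("open log file: %v", err)
	}
	defer f.Close()

	logger.Println("service started")
}
```

Opening with `os.O_APPEND` keeps earlier entries across restarts. This sketch does no rotation, so a long-running service would still need something to cap the file's size.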

The Hidden Cost of Downtime: Why IT Leaders Are Prioritizing Resilient Operations

No business sets out to tolerate downtime. And yet, across industries, unexpected service disruptions continue to drain revenue, erode customer trust, and expose operational fragility. For CIOs and IT leaders, the real concern isn’t if systems will break, it’s whether your team can outpace the fallout. Because in a crisis, speed isn’t just an advantage; it’s survival.

Elephant Flows: The Hidden Heavyweights of AI Data Center Networks

Elephant flows are no longer rare. They’re foundational to AI workloads. In today’s GPU-heavy data centers, long-lived, high-volume flows can distort ECMP, overflow buffers, and rack up unexpected cloud bills. Kentik helps you see and tame these elephants with real-time flow analytics, automated alerting, and predictive capacity planning.

The Dos and Don'ts of Successful Software Rollouts

Launching new enterprise software is one of the most strategic—but risk-laden—internal initiatives any organization can undertake. Done right, it accelerates transformation, streamlines operations, and boosts employee productivity. Done wrong, it can paralyze teams, spike IT tickets, and erode employee trust in the tools they’re given and the teams that support them.

Can Claude Code Observe Its Own Code?

One of the great things about OpenTelemetry is that it’s a standard, and standards tend to proliferate. I was excited to see Claude Code add OpenTelemetry metric and log support in a recent release. What was really interesting—beyond the ability to capture usage data from Claude Code—is that you can also get pretty detailed logs about what you’re doing with Claude Code.

When Will We See the First $1 Billion Company Run by a Single Individual?

It’s only a matter of time. OpenAI CEO Sam Altman said in 2024 that he thought this could be achieved by the end of 2026. Personally, I feel this is a little optimistic; however, based on the evidence I’ve seen, it won’t be long after that. Consider Telegram: a global messaging giant with just 30 employees, already achieving a remarkable $1 billion in revenue. Or Midjourney, revolutionizing creative industries with only 40 employees and generating an impressive $500 million.

Status Page Aggregator: Best Practices and Use Cases

A status page aggregator is a powerful tool that brings together the status updates of multiple cloud services, SaaS providers, and third-party services into a single, unified view. Whether you’re tracking the health of critical dependencies like AWS, Cloudflare, or niche SaaS tools your teams rely on, a status page aggregator simplifies monitoring and helps you stay ahead of outages.

Automate server restarts in SCOM with the Opslogix Autonomous Maintenance Mode Management Pack

Server restarts are routine, but in SCOM they often result in unwanted alerts if not handled properly. The Opslogix Autonomous Maintenance Mode Management Pack addresses this by automatically managing maintenance mode during restarts, minimizing false alerts and improving operational efficiency.

Enhanced monitoring of Amazon EKS with Elastic add-on capabilities

Easily enable the Elastic add-on within the Amazon EKS Console for streamlined monitoring and quick data onboarding. Amazon Elastic Kubernetes Service (EKS) makes running Kubernetes on AWS simple and scalable. But as your workloads grow, so does the need for robust monitoring and observability. Enter Elastic Agent, a powerful, unified way to collect logs, metrics, and security data from your EKS clusters, all managed through Elastic Fleet.

Perform Distributed Tracing for your MCP system with OpenTelemetry

2025 has truly been the year of Agentic AI, with MCP (Model Context Protocol) emerging as one of its flashiest and most talked-about innovations. While many products have seamlessly integrated MCP servers into their systems, these servers are increasingly being labelled as black boxes, opaque components that handle critical tasks but offer little visibility into what’s happening under the hood. We prompt an agent, a tool gets invoked, and a response is generated. But what really happens in between? And when something breaks, how do we trace the failure and debug it effectively?

PHP Monitoring Best Practices for Developers, DevOps, and SREs

In 2025, PHP still powers over 75% of the web, from ecommerce platforms like Magento to CMSs like WordPress and Laravel-powered web apps. As user expectations rise and digital experiences become mission-critical, real-time PHP monitoring has moved from a luxury to a necessity. According to Statista, PHP continues to rank in the top 10 most-used programming languages globally. Despite the popularity of modern stacks, legacy and modern PHP coexist in thousands of production environments.

Why GovRAMP-authorized observability matters for state, local, and education IT teams

Building on our FedRAMP Moderate authorization and our “In Process” status for FedRAMP High, Datadog for Government is now "In Process" for GovRAMP High Authorization, giving agencies a unified observability platform that meets the toughest public-sector security bars.

Faster incident response through distributed tracing: Inside Glovo's use of Traces Drilldown

It’s almost 1 p.m. on a Monday afternoon and you’re hungry. You pull up your meal delivery app and select your favorite restaurant and dish. Then you go to check out and nothing happens. Your frustration mounts as you get hungrier by the minute. But there’s frustration on the other side of that transaction as well—engineers are scrambling to figure out what’s wrong as orders drop and revenue losses rise.

Top 5 outages detected by StatusGator in June 2025

June 2025 saw several high-impact outages across popular cloud services — from infrastructure giants like Google Cloud to developer platforms like Supabase and Heroku. For IT teams, MSPs, and developers, even short service disruptions can have ripple effects across workflows and customer experience. At StatusGator, we continuously monitor thousands of services to detect issues in real time — often before they’re publicly acknowledged.

StatusGator now monitors 6,000+ services

Today, StatusGator monitors over 6,000 cloud services and tools — a massive expansion that reflects how far we’ve come, and how deeply embedded we are in the fabric of modern infrastructure. In today’s world, your product’s reliability depends on a web of vendors — authentication providers, analytics platforms, CDNs, payment processors, communication tools, and more. At 6,000+ services, StatusGator now reflects your entire digital supply chain.