Beyond ping: How OpManager redefines network discovery for modern IT

Today’s networks aren’t just growing, they’re evolving. Hybrid architectures, cloud-native services, and a never-ending stream of connected devices have made it impossible to keep track of what’s on your network manually. This is exactly where a next-gen network discovery tool becomes a game-changer. ManageEngine OpManager is more than a monitoring solution.

VB Transform 2025: The Enterprise AI Revolution Takes Center Stage

The Fabrix.ai team attended VentureBeat’s VB Transform conference, which returned this week as the premier gathering for enterprise AI leaders, showcasing how artificial intelligence has evolved from experimental chatbots to autonomous agents reshaping entire industries.

Proactive Network Protection with Progress WhatsUp Gold 2025: SSL Certificate Monitoring That Helps Prevent Outages

A single expired SSL certificate can disrupt critical services, erode customer trust, and trigger a series of avoidable issues. That’s why we’re excited to introduce a powerful new feature in Progress WhatsUp Gold 2025.0: Certificate Discovery and Monitoring. This enhancement is more than just a checkbox on a release note; it’s a proactive safeguard designed to help you spot certificate issues before they escalate into business problems.

Install Pandora ITSM from Pandora FMS Console

Until now, deploying Pandora ITSM required a standalone installation, manual database configuration, and later integration with Pandora FMS. With the new NG 783 version, that entire process has been simplified: Pandora ITSM can now be installed directly from the Pandora FMS web console, with no additional servers, no external steps, and integration already configured.

Built for Engineers: Datadog's Vision for the Future

Datadog was built by engineers, for engineers. Datadog Co-founder & CEO Olivier Pomel opened the keynote with a clear message: observability, security, and AI are converging. From infrastructure to AI agents, the future of engineering requires one unified platform. Catch all product announcements to see what’s next in observability and security on our YouTube channel!

Get a better structure in your SCOM environment with the Opslogix Classification Management Pack

Alerts in SCOM can easily become overwhelming, making your environment feel noisy and unstructured. The real challenge is getting the right number of alerts to the right people at the right time. The Opslogix Classification Management Pack includes features like tiered classification levels, dynamic grouping, and extended tagging.

Prometheus Gauges vs Counters: What to Use and When

Choosing the wrong metric type in Prometheus can lead to inaccurate dashboards, false positives in alerting, and missed indicators of system failure. Gauge metrics are intended for tracking values that can go up and down, such as memory usage, queue depth, or the number of active connections. Unlike counters, which only increment (or reset on restart), gauges reflect the current state of a resource at scrape time.
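
To make the distinction concrete, here is a minimal Python sketch that does not use any Prometheus client library. The function names and sample values are illustrative assumptions. A counter is read as an increase across scrapes, treating any drop as a restart reset (the same idea behind reset handling in Prometheus's rate() and increase(), without their extrapolation), while a gauge is read as its latest value.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across scraped counter samples.

    A counter only goes up, so a drop between two consecutive samples is
    treated as a restart: the counter began again from zero, and the new
    value is counted in full.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            total += curr  # reset detected
    return total


def gauge_current(samples: Sequence[float]) -> float:
    """A gauge is a point-in-time reading, so the latest sample is the value."""
    return samples[-1]


# Hypothetical scrapes of a request counter that restarted after the third sample.
requests_total = [100.0, 130.0, 160.0, 5.0, 25.0]
# Hypothetical scrapes of a queue-depth gauge.
queue_depth = [12.0, 40.0, 7.0, 3.0]

increase = counter_increase(requests_total)
depth = gauge_current(queue_depth)
print(increase, depth)
```

Swapping the two readings gives meaningless numbers: summing the deltas of a queue depth says nothing about the current backlog, and the raw value of a counter says nothing about recent traffic.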

How we've created a successful FinOps practice at Datadog

When you adopt FinOps to maximize the value of your cloud spending, you may have some simple first steps you can take to gain cost efficiency. For example, you can find and delete any unused resources to quickly realize a one-time optimization. But the ongoing work to manage cloud costs becomes complex as your organization grows, your infrastructure spans multiple clouds, and you can't easily see the full value of your cloud spending by tracking only the bottom line.

What's New in InfluxDB 3.2: Explorer UI Now GA Plus Key Enhancements

InfluxDB 3.2 is now available for both Core and Enterprise, bringing the general availability of InfluxDB 3 Explorer, a new UI that simplifies how you query, explore, and visualize data. On top of that, 3.2 includes a wide range of performance improvements, feature updates, and bug fixes. InfluxDB 3 Core is free and open source, optimized for recent data, and licensed under MIT and Apache 2.

Operational Intelligence - the new horizon of observability

Monitoring your systems isn't enough anymore. Neither is “asking questions about your system”. Operational Intelligence embraces observability to proactively deliver business insights, support decision-making, and accelerate innovation. It seems that as the observability market grows and more and more products come into the space, the meaning of the term observability itself becomes more and more nebulous.

Going beyond AI chat response: How we're building an agentic system to drive Grafana

As we look at the role AI can play in Grafana going forward, we want to move beyond the simple chat responses that dominate the world of LLMs today and into agentic systems—AI that can understand, reason, and act on your behalf. The ultimate goal is to make it easy to get things done in Grafana using natural language—whether you’re a seasoned SRE or a new developer. And in the AI world, we call this moving from chat completion to task completion.

Do you Grok It?

Most people are probably familiar with the word “grok” from Robert A. Heinlein’s novel Stranger in a Strange Land, in which it is used to describe a deep, almost mystical understanding of something. Grok is also the name of a plugin for Logstash that enables you to parse and analyze log data using a syntax similar to regular expressions, but specialized for various log formats and fields.
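
As a rough illustration of the idea, the sketch below expands Grok-style %{NAME:field} tokens into named regular-expression groups in Python. It is not the Logstash implementation: the pattern definitions are simplified assumptions, and the log line is made up for the example.

```python
import re

# Hypothetical, simplified pattern library; real Grok ships far more
# patterns, with more careful definitions than these.
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "URIPATH": r"/\S*",
}

GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(expression: str) -> re.Pattern:
    """Expand each %{NAME:field} token into a named regex capture group."""

    def expand(match: re.Match) -> str:
        name, field = match.group(1), match.group(2)
        return f"(?P<{field}>{PATTERNS[name]})"

    return re.compile(GROK_TOKEN.sub(expand, expression))


expression = "%{IP:client} %{WORD:method} %{URIPATH:path} %{NUMBER:status}"
line = "203.0.113.7 GET /index.html 200"  # made-up access-log line

parsed = grok_to_regex(expression).match(line)
fields = parsed.groupdict() if parsed else {}
print(fields)
```

The appeal of this style is that the expression reads as a description of the log line, while the regular-expression details stay in a shared pattern library.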

Custom Alerts in Checkly

Learn how to customize your alerts in Checkly to get only the notifications you need. This video walks through account-wide alert settings, managing alert channels, using groups for business-critical checks, and leveraging Monitoring as Code to manage everything from your IDE. Plus, see how to use the Checkly CLI to import existing checks from the UI into code for full version control and automation.

Optimizing mobile website performance using digital experience monitoring

Delivering an exceptional mobile user experience (UX) is critical for business success. With mobile devices generating over 60% of global web traffic (2024) from billions of active users, a subpar mobile experience can lead to lost customers and revenue. Slow-loading pages, along with design-induced poor interactivity and unstable layouts, frustrate users. Bad UX quickly drives disgruntled users to competitors, and it is a one-way street.

How to Reduce Application Downtime with APM?

According to a recent 2025 study, the average cost of downtime has inched as high as $9,000 per minute for large organizations. For higher-risk enterprises like finance and healthcare, downtime can eclipse $5 million an hour in certain scenarios. Whether you're part of a DevOps team, an SRE, a developer, or an engineering manager, minimizing application downtime should be a critical focus. One of the most effective ways to achieve this is through Application Performance Monitoring (APM).
Sponsored Post

SAP system refresh automation

SAP system refresh automation is extremely powerful when leveraged with care; system refreshes are complex and challenging processes to manage. System refreshes can be fraught with risk for organizations with critical data due to their level of complexity. Mitigating this risk comes down to knowing the benefits of automation and how the processes work. To try out Avantra's SAP automation features, sign up for a free trial.

What's Slowing Down Your App? Common Performance Issues APM Can Solve

Application performance is critical to user experience and business success. When an application starts slowing down, identifying the root cause isn’t always straightforward. For developers, DevOps engineers, and SREs, Application Performance Monitoring (APM) tools provide real-time visibility into how applications behave under load.

How Dropbox rebuilt its logging stack with Grafana Loki after a data center went dark

Two years ago, a power outage knocked a Dropbox data center offline. It wasn’t just any data center. It was the only one where Dropbox hosted Grafana Loki, meaning engineers couldn’t access their log data. “We had considered a data center outage when we were rolling out Loki, but it had just never risen up in priority enough to get put into multiple data centers,” said Chris Hodges, an infrastructure software engineer at the cloud storage company.

Route your monitor alerts with Datadog monitor notification rules

As organizations scale their infrastructure, monitoring systems can become a source of noise rather than insight. A clean, straightforward set of alerts for a handful of services can quickly spiral into a mess of overlapping thresholds, redundant triggers, and inconsequential notifications across hundreds (or thousands) of components. This flood of notifications can slow response times, overwhelm engineers, and increase the chance of overlooking critical problems.

The Open Source Observability Podcast - EP #1: Clickhouse, Data Lakes, and AWS S3 with Joshua Lee

In this episode we get to dive into some of Josh's favourite databases and telemetry sources for observability. Listen to learn what open source software you could benefit from including in your toolstack! Joshua Lee is a Developer Advocate at Altinity, where he applies his observability and engineering background to ClickHouse use cases and creates educational content to support the open source community. He has over 15 years of experience in leading software projects for a broad scope of industries.

How to Handle the NumberFormatException in Java

The NumberFormatException is one of the most common runtime exceptions you'll encounter in Java. It's an unchecked exception that occurs when you try to convert a string to a numeric value, but the string format isn't compatible with the target number type. Simply put, if you attempt to parse "hello" as an integer or "12.5" as an integer, Java throws a NumberFormatException because these strings can't be converted to the expected numeric format.

Drive Public Sector Efficiencies of Scale with Splunk and AWS

Today’s public sector organizations are tasked with delivering a staggering amount of technology capabilities to support a growing set of digital services, meet IT modernization goals, and continue to protect against a wide range of attack vectors. Cloud technology adoption has played a significant role in ensuring that ongoing IT modernization not only aligns with each organization’s mission-strategic capabilities but also enables efficiencies of scale.

CPU monitoring for network admins: Why it matters more than ever

In your role as a network administrator, maintaining smooth, uninterrupted system performance isn’t just a one-time task; it’s your daily mission. Whether you're managing hundreds of endpoints, virtual machines, or hybrid cloud environments, CPU monitoring is one of the most critical tools in your toolkit. Without it, diagnosing performance slowdowns, service lags, or outages becomes reactive guesswork.

Event Intelligence Solutions: The Essential Tools Every ITOps Manager Needs - and How Interlink Software Delivers

david.arrowsmith • June 27, 2025

IT Operations (ITOps) managers need to ensure always-on availability across a more complex and hybrid ecosystem than ever before. Event storms, patchwork toolchains and slow root cause analysis (RCA) impede responsiveness and undermine the high digital performance customers demand. The Event Intelligence and Service Observability Platform from Interlink Software addresses this.

From Detection to Resolution: How Selector + Itential Deliver AI-Driven Observability and Automated Recovery

Every second counts when it comes to detecting, diagnosing, and resolving network incidents, yet many teams still find themselves stuck in reactive mode, drowning in alerts, manually writing scripts, and managing tickets across disconnected systems. This is where Selector and Itential come in. Together, Selector and Itential deliver a powerful, enterprise-ready solution that closes the loop between detection and action.

What's New in Flowmon ADS 12.5?

In this webinar, we’ll introduce the new features, including:
- AI-Powered Threat Briefings: A new dashboard that correlates global threat intelligence with your network’s current and historical data.
- Enhanced Event Visualization: A redesigned event detail view that streamlines the user experience.
- Expert-Level Recommendations: Guided next steps for each detection, helping analysts of all skill levels validate and resolve incidents with confidence.

Monitoring Behind the Great Firewall

As Site Reliability Engineers (SREs) managing global infrastructure, we face unique challenges when serving users in mainland China. The Great Firewall of China creates a complex web of technical obstacles that can render even the most robust international websites slow, unreliable, or completely inaccessible to Chinese users.

Can AI/ML Guide Observability? Tech Talk #6

This talk will examine the application of Artificial Intelligence and Machine Learning in observability. It will cover how AI/ML is being used to monitor systems, detect anomalies, and extract insights from telemetry data. The session will provide information on integrating AI/ML into observability pipelines, improving analytical capabilities, and system performance.

Nexthink Achieves FedRAMP "In Process" Designation

We are proud to announce a significant advancement in our commitment to serving the US federal market: Nexthink is now listed as “In Process” in the FedRAMP marketplace. To achieve this, we have been working closely with our federal consultant, Quzara, to complete a rigorous security assessment. Through this process, we're implementing hundreds of required controls to meet the highest standards of cloud security.

F5 Monitoring on Microsoft SCOM

As part of a recent customer project, we developed a custom F5 Management Pack for Microsoft System Center Operations Manager (SCOM). This bespoke solution enables IT operations teams to monitor the performance, availability, and health of F5 infrastructure directly within the SCOM environment. It provides deep visibility into key metrics, helping ensure application delivery remains stable, secure, and efficient.

Hidden Value in Sumo Logic: What Customers Often Miss -- Customer Brown Bag -- June 26th, 2025

Join us as Andy Makings reveals 12 powerful tips and tricks that many users overlook in Sumo Logic. These practical insights can streamline your daily workflows and unlock deeper, more actionable intelligence from your data.

Improve SLO accuracy and performance with Datadog Synthetic Monitoring

SLOs are key for improving user satisfaction, prioritizing engineering projects, and measuring overall performance. Given the important role that SLOs play in determining organizational benchmarks, teams need to ensure that SLO metrics—also called service level indicators (SLIs)—are reported accurately and maintained consistently within an acceptable range.

Troubleshooting: No data or monitor not created for .NET applications in Site24x7 APM Insight

Are your .NET applications not showing up in Site24x7 APM Insight? This step-by-step video will help you troubleshoot missing data or monitor issues for both IIS-hosted applications and Windows Services.

How to detect vulnerable GitHub Actions at scale with Zizmor

As we previously reported on April 26, 2025, we had a security incident via an insecure GitHub Action and we have since published a post-incident review. We have confirmed that there has been no code modification, unauthorized access to production systems, exposure of customer data, or access to personal information.

How to Use an AI Assistant with Your Monitoring System - VictoriaMetrics MCP Server

Alex Marshalov explores the new VictoriaMetrics MCP Server. He moves beyond the hype to show what's truly possible today. The presentation offers a builder's perspective on integrating AI with time-series data, featuring a demo that showcases both the potential and the current realities (yes, there are some). See how we're thinking about solving complex monitoring challenges with AI.

Next Level - Infrastructure Monitoring and Load Balancing

Are you getting the most out of these solutions? Modern network infrastructures are complex, and yours is no different. As such, a load balancing solution is required to keep the servers up and running for both your customers and employees. The same goes for network traffic monitoring and analysis to understand user behavior.

Prometheus and CloudWatch Integration for AWS Metric Collection

The Prometheus CloudWatch exporter pulls AWS CloudWatch metrics into your Prometheus setup, giving you a unified view of your infrastructure alongside application metrics. If you're already running Prometheus and need visibility into AWS services like EC2, RDS, or Lambda, this exporter handles the integration without forcing you to switch monitoring stacks.

Elastic Cloud Serverless now generally available on Microsoft Azure

Elastic Cloud Serverless provides the fastest way to start and scale security, observability, and search solutions without managing infrastructure. Today, we are excited to announce the general availability of Elastic Cloud Serverless on Microsoft Azure, now available in the EastUS region.

Fabric Interconnect: Connecting Servers with UCS Hardware

Every IT decision-maker faces a common challenge: balancing operational efficiency with cost control. While software solutions help streamline operations and drive efficiency, they can introduce redundancies into your system. These redundancies strengthen availability through backup systems but often complicate data management, leading to inconsistencies and potential outages. This is where hardware solutions like fabric interconnects prove invaluable.

Your integrated MSP platform is here - announcing the launch of MSP Central!

For MSPs, the daily juggling act of managing multiple tools can be a drain on resources, efficiency, and, ultimately, your bottom line. You need comprehensive visibility, streamlined workflows, and the ability to proactively address client needs—all without the headache of disparate systems. That's why we're excited to announce the launch of MSP Central, your comprehensive, unified platform for streamlined MSP business management.

Logz.io Adds PrivateLink Support, Introduces the Parsing Rules Hub, and Significantly Enhances Parsing Capabilities

Today, we’re excited to announce support for AWS PrivateLink, allowing Logz.io customers to securely send logs and metrics through private VPC connectivity, without any data ever hitting the public internet. If you’re running workloads inside a VPC on AWS, this upgrade drastically improves your security posture, simplifies your networking architecture, and – most notably – reduces your data transfer costs (a lot).

Microservices to Monolith, Rebuilding Our Backend in Rust

The following serves as a practical guide for those looking to simplify their architecture by migrating to a Rust monolith. Earlier this year, the platform team at InfluxData undertook a major rewrite of our core account and resource management APIs, moving from Go to Rust and from a microservices architecture to a single monolith. This change supported a new administrative UI for InfluxDB Cloud Dedicated and aligns with our broader effort to rewrite the InfluxDB database engine in Rust.

Elastic's journey to build Elastic Cloud Serverless

Stateless architecture that auto-scales no matter your data, usage, and performance needs. How do you take a stateful, performance-critical system like Elasticsearch and make it serverless? At Elastic, we reimagined everything, from storage to orchestration, to build a truly serverless platform that customers can trust. Elastic Cloud Serverless is a fully managed, cloud-native platform designed to bring the power of Elastic Stack to developers without the operational burden.

Introducing AI Agent Monitoring

AI is changing how we build software, but debugging code still comes down to having context. One minute the model’s performance is cruising. The next, you’re hit with a KeyError from a tool you forgot existed, triggered by a model that silently timed out, and a retrieval call that returns... nothing, or 11 “Let me try this a different way” messages before failure. You’re stitching together LLM calls, agents, vector stores, and custom logic. Then hoping it holds up in prod.

Achieving Full Visibility: Modern Monitoring for Distributed Cloud Applications

Today’s applications are hybrid, cloud-centric, service-oriented, API-dependent, and geographically distributed. The monitoring practices we relied on for decades are no longer sufficient. It is critical to monitor all the internet-centric dependencies, connectivity, and cloud application components – and to do so from the user’s perspective so IT operations teams can achieve digital resilience and deliver performance. This session will cover DEM, APM, and IPM and how they can work together to pinpoint issues before they occur, so users receive a great digital experience.

Introducing AI Agent Monitoring in Sentry

Monitoring agents and LLM applications is... different. You're managing everything from tool calls to model configurations and token usage, and AI systems do their best to solve problems on their own, so errors aren't always clear. Sentry's agent monitoring focuses on making it easy to dive into your AI applications and understand what's breaking and where, so you can fix it faster.

How Sentry's Seer AI Agent passes legal review: a guide for legal teams reviewing Seer

If your legal department is anything like ours, you’re being inundated with requests from the business to use more and more AI tools. From developers wanting to use coding agents like Cursor, to security implementing AI-driven investigations, to sales and marketing leveraging AI for call insights and competitive research, we've seen a shift in what teams are trying and buying.

Now you can use Sentry Insights to trigger alerts and debug issues

You deploy a fix late Friday and spend the weekend refreshing dashboards, hoping nothing breaks. You shouldn’t have to babysit a dashboard to know when something’s wrong. With the latest updates to Insights, you can now create alerts directly from any chart. Whether it’s a spike in 4xx errors after a deploy, a jump in P95 latency for an API endpoint, or a drop in throughput for a background job, you can set up alerts with just two clicks.

Understanding APM and Distributed Tracing in the Observability Stack

To keep modern applications running smoothly, you need more than just basic monitoring. APM (Application Performance Monitoring) gives you a broad overview, tracking metrics like latency, errors, and system health. Distributed Tracing, on the other hand, shows the full journey of each request across services, helping you pinpoint the root cause of slowdowns or failures.

Grafana Cloud updates: The latest features in Kubernetes Monitoring, Fleet Management, and more

We consistently roll out helpful updates and fun features in Grafana Cloud, our fully managed observability platform powered by the open source Grafana LGTM Stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics). In case you missed them, here’s our monthly round-up of the latest and greatest Grafana Cloud updates.

Trace Distributed Map states for AWS Step Functions with Datadog

AWS Step Functions offers the Distributed Map state, enabling you to coordinate massively parallel workloads within your serverless applications. With this feature, a single Step Functions execution can fan out into up to 10,000 parallel workflows simultaneously, making it possible to efficiently process millions of items in parallel. This capability unlocks new possibilities for large-scale data processing, such as image transformation, log ingestion, or batch analytics.

What is log tagging and how to configure it in Site24x7

In this video, learn what a Site24x7 log tag is and how to configure, categorize, filter, and monitor your logs more effectively, so you can create custom log tags that give you full visibility into your logs or categorize them even better. Whether you're an IT professional, DevOps engineer, or security analyst, this video will help you create smarter tags to support your monitoring decisions.

Infrastructure monitoring with Site24x7 | Cloud, Kubernetes, and Hybrid Environments

Modern IT environments are dynamic, distributed, and constantly evolving. You need more than traditional monitoring to keep everything running smoothly. Site24x7 is your all-in-one, AI-powered infrastructure monitoring solution. Whether you're overseeing AWS, Azure, GCP, OCI, VMware, or Kubernetes, Site24x7 simplifies it all with a single agent and AI-driven insights.

The Road to Loki 4.0 (Loki Community Call June 2025)

In this Loki Community Call, we welcome back Ed Welch, Principal Engineer on the Loki team. We will be discussing with Ed what is next for Loki as we push forward to Loki 4.0. If you are interested in learning more about potential architecture changes and storage formats, and in an open discussion on where Ed and the Loki team would like to take Loki in the future, make sure you join us live and have your questions answered!

Observability Across Asia-Pacific: What's Holding Teams Back? | 2025 Observability Survey Analysis

What’s holding back observability maturity in Asia-Pacific? Grafana Labs' cofounder Anthony Woods shares key takeaways from the largest global observability survey. Learn how SaaS, budget concerns, and org structure are shaping Asia-Pacific (APAC)'s future. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more.

What Is Session Replay and How It Improves User Experience in IT Environments

Anyone who works in technology quickly learns this truth: users will always interact with systems in the most unexpected and baffling ways… and when something goes wrong, they swear they “didn’t touch anything.” There’s a vast ocean between how something is designed and how it’s actually used—an ocean filled with bugs waiting to be caught. But there’s a way to bridge that gap: session replay.

How to Reduce IT Costs on Hardware Refresh Cycles

IT budgets are under pressure, and hardware refresh costs continue to climb. For End User Computing (EUC) and IT professionals, the traditional time-based approach to managing device lifecycles is no longer viable. Simply replacing laptops and desktops every three to five years doesn’t reflect actual device performance, usage patterns, or business needs. The solution? A smarter, data-driven hardware refresh strategy that balances performance, cost-efficiency, and employee experience.

Introducing Cause Analysis: Instant Triage for Traffic Changes with Kentik AI

Introducing Cause Analysis from Kentik, designed to simplify network traffic analysis and rapidly identify the root cause of issues. Learn how this exciting new feature streamlines troubleshooting, makes complex insights accessible, and boosts team efficiency for all users.

Amazon SQS Metrics: Monitor, Debug, and Optimize Your Message Queues

Message queues quietly take care of a lot—buffering workloads, smoothing traffic spikes, and keeping services connected. But they don’t always get much attention until something feels off. Amazon SQS offers a solid set of metrics to help you understand how your queues are doing, whether you’re scaling well or nearing limits. This blog breaks down the key SQS metrics: where to find them, what they mean, and how to respond when things start to shift.

How to Configure Docker's Shared Memory Size (/dev/shm)

Your Node.js app runs fine on your machine. But inside Docker? You start getting weird crashes—ENOSPC: no space left on device. Chrome headless tests fail out of nowhere. PostgreSQL throws shared memory errors under load. The problem? It’s probably /dev/shm, the shared memory volume Docker sets up by default. Most containers get just 64MB of space here.
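
One quick way to confirm the diagnosis is to check how large /dev/shm actually is from inside the container. The Python sketch below is only an illustration: the function name and the returned keys are assumptions, and it uses the standard library's shutil.disk_usage to read the size.

```python
import os
import shutil

DEFAULT_DOCKER_SHM_BYTES = 64 * 1024 * 1024  # the 64MB default mentioned above


def shm_report(path: str = "/dev/shm") -> dict:
    """Report total and free bytes for the filesystem mounted at `path`,
    and whether its total size is at or below the 64MB default."""
    usage = shutil.disk_usage(path)
    return {
        "total_bytes": usage.total,
        "free_bytes": usage.free,
        "at_default_or_smaller": usage.total <= DEFAULT_DOCKER_SHM_BYTES,
    }


if __name__ == "__main__":
    # /dev/shm only exists on Linux, so skip the check elsewhere.
    if os.path.isdir("/dev/shm"):
        print(shm_report())
```

If the reported total is too small for the workload, the --shm-size option on docker run is the usual way to raise it.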

How to Create a Free Status Page in Under 5 Minutes

Your website goes down at 2 AM. Your customers wake up to broken services, flooded support inboxes, and zero communication from your team. By the time you're awake and fixing things, trust is already damaged. A status page prevents this nightmare scenario. But here's the thing — most teams keep putting it off because they think it's complicated, expensive, or time-consuming. It's not. You can create a professional status page in under 5 minutes, completely free. I'll show you exactly how.

How One MSP Used AI to Cut Noise by 78% and Reclaim Engineering Time

An operations team at one of Asia-Pacific’s largest managed service providers (MSPs) was drowning in their own success. Years of investment in monitoring tools and automation had created comprehensive visibility, and comprehensive chaos. Engineers opened dashboards each morning to find thousands of alerts waiting, with critical incidents buried somewhere inside. The scale of the problem was overwhelming their capacity to respond effectively.

Why synthetic testing is the secret to proactive Teams management

The more organizations depend on collaboration solutions like Microsoft Teams for productivity, the more IT departments are expected to ensure a seamless experience every time. That demands more than just rapid troubleshooting when issues occur: it requires IT teams to get ahead of problems and keep them from affecting users in the first place. For that, synthetic testing is a must.

See more, solve more with end-to-end network path tracing

Few things hold IT teams back more than a lack of visibility. It’s exponentially harder to solve issues when they originate in parts of the environment you can’t see. That’s one of the big limitations of native tools for monitoring and managing Microsoft Teams. Microsoft Call Quality Dashboard, Admin Center and Service Dashboard, and Meeting Room Pro Dashboard are all constrained to the aspects of Teams that Microsoft controls directly.

Top 10 Network Monitoring Tools to Boost Your IT Performance

In today's digital landscape, a strong and secure network forms the foundation of any organization. When networks go down, suffer performance issues, or face security risks, companies can incur significant financial losses and damage to their reputation. IT teams need network monitoring tools to stay on top of performance, spot problems, and keep things running. As AI, cloud-based solutions, and automation improve, 2025 brings a range of powerful tools to make your IT setup work better.

Why do hotels have smoke detectors in every room, not just one on every floor?

Early detection matters. When a problem occurs, you want to know immediately, not after the damage is done. Monitoring isn’t just about visibility; it’s about precision, speed, and proximity to the problem. Just like smoke detectors, you need to monitor in the right places: close to your critical infrastructure, applications, and end users. The sooner you detect issues, the cheaper and easier they are to fix. And that’s where real resilience begins.

Fireside Chat: Observability Lessons and Practices from a Fortune 500 Leader

Join SAP CX's Martin Norato Auer, VP of Observability, and Catchpoint’s Nick Homan as we explore SAP CX’s journey from fragmented alert management to a scalable, standardized observability model. In this candid fireside chat, Martin shares how his team overcame alert fatigue, integrated observability with automation and BI, and scaled their practices across multiple SAP CX products with APM & Internet Performance Monitoring (IPM).

Observability Without Tradeoffs: Introducing Powerful New Honeycomb Telemetry Pipeline Features

Every day, enterprise companies generate terabytes of observability data while engineering teams are under pressure to cut costs. One of the easiest ways to reduce observability bills is through sampling: intentionally sending only a representative portion of telemetry data, rather than the full volume, to your observability tool. But turning down the dial is risky.
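
As a rough illustration of what "sending only a representative portion" can mean in practice, here is a minimal Python sketch of deterministic, hash-based sampling. It is not taken from Honeycomb's pipeline: the function name `keep_trace`, the trace ID format, and the 10% rate are all assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Decide whether to keep a trace (hypothetical helper, for illustration).

    Hashing the trace ID instead of rolling a random number means every
    span that carries the same trace ID gets the same keep/drop decision.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


if __name__ == "__main__":
    # Made-up trace IDs; a real system would use its own identifiers.
    trace_ids = [f"trace-{i}" for i in range(10_000)]
    kept = [t for t in trace_ids if keep_trace(t, 0.10)]
    print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

The risk the blurb mentions is visible here: a fixed rate drops rare but important traces (errors, slow requests) just as readily as routine ones.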

11 Best Log Monitoring Tools for Developers in 2025

Your checkout API just started throwing 500s during peak traffic. You SSH into production, tail logs across six microservices, and realize the database timeout buried in service's logs is causing cascade failures. Two hours later, you've fixed it, but you're thinking: "There has to be a better way." There is. Log monitoring tools centralize logs from your entire stack, making debugging systematic instead of archaeological.

Grafana Cloud: Manage the AWS Observability app as code with Terraform

Imagine setting up your AWS configuration in Grafana Cloud by hand and clicking through menus. When you only have a few services, it’s not a big deal. But as you add more and more, keeping track of every little change becomes a headache. It’s easy to make mistakes, and before you know it, things can get out of sync and your monitoring becomes unreliable.

How Cursor scaled infrastructure rapidly and reliably using Datadog

At Datadog, we use Cursor to empower our teams to build more quickly. And we know that building and troubleshooting with AI tools like Cursor is done best with the right observability data and context. Discover how Cursor was able to rapidly and reliably scale their infrastructure 100x using Datadog to meet the needs of a fast-growing user base. And learn more about how we’re bringing Datadog tools and context to your favorite AI IDEs and agents with our MCP Server and extensions.

How to fix high CPU temperature: A network admin's checklist

It’s 2 AM. Your phone buzzes. A critical server’s CPU is maxing out again. But this time, the issue isn’t just high usage. It’s heat. As a network admin, you’re trained to monitor traffic patterns, patch vulnerabilities, and respond to performance slowdowns. But high CPU temperature? That’s the silent system killer many still underestimate. Without a proactive plan, it can knock out performance, rack up hardware costs, and shorten the lifespan of your infrastructure.

Data Center Ops with InfluxDB 3: From Raw Metrics to Actionable Insights with Ease

Modern data centers generate enormous volumes of telemetry from servers, switches, cooling systems, power infrastructure, and environmental sensors. Operations engineers must capture, store, and analyze this data in real-time to monitor uptime, maintain energy efficiency, and perform predictive maintenance using AI. Legacy monitoring systems struggle to meet today’s volume, cardinality, and latency demands.

AI Test Generation and PR Review in Sentry (Now in Open Beta)

You write code. Open a PR. CI runs. PR merges. Prod’s on fire by 5pm. Maybe you skipped writing some tests. (It's tedious, sometimes unclear, and easy to ignore when you're racing to ship—until something breaks and you realize a test could’ve saved your Friday night.) Maybe the PR review was more of a drive-by from a teammate who barely had time to skim the diff. But reviews and tests matter.

How to connect your AWS account with Site24x7 using IAM role | Step-by-step tutorial

In this step-by-step tutorial, learn how to securely connect your AWS account with Site24x7 using IAM role-based cross-account access. We’ll guide you through: This method ensures secure, read-only access to your cloud environment while enabling real-time monitoring and alerting via Site24x7.

FIPS 140-3 Compatible Builds for VictoriaMetrics Enterprise Components

VictoriaMetrics introduces FIPS 140-3 compatible builds for its components, starting with version 1.117.0. These builds utilize Google’s FIPS 140-3 validated BoringCrypto module. This is critical for customers in regulated environments (federal government, finance, healthcare) to meet FIPS 140-3 cryptographic requirements for data encryption, TLS, and secure communications.

Observability 2.0: Seeing More, Knowing More, Fixing More

The era of scattered monitoring tools and fragmented operational visibility is over. As hybrid and multi-cloud environments have become the norm rather than the exception, traditional observability approaches—siloed metrics, isolated logs, and disconnected traces—can no longer keep pace with the complexity of modern IT infrastructure. Organizations today need more than just monitoring.

Naming your kernel objects

When using Percepio TraceRecorder, kernel objects like queues, semaphores and mutexes are named using their address by default. This can be a bit hard to follow for complex traces. However, it is quite easy to set more descriptive custom names for your RTOS kernel objects. This is done by calling the “SetName” functions (or macros) found in the TraceRecorder API, for example: The first argument is the pointer to the object (i.e. the object address).

Interacting With Log Data in Security Event Manager

SolarWinds Security Event Manager is designed to give users a centralized view of logs and events occurring across their network, and quickly and easily recall specific logs and identify suspicious patterns and behaviors in that data. This video gives a quick overview of the features in SEM, making it easy for users to view and interact with their log data.

A Guide to Effective Network Load Testing & Load Balancing

When it comes to network management, two challenges are ever-present: ensuring optimal network performance and maintaining uninterrupted network connectivity. Network admins are the unsung heroes, diligently managing the digital highways that connect the modern world. To maintain the delicate balance between seamless user experience and network reliability, two crucial practices come to the forefront: Network Load Testing and Load Balancing.

Stay Compliant: Meet Your Audit Needs with Datadog!

Datadog's internal compliance team has built audit workflows and control monitoring capabilities using the Datadog platform. We actively use these capabilities to scale our audit programs and comply with multiple compliance frameworks. This session will go into the details of how we addressed our compliance use-cases using the Datadog platform and how our customers can get started.

Introducing ZTB - Defining Zero Trust for Bring Your Own Cloud (BYOC)

Isn’t the "Bring Your Own Cloud" (BYOC) model the latest hot topic in the evolution of cloud-native architecture, especially for companies offering cloud-hosted platforms that must be deployed in the customer’s cloud for privacy, control, or compliance reasons? Over the past few weeks, we have been rigorously researching and discussing how to build a secure BYOC model.

Zero-effort alert migration from Prometheus to Coralogix

Having spent two decades in technical leadership, I’ve seen first hand what separates great development teams from merely good ones. It’s not about the number of features shipped or the elegance of the codebase — it’s about the ability to consistently deliver value to the customer through really great user experience.

Escalating risk, shrinking margins: The 2025 Internet Resilience Report

When we first launched Catchpoint’s Internet Resilience Report back in 2024, we were already seeing troubling cracks in the digital foundations of major businesses. Remember the CrowdStrike outage? Fast-forward to this year, and it's clear the stakes have only gotten higher. Google Cloud’s recent outage is yet another reminder of how tightly interwoven the Internet is and how all it takes is for one major player to go down, for thousands of businesses to be affected worldwide.

OpenTelemetry vs Fluent Bit - Key Differences 2025

Modern applications demand strong observability to ensure performance, reliability, and quick troubleshooting. Two powerful open-source tools, OpenTelemetry and Fluent Bit, play key roles in this space. While OpenTelemetry offers a full-stack framework for collecting metrics, logs, and traces, Fluent Bit specializes in fast, lightweight log forwarding.

Coralogix adds OTel-based service dependency tracking for distributed systems

Coralogix has released its APM Dependencies feature. This feature automatically surfaces and maps the relationships within and between your software and external services. It allows fine-grained tracking of which endpoints within your APIs depend on other endpoints, or on external services and database tables.

Top 14 Best Infrastructure Monitoring Tools & Solutions in 2025. Full Reviews and Side by Side Comparison

As your business grows, so will your infrastructure and the number of applications or services running in it. In other words, forget about any sort of manual monitoring or home-grown scripts or tools if you want to keep your sanity. Whether you need performance metrics, service health and availability status, infrastructure, or application logs, you need a tool that will give you end-to-end visibility into the health of your infrastructure.

We did it! SquaredUp is now a B Corp

After four years. Hundreds of meetings and conversations. Countless forms and paperwork submissions… We’ve done it. SquaredUp is finally, officially B Corp certified. We can now say that we are part of a unique global movement – and we couldn’t be more proud, excited, and motivated for our journey ahead.

Maintenance Window Improvements

We've made major improvements to maintenance window notifications with flexible options that adapt to your communication strategy. Now you have three notification options for every maintenance window: You'll also see how many subscribers will be notified with a detailed breakdown of subscriber counts by channel type (Email, Slack, Teams, etc.), giving you complete visibility into your communication reach before sending.

The Benefits of Using Juniper's Network Monitoring Tools for IT Operations

More data means more complexities in IT networks. Hence, the right solution is needed to monitor such networks. Many companies struggle without the right tools, and they often lose great business opportunities because they are unable to identify performance-related issues upfront. Network monitoring is thus essential for business success. It helps build healthy network performance, saving companies money in the long run.

OpManager earns triple recognition in 2025

We’re pleased to share that ManageEngine OpManager has earned recognition across three critical areas of IT operations, achieving triple crown status in IT infrastructure management. OpManager has been featured in GetApp’s Category Leaders, Software Advice’s Front Runners, and Capterra’s Shortlist, in addition to being named in the Gartner Market Guide for IT Infrastructure Monitoring (ITIM).

The Visibility vs Cost Trap: A Dangerous Tradeoff

“You can’t investigate what you don’t have”. Every analyst knows the pain of missing context. You’re in the middle of a high-stakes investigation, but the logs you need are gone, archived weeks ago due to retention limits. Or worse, they were never collected in the first place to keep costs under control. This is the Visibility vs. Cost trap, and it puts analysts at a disadvantage every day.

Harnessing Network Observability to Speed the Telco-to-Techco Transition

For telecommunications firms (telcos), the race is on. If these organizations are to rise to meet their top challenges and growth objectives, transformation is a must. Those who make this move most rapidly will be best positioned for sustained success. Today, telcos face several significant challenges, which are creating fundamental disruption: Telcos need to transform to contend with these shifts.

Generating Playwright Tests With AI: Let's Try the New Playwright MCP Server!

In this video, Stefan (Playwright Ambassador) dives into the integration of AI with the Playwright MCP server to automate end-to-end test generation. Learn about MCP, browser automation and how to combine everything to generate Playwright tests. We'll explore AI capabilities and limits and discuss best practices for generating accurate and reliable Playwright tests. If you're curious about leveraging AI for end-to-end testing with Playwright, this video is for you!

AI-Augmented Control Plane: Scaling IT Operations with Intelligent Automation

How do you enable a team of 100 engineers to effectively support 300+ critical applications across five hosting platforms? At Thomson Reuters, we turned to AI - not as a buzzword, but as a genuine force multiplier. Experience our journey of transforming traditional IT operations into an AI-augmented powerhouse, where Datadog, ServiceNow, and custom AI solutions work in harmony to create a next-generation control plane. We'll share real victories, honest challenges, and practical insights from our mission to build a more intelligent operational framework.

Real User Monitoring (RUM) vs. Synthetic Monitoring: Understanding Best Practices

For modern engineering and DevOps teams, user experience isn’t a post-deployment concern, it’s a critical operational metric. Monitoring how real users interact with your application is no longer optional, especially in high-traffic, dynamic, or global environments. This is where real user monitoring (RUM) proves indispensable. But RUM isn’t the only approach.

Getting started with Cloudflare dashboards

Cloudflare is a widely adopted web performance and security platform, best known for its CDN, DDoS protection, and DNS services. While it provides rich telemetry and real-time analytics, the sheer volume and complexity of the data can make it hard to identify key trends or issues at a glance. This is where a solution like SquaredUp (or another dashboarding tool) comes in.

Custom timeframes are here!

In the realm of data and observability, timing is everything. Until now, SquaredUp provided fixed time options like the last hour, 12 hours, 24 hours, last week, and this month. While these options served many users well, we recognized that they lacked the flexibility you needed. Whether you're tracking long-term performance, comparing trends, or looking into specific events, we know these preset options could sometimes feel limiting.

Farewell, Cherwell: Celebrations, Migrations, and Considerations

As Cherwell approaches end of life, IT service management (ITSM) teams are facing major decisions about what’s next. In this episode, SolarWinds host Sean Sebring sits down with Matt Neigh, a former Cherwell executive, and Michael Clark, a SolarWinds Solutions Engineer and former Cherwell admin, to reflect on Cherwell’s legacy, explore migration best practices, and discuss what to look for in a modern ITSM platform.

Advanced Threshold Configurations in Site24x7

Are constant, trivial alerts overwhelming your IT and DevOps teams? In this video, learn how Site24x7's Advanced Thresholds provide smarter alerting by understanding meaningful patterns and anomalies, improving focus and response to real issues. We'll walk you through: Whether you're a system admin, network engineer, or IT manager, this feature helps you streamline alert management.

Structured Logging in NextJS with OpenTelemetry

Traces tell you what happened and when. Logs tell you why. When something breaks, logs are often your first clue—and if they’re correlated with traces, they can cut debugging time down from hours to minutes. In this section, we’ll wire up end-to-end structured logging across both server and browser environments in your Next.js app, complete with trace correlation and SigNoz integration.

Brand-Driven Observability: Crafting Monitoring That Reflects Your Product Identity

In the fast-paced world of modern IT operations, observability has become a crucial pillar in ensuring the health, reliability, and performance of complex systems. As organizations scale their infrastructures and embrace distributed architectures, monitoring systems have evolved beyond simple uptime checks to holistic observability platforms. However, in this technical landscape, one often overlooked element is the role of branding in observability design.
Sponsored Post

MariaDB Monitoring for Enhancing Performance, Availability, and Security

As organizations increasingly rely on MariaDB for their critical applications, ensuring optimal database performance, availability, and security becomes essential. This whitepaper provides a strategic guide to mastering MariaDB monitoring, helping IT teams proactively detect and resolve issues before they impact business operations.

How the Factry Historian data source for Grafana enables data-driven insights for factory teams

Frederik Van Leeckwyck is the co-founder and CRO at Factry. He oversees go-to-market activities and ensures their software solutions align with real factory demands. Passionate about open technologies, he believes in making data-driven insights accessible to everyone in the factory. Factories today are often rich in process data, but poor in insights.

Fluentd vs Fluent Bit: A Side-by-Side Comparison 2025

Fluentd and Fluent Bit are both open-source data collection and processing tools, but they serve different purposes. Fluentd offers a comprehensive, plugin-rich architecture ideal for centralized log aggregation. Fluent Bit is designed for performance and efficiency, making it a better fit for edge devices and environments with limited resources. This Fluentd vs Fluent Bit comparison outlines their key differences, helping you decide which fits your infrastructure best.

Highlight reel: Futureproof Your AI Investment With Observability

Artificial intelligence is changing the way modern systems are built—and how teams are expected to and operate them. But as AI-driven complexity grows, so too does the need for deep, reliable, and fast visibility into what’s really happening inside our. In this timely and thought-provoking session, Christine Yen, CEO and Co-founder of Honeycomb, explores how practices must evolve to keep pace with.

Prometheus Logging Explained for Developers

Running apps in production? You need visibility fast. Traditional logging gives you scattered events. Prometheus gives you structured, queryable data that scales. In this guide, we’ll break down how to use Prometheus for logging-style observability, where it fits in your stack, and how to plug it into tools like Grafana or your cloud-native setup.

In Case You Missed it: DX NetOps Active Experience Launched

There’s no doubt that managing networks today is a whole different ballgame than it used to be. Complexity is growing, environments are more fragmented, and user expectations have never been higher. One of the biggest challenges for network operations teams? Visibility—or the lack of it. Network operations used to be much simpler. Traffic flowed through your own data center, and you had the visibility and control needed to manage performance and troubleshoot issues.

What is Internet Jitter & How to Test It

If you’ve ever had a user complain that their video call was choppy or their VoIP call had weird delays, even though the Internet speed looked fine, you’ve probably run into Internet jitter. It’s one of those issues that doesn’t always show up on a speed test, but it can absolutely wreck real-time communication. And if you’re managing networks across remote offices, home setups, or hybrid work environments, you’ll want to keep an eye on it.
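
To make the idea concrete, here is a minimal Python sketch of one common, simplified way to put a number on jitter: the average absolute difference between consecutive latency measurements. The function name `jitter_ms` and the sample values are invented for illustration, and real tools may use a different formula (for example, a smoothed estimate).

```python
from statistics import mean


def jitter_ms(latencies_ms):
    """Mean absolute difference between consecutive latency samples.

    A simplified definition used here for illustration only.
    """
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    deltas = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return mean(deltas)


if __name__ == "__main__":
    # Made-up round-trip times in milliseconds.
    steady = [50, 50, 50, 50]
    choppy = [20, 22, 19, 35, 21]
    print("steady link jitter:", jitter_ms(steady))  # 0
    print("choppy link jitter:", jitter_ms(choppy))  # 8.75
```

Note that the steady link has the higher average latency but zero jitter, which is why a problem like this can hide behind a speed test that looks fine.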

VictoriaLogs Unleashed: Cluster Version Now Available for Exceptional, Linear Scaling

You asked, and we listened! We’re thrilled to announce the release of the VictoriaLogs Cluster version – one of the most requested and anticipated updates from our user community. This marks a significant leap forward for VictoriaLogs, empowering users to handle log volumes and ingestion rates far beyond the limits of a single node.

The Cost of Bad Data: Why Time Series Integrity Matters More Than You Think

Data plays a critical role in shaping operational decisions. From sensor streams in factories to API response times in cloud environments, organizations rely on time-stamped metrics to understand what’s happening and determine what to do next. But when that data is inaccurate or incomplete, systems make the wrong call. Teams waste time chasing false alerts, miss critical anomalies, and make high-stakes decisions based on flawed assumptions.

Building and Using a Custom #OpenTelemetry #Collector with #Bindplane

Check out the full @bindplane community call in June. We explore building custom OpenTelemetry collectors with the OpenTelemetry Distribution Builder and using Bindplane's new Bring Your Own Collector feature. We showcase source and destination compatibility within Bindplane and how BYOC does not let you misconfigure a custom-built collector.

Bindplane Recommendation Engine: Automatically Improve Telemetry Parsing #opentelemetry #collector

Check out the full @bindplane community call in June. See how Bindplane instantly suggests improvements using its recommendation engine. This video explores how to automatically parse severity with default values, enhancing data analysis efficiency. Learn how to quickly optimize your setup.

Regex Log Parsing Made Easy with AI/LLM Support #opentelemetry #collector #observability

Check out the full @bindplane community call in June. We explore Apache HTTP source and the new AI regex log parsing capabilities. We leverage a Bindplane processor for complex pattern matching, enabling efficient data processing. This guide demonstrates how to easily generate and apply regex patterns with AI support.

Docker Stop vs Kill: When to Use Each Command

When a container starts consuming excessive memory or becomes unresponsive, you need a way to shut it down. The two primary options, docker stop and docker kill, both terminate containers, but they operate differently and have different implications. The key difference: docker stop sends SIGTERM for a graceful shutdown, then escalates to SIGKILL if the process doesn’t exit in time. docker kill skips straight to SIGKILL, terminating the container immediately.
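
The same signal sequence can be sketched in plain Python on an ordinary child process, with no Docker involved. This is only an analogy for the behavior described above, assuming a POSIX system; the helper names (`spawn_child`, `stop_gracefully`, `kill_immediately`) and the short grace period are made up for the example.

```python
import signal
import subprocess
import sys


def spawn_child(ignore_sigterm: bool) -> subprocess.Popen:
    """Start a sleeping Python child; optionally make it ignore SIGTERM."""
    setup = "signal.signal(signal.SIGTERM, signal.SIG_IGN);" if ignore_sigterm else ""
    code = f"import signal, time; {setup} print('ready', flush=True); time.sleep(60)"
    proc = subprocess.Popen([sys.executable, "-c", code], stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the child has finished its setup
    proc.stdout.close()
    return proc


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float) -> int:
    """Like 'stop': send SIGTERM, wait, then escalate to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, cannot be caught or ignored
        return proc.wait()


def kill_immediately(proc: subprocess.Popen) -> int:
    """Like 'kill': go straight to SIGKILL."""
    proc.kill()
    return proc.wait()


if __name__ == "__main__":
    # On POSIX, a negative return code is the number of the fatal signal.
    polite = stop_gracefully(spawn_child(ignore_sigterm=False), grace_seconds=2)
    stubborn = stop_gracefully(spawn_child(ignore_sigterm=True), grace_seconds=0.5)
    print("cooperative child ended by:", signal.Signals(-polite).name)
    print("stubborn child ended by:", signal.Signals(-stubborn).name)
```

A process that handles SIGTERM gets the chance to flush buffers and close connections; one that ignores it is force-killed once the grace period runs out, just as if SIGKILL had been sent from the start.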

Blueprints Are Pre-Built Processor Bundles #opentelemetry #collector #observability

Check out the full @bindplane community call in June. Here we explore automated JSON parsing using a JSON Parse processor bundle that was added with the Blueprint feature in Bindplane. Learn how to parse JSON strings, extract fields, and set accurate timestamps without having to add any custom configs. Bindplane handles all the heavy lifting automatically.

From the source to the edge: the six agent types you can't ignore

Recently, Catchpoint expanded our Global Agent Network to over 3,000 agents. In a crowded space, this is by far one of our key differentiators. At the time of writing, no one else boasts 395 providers in 105 countries and 346 cities. As Director of ISP Strategy, I’m not here to pat myself on the back—my real question is: why?

Top tips: Fly high with AI-benefits of artificial intelligence in aviation

Top tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’ll look at two ways AI is optimizing flying for the passenger as well as the airline. "Brace! Brace! Brace!" Simple request. Serious consequences. Something that could get even the most vocal atheist to start praying. Something that no one would ever wish to hear in their lifetime.

What is Real User Monitoring (RUM)?

As applications grow more complex and user expectations rise, delivering seamless and high-performing experiences to users is non-negotiable. Real User Monitoring (RUM) has emerged as an essential technique that gives developers, DevOps teams, and site reliability engineers deep visibility into the actual performance of web applications by capturing the experiences of real people in real time.

Why Clarity Demands More Than Dashboards

Despite years of investment in observability stacks and AI dashboards, most IT organizations still struggle with one uncomfortable truth: they can’t identify root cause in real time, and they can’t explain how technical failures impact the business. Not in dollars. Not in user flows. Not in boardroom language. What’s worse, they often don’t realize what they’re missing.

WWDC 2025: What's new for enterprise device management

Apple’s WWDC 2025 delivered a wave of exciting updates for anyone involved in managing company devices. With improvements designed to simplify provisioning, strengthen app controls, and expand what Apple Business Manager can do, these changes are all about making life easier for IT teams irrespective of the industry. In this article, we’ll break down the key announcements and explore how they could reshape the way you manage your organization’s Apple devices.

Creating a Java monitoring strategy for high-availability systems

High-availability (HA) systems form the backbone of modern enterprise applications. In today's always-on world, Java applications are expected to deliver consistent performance with minimal downtime. However, achieving this critical objective is impossible without a well-defined and executed monitoring strategy. A robust Java monitoring approach is essential to ensure resilience, uptime, and peak performance.

Securing AI with AI-SPM: The Next Step in AI Risk Management

The conversations around artificial intelligence (AI) typically revolve around its vast potential: writing applications, automating tasks, or transforming entire industries. However, despite the excitement around AI’s potential, the more pressing issue for many organizations is how to manage the risks of deploying it at scale across the enterprise. This is where AI Security Posture Management (AI-SPM) comes into play.

Getting Started with Traceroute

“Traceroute? You mean the thing I can type at the command line? Why would I even want to set up a test for that?” This is, believe it or not, a comment we hear a lot at Catchpoint. At least from folks who are either new to tech, new to monitoring, or new to Catchpoint (or all three). It’s a common misconception. It’s also something I’m not going to spend a ton of time addressing here. This blog is not meant to convince you why traceroute is super useful (even though it is).

Access Logs: Format Specification and Practical Usage

Your server's been logging everything—it’s just easy to overlook until something breaks. Every incoming request, database call, or auth check ends up in your access logs. They’re not flashy, but they quietly document every interaction your system handles. For developers, they’re often the most reliable starting point when things go wrong. In this blog, we'll take a look at what an access log is, its format, types, and a few best practices.

Honeycomb Observability Day London: A Jam-Packed Day of Great Talks

On May 15th, 2025, Honeycomb hosted Observability Day (or O11yDay) in the London financial district. The skies were clear and the weather was wonderful and we had a huge turnout, from our networking breakfast to the happy hour at the end of the day.

A New Look At Dependencies: Icinga Dependency Views

We’re excited to share that Icinga now offers an improved way to view dependencies. With the releases of Icinga DB Web 1.2.0, Icinga DB 1.4.0, and Icinga 2.15.0 today, any dependencies you’ve set up in Icinga will now be visually represented. Additionally, we’re introducing a new enterprise feature called Icinga Dependency Views, available through an Icinga subscription. This component expands Icinga DB Web with even more powerful capabilities.

Get more out of Sumo Logic: five log search hacks you'll actually use

Think Sumo Logic is only for query language pros? Think again. Whether you’re deep into JSON logs or just trying to make sense of a Linux error message, these five time-saving hacks turn anyone into a log-searching ninja, no regex, no complexity, just clicks. From instantly parsing values to filtering down with a tap, these tips will help you troubleshoot faster, work smarter, and feel more confident in your observability game. You’ve got logs, now it’s time to put them to work.

LLM Observability for Reliability and Stability: A Monitoring Strategy for Phone Communication

LLM APIs offer groundbreaking potential, but also present challenges such as response latency, hallucinations, and service instability. In Japan, where telephone communication remains crucial for business, these issues present significant barriers to the introduction of LLM-based applications. Despite being a relatively young startup, we have developed and deployed an LLM-based telephone service with over 40 million calls.

An open source tool to speed up iOS app launch

What do the Snapchat, Airbnb, and Spotify iOS apps have in common? They all use order files to speed up their iOS app launch times. Order files re-order your binary to improve how symbols are loaded into memory. No code changes are necessary, but generating an optimized order file can be cumbersome, so it’s mostly done by larger teams or teams willing to pay for a service like Emerge Tools’ Launch Booster. It just so happens that Emerge Tools is now part of Sentry.

Log Management and Query Optimization in Kibana

When troubleshooting with the Elastic Stack, Kibana is often the interface you’ll rely on to query and visualize logs. It doesn’t change the data—it just makes it searchable and a bit easier to work with under pressure. If you’re investigating an outage, tracking performance issues, or trying to correlate events across services, Kibana’s log exploration tools can speed up the process, assuming they’re configured and used well.

Mastering Global Telemetry: How Cribl Puts You in Control

Let’s face it: managing global data infrastructure isn’t just hard, it’s “I-just-deployed-the-wrong-config-to-prod-again” hard. If you’re a Cribl Admin or Operator working across clouds, continents, and compliance regimes, your to-do list probably reads like a series of increasingly desperate Post-it notes. Sources. Destinations. Pipelines. TLS settings. Proxies. Dev, staging, prod. Repeat. Forever. But what if we told you there’s a better way?

7 Critical Insider Threat Indicators and How to Detect Them

Cybersecurity threats don’t come solely from external attackers. Insider threats also require your attention. Insider risk originates from employees, contractors or business partners who possess legitimate access to IT systems for their work tasks. They can access valuable data and systems that, if exposed or stolen, could harm an organization’s reputation.

The hype is over: Generative AI is driving the evolution of search within enterprises

Discover how Accenture and Elastic are helping businesses seize the opportunities offered by generative AI. When it comes to generative AI, enterprises need to think big. Shaving a few seconds off the time needed to draft an email is helpful, but the journey to real value begins when you apply AI at the enterprise level. A new partnership between Accenture and Elastic combines technical expertise and strategic excellence, enabling businesses to build the data foundations for a successful AI future.

Configure and customize Kubernetes Monitoring easier with Alloy Operator

What if you were to tell Kubernetes Monitoring what you wanted, and the system configured collectors based on your choices? We wondered that as well—wondered enough to create Alloy Operator and its Helm chart for version 3.0 of the Kubernetes Monitoring Helm chart. We’re excited to share that the new Kubernetes Monitoring Helm chart is now available, and it introduces a dynamic way of setting up your telemetry data collection with Alloy Operator.

Why Modern Incident Response Strategies Need Network and Service Intelligence: Part 2

In Part 1, we explored how aligning network visibility with IT service context empowers faster, smarter incident response. But what does this actually look like? Here in Part 2, we’ll go deeper into the challenges of traditional monitoring approaches, and how teams should look to move from fragmented alerts to unified insights – because when ITOps and NetOps can both see the “what” & “why” of the problem, actions become instinct.

Guide for Catching Regressions with GitHub Actions and CI/CD Monitors

This guide aims to help your team shift testing left, simulate real user behavior, and catch critical issues early as part of CI/CD. You can prevent regressions from reaching production by automating tests in your pipeline and aborting deployments that contain issues. Synthetic monitoring is a great way to check important flows in production and make sure everything is working the way it’s supposed to.

Optimize Your Event Analysis: Reports, Dynamic Filters, and Log Parsing in Pandora FMS SIEM

The latest Pandora FMS version presents key improvements to the SIEM module, designed to enhance security event detection and management. These new features are available starting with Feature Release 782, allowing for optimized log analysis, report generation, and rule validation in distributed IT environments.

Azure CDN for Static Assets, APIs, and Front Door

If your users are spread across the globe but your servers are sitting in Virginia, you’ll probably hear complaints about slow load times, especially from places like Australia. CDNs fix this by caching static assets closer to where your users are. Azure CDN does exactly that, and it fits well if you're already using Azure services. You can hook it up to Blob Storage, App Services, or your origin. This guide covers how to set it up, what to expect, and how to know it’s working.

Seer, Sentry's AI Debugger, is Generally Available

Tired of trying to guess if that half-baked LLM suggestion is really going to fix the issue with your code? Meet Seer—our new AI agent that taps into all the issue context from Sentry and your codebase to not just guess, but root cause gnarly issues and propose merge-ready fixes specific to your application. Code gen tools are great fun—and useful. But even a recent Microsoft study confirmed what you already know: AI struggles with debugging.

How Network Configuration Automation Improves Security and Efficiency

Let’s face it: the modern enterprise network is a leviathan. No longer just a collection of routers and switches, today’s networks span multiple clouds, hundreds of SaaS applications, and countless IoT devices—supporting a workforce that could be anywhere.

Change Management in Pandora ITSM with Full Traceability and Custom Workflows

With version 106 of Pandora ITSM, a critical feature has been introduced for technology environments operating under security frameworks, regulatory compliance, and efficient management: Change Management. This new module allows changes to be registered, approved, implemented, and closed in a structured way, with full traceability and responsibility control.

AutoCon3: Network Automation's Premier Conference

AutoCon3 in Prague offered important takeaways on network automation’s evolution, from hands-on learning and design principles to the impact of AI and the power of community. Read Justin Ryburn’s recap to learn about key insights from the event, showing why network automation is now a core competency you’ll want to understand.

How InfluxDB 3 Enterprise Delivers 10-Millisecond Queries Over Historical Time Series Data

Time series data, such as IoT sensor readings or stock market ticks, flow in fast, often at a rate of millions of points per second. Querying this data, especially years of historical records, can be slow and painful if using a nonspecialized database rather than a time series database like InfluxDB.

Best Network Traffic Generator and Simulator Stress Test Tools

Benchmarking the environment of a new network is a crucial part of ensuring its success when it goes live. This includes stress testing and generating traffic on existing networks, both of which help you to identify any potentially flawed or vulnerable areas—for example, drops in connection and packet loss. As we know, network traffic is critical to the success of a business, as it determines how data flows and how effectively your applications interact.

Webinar Snippet: Internet Troubleshooting with Obkio's "Sandwich Method"

This is a snippet from our full webinar: “Troubleshooting Internet Issues: For Dummies & IT Pros”. In this clip, we dive into Obkio’s Sandwich Method, a simple yet powerful approach to monitoring and identifying Internet issues. By placing Monitoring Agents inside your LAN, at your firewall or in the DMZ, and over the Internet, you can break your network into clear segments and pinpoint exactly where performance problems are happening, whether it’s local, at the network edge, or in the hands of your ISP.

You have 3 seconds... that's it.

Today, users lose patience fast. A 3-second delay in page load time leads to 40% of users abandoning your site. This leads to a damaged reputation, a decrease in customer trust, and a loss of revenue. What does that mean for you? Every millisecond counts. If you're not measuring your performance from your users' point of view, you might be missing a chance to convert them into customers.

16 common mistakes C#/.NET developers make (and how to avoid them)

As developers, we often fall into common pitfalls that impact the performance, security, and scalability of our applications. From neglecting data validation to overengineering, and from ignoring async/await to mishandling resource disposal, even experienced C# developers can make these mistakes. In this post, I've gathered some of the most frequent issues developers encounter in C# and how to avoid them with practical solutions.

Edwin AI Turns One: What a Year of Agentic AIOps Looks Like

Twelve months ago, we shipped Edwin AI with a specific hypothesis that AI agents could handle the operational drudgery slowing down ITOps teams. It was a deliberate bet against the cautious consensus that AI should act only as a copilot, limited to offering suggestions. Most AIOps tools still follow that script. They’re stuck surfacing insights and stop short of action. Edwin was built differently. It was designed to make decisions, correlate events, and execute fixes.

Introducing Netdata Insights

Now in research preview: Netdata Insights. The problem: Incident? You're jumping between dashboards, piecing together timelines. Reporting? You're copy-pasting charts and correlating trends by hand. The data’s there, but turning it into a narrative doesn’t scale. The solution: Netdata Insights, which synthesizes our high-fidelity telemetry using the latest LLMs into AI-powered reports with natural-language explanations, visuals, and clear recommendations.

Overview of Dashboard

Get a complete overview of the Uptime.com Dashboard in this video! Learn how to monitor your checks, customize the layout, and analyze global uptime metrics and alerts. See how to view response times, sort check cards, manage alerts, adjust auto-refresh settings, and save multiple personalized dashboards. Whether you're tracking uptime, organizing by tags, or customizing alerts, this guide covers everything you need to make the most of your dashboard.

Monitor Your Kubernetes Cluster: Get Started in Four Minutes

For enterprises embracing Kubernetes, managing these intricate environments can pose significant challenges. Thankfully, monitoring of Kubernetes clusters is readily achievable using the Universal Monitoring Agent (UMA) in conjunction with DX Operational Observability (DX O2).

The role of network automation in AI-driven businesses

AI adoption is accelerating across nearly every industry. According to McKinsey’s 2025 State of AI report, 78% of organizations now use AI in at least one business function, up from just 55% the year prior. From real-time analytics to generative tools and process automation, AI is becoming a fundamental part of how modern businesses operate and compete.

Tales From the Trench: Building With LLMs and Honeycomb

AI discourse these days is all over the place. Depending on who you talk to, AIs are absolute flash-in-the-pan junk, or they’re the best thing since sliced bread. I want to cut through the noise, though, and see for myself what someone can do out here on the bleeding edge. Thus, I’m setting myself a challenge: write a usable—and useful—application with Claude Code, from soup to nuts. Here are the rules: With our ground rules established, let’s figure out our app!

Adaptive alerting: faster, better insights with the new metrics forecasting UI in Grafana Cloud

In Grafana Cloud, we offer a range of AI capabilities to support your observability needs, including a feature for forecasting on any of your metrics and coupling it with Grafana Alerting. This is critical functionality if you want to make the switch from reactive to proactive alerting, as troubleshooting a problem before it arises is an important part of modern observability.

Kubernetes CPU Limit: How to Set and Optimize Usage

Kubernetes makes it easy to scale applications. But when it comes to CPU resource management, a poorly tuned cluster can quickly become unstable or inefficient. For network engineers, setting CPU requests and limits correctly—and understanding the deeper implications—is essential for keeping workloads efficient, costs predictable, and noisy neighbors in check.

Introducing Sentry's Flutter SDK 9.0 - Logs, Session Replay, Feature Flags, and more

If you've ever had to debug a Flutter app after an error report that just says “Null check operator used on a null value,” you already know: context is everything. And context can be hard to come by when you’re juggling native code, Dart, async stack traces, and platform channels. With v9 of our Flutter SDK, we’re introducing some features to help you get even more visibility into what’s going wrong, with the insights to make it better. Here’s what’s new.

Defining SLA/SLO-Driven Monitoring Requirements in 2025

SLA/SLO-driven monitoring aligns your observability strategy with business objectives by defining measurable service targets and implementing monitoring systems that track progress toward those goals. Service Level Agreements (SLAs) represent commitments to users, while Service Level Objectives (SLOs) are internal targets that ensure you meet those commitments with a safety buffer. In 2025, organizations running distributed systems need monitoring that goes beyond basic uptime checks.

Serverless vs. Containers: A Comprehensive Guide to Choosing the Right Solution

In the rapidly evolving world of cloud computing, network engineers often need to decide between serverless computing and containerization. Both technologies offer unique advantages and are suited to different types of applications. This article aims to provide a comprehensive comparison of serverless computing and containers, helping network engineers make an informed decision based on their specific needs.

Cloud Cost Optimization Best Practices, Strategies, and Tools to Reduce Bills

As network engineers, you play a crucial role in managing cloud infrastructure that supports your organization’s applications and services. Cloud platforms offer immense flexibility and scalability, but without careful cost management, expenses can quickly spiral out of control.

OpenTelemetry for Go: measuring the overhead

Everything comes at a cost — and observability is no exception. When we add metrics, logging, or distributed tracing to our applications, it helps us understand what’s going on with performance and key UX metrics like success rate and latency. But what’s the cost? I’m not talking about the price of observability tools here, I mean the instrumentation overhead.

Observability trends in Japan: Insights from Grafana Labs' latest survey

Japanese organizations are focused on controlling costs and limiting complexity—and they might be getting ready to broaden their adoption at just the right time, according to analysis of a micro survey on observability recently conducted by Grafana Labs. Observability is an evolving space in Japan, and this is the first time Grafana Labs has run a Japanese version of our annual Observability Survey.

How to Set Up a Syslog Server: A Complete Step-By-Step Guide

Syslog servers are essential for centralized log management, helping network engineers monitor, troubleshoot, and secure network devices efficiently. This guide walks you through setting up a syslog server from scratch, focusing on practical steps using rsyslog on a Linux system—a common and robust choice for syslog collection. Windows does not have a native syslog server, so you need third-party software.

How to reduce Cloud Costs (with Open Source!)

We strongly believe that simple observability should be an innovation everyone can afford to benefit from, which is why Coroot is open source and includes cost monitoring for Azure, GCP, AWS, or your own custom settings. eBPF automatically tracks how each deployment impacts your cloud costs, so you can easily roll back changes when necessary and avoid a Lovecraftian monthly bill.

Everything You Need to Know About Event Logs

Your code passes locally, CI is green, and the deploy goes through. Then production throws a 500, and the trace isn’t helpful. This is where event logs help. A log captures timestamped records of what the app did: HTTP requests, DB queries, cache misses, retries, failures. These entries give you enough context to debug without reproducing the issue locally. Especially when dealing with distributed systems, logs are often the only consistent source of truth.
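As a rough illustration of what such a timestamped record might look like, here is a minimal Python sketch that emits one JSON line per event. The helper name `make_event` and the field names (`method`, `path`, `status`, `duration_ms`) are assumptions made for this example, not taken from the article.

```python
import json
from datetime import datetime, timezone


def make_event(event: str, **fields) -> str:
    """Return one JSON-encoded log line with a UTC timestamp (hypothetical helper)."""
    record = {
        # ISO 8601 timestamp in UTC so entries from different hosts sort consistently
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **fields,
    }
    return json.dumps(record)


# Example: record a failed HTTP request with enough context to debug it later
print(make_event("http_request", method="GET", path="/orders", status=500, duration_ms=182))
```

Keeping each entry as a single structured line makes it easier to filter by field later, which is one common reason to prefer this shape over free-form text.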

How to Use an SLA Uptime Calculator to Understand Service Availability

TL;DR: A Service Level Agreement (SLA) defines the required uptime for a service. An SLA uptime calculator helps convert uptime percentages into actual allowed downtime across different timeframes. This guide explains how these calculators work, why uptime matters, and how to monitor performance to meet SLA targets.
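The conversion such a calculator performs is simple arithmetic: allowed downtime is the period length multiplied by (100 - uptime %) / 100. The Python sketch below shows it. The function name `allowed_downtime` and the chosen periods (a 30-day month, a 365-day year) are assumptions made for illustration, and real SLAs may define their measurement windows differently.

```python
from datetime import timedelta

# Illustrative periods; an actual SLA may use calendar months or other windows
PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "30-day month": timedelta(days=30),
    "365-day year": timedelta(days=365),
}


def allowed_downtime(uptime_percent: float, period: timedelta) -> timedelta:
    """Return the maximum downtime permitted by an uptime target over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    downtime_fraction = (100 - uptime_percent) / 100
    return timedelta(seconds=period.total_seconds() * downtime_fraction)


if __name__ == "__main__":
    for name, period in PERIODS.items():
        print(f"99.9% uptime over a {name}: {allowed_downtime(99.9, period)}")
```

For a 99.9% target over a 30-day month, this works out to roughly 43 minutes of allowed downtime.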

New: Status modal integration is here!

At StatusGator, our mission is to make status transparency effortless—for you and your users. Today, we’re introducing a new way to keep your users informed in real time: the Status Modal Embed. Now you can display a compact, customizable modal on your website that shows the current status of your services—incidents, maintenance, or full operational status—all with a direct link to your full status page.

A guide to PHP exception handling

In most object-oriented languages, exceptions are an extremely powerful mechanism for dealing with unexpected situations that arise when running your code. PHP has supported robust exception handling since PHP 7.0. As you begin your programming journey, exceptions are a source of tremendous pain. Over time, you grow to appreciate the value they bring.

Invisible dependencies, visible impact: Lessons from the Google Cloud outage

June 12, 2025. A date most of the Internet won’t remember — but anyone relying on Google Cloud will. In the span of minutes, a routine quota update snowballed into global disruption. APIs stopped responding. Dashboards stayed green. And across continents, teams scrambled to figure out if the problem was theirs — or Google's. It wasn’t a cyberattack. It wasn’t a datacenter fire.

Lumigo Copilot AI Launches to Automate Root Cause Analysis and Remediation

Today, we’re announcing the general availability of Lumigo Copilot, the most intelligent AI-powered observability assistant on the market, built for the complexities of modern microservices. Copilot emerged from a simple realization: Distributed systems produce too much fragmented data across too many layers, making troubleshooting slow, reactive, and deeply manual. Copilot changes that.

What's Slowing You Down? How Intelligent Operations Accelerate Business Transformation

Your organization has a bold modernization roadmap. Cloud migration. Application updates. Enhanced customer experiences. New revenue streams. The business case is compelling, the stakeholders are aligned, and the budget is approved. Yet six months in, progress feels sluggish. The cloud migration is behind schedule due to performance issues no one anticipated. Application modernization stalled when the team discovered integration complexities that weren’t apparent during planning.

Grafana Tempo 2.8 release: memory improvements, new TraceQL features, and more

Grafana Tempo 2.8 is officially here, delivering new TraceQL features, performance improvements, and bug fixes, as well as some breaking changes. Watch the video below to learn more about the TraceQL features, or continue reading to get a quick overview of these and other updates. If you’re looking for something more in-depth for all of the changes that happened in this release, head over to the Grafana Tempo 2.8 release notes or the changelog.

Beyond Storage: How Time Series Databases Are Becoming Intelligent Data Engines

Data isn’t just a record of what happened—it shapes what happens next. Across industries, connected devices continuously stream time-stamped data that reflects the current state of machines, environments, and systems. This steady flow gives businesses a live view of their operations and the opportunity to catch issues early, adjust quickly, and operate more efficiently.

Fluent Bit Helm Chart: Simplify Log Collection in Kubernetes

Collecting logs in Kubernetes often starts as a simple goal, and quickly turns into a game of “where did that log line go?” Between sidecars, DaemonSets, and countless config options, it’s easy to get lost. Fluent Bit helps cut through the noise. It's fast, lightweight, and plays well with Kubernetes. And when you deploy it using Helm charts? The setup becomes way more manageable. This guide covers the how and the why, without overcomplicating the what.

Getting started with HaloPSA dashboards

The HaloPSA plugin is a new addition to SquaredUp, and helps you create live dashboards that surface the important metrics – giving you and your team a single pane of glass for help desk performance, asset visibility, and client reporting. Why it matters: If your team uses HaloPSA to manage tickets, assets, and clients, then you already know how vital that data is for running smooth operations.

Monitoring your Nextjs application using OpenTelemetry

Nextjs is a production-ready React framework for building single-page web applications. It enables you to build fast and user-friendly static websites, as well as web applications using Reactjs. Using OpenTelemetry Nextjs libraries, you can set up end-to-end tracing for your Nextjs applications. Nextjs has its own monitoring feature, but it is limited to measuring metrics like core web vitals and real-time analytics of the application.

Could your Palo Alto firewall do more to protect you against Shadow AI?

In recent months, my conversations with fellow technology leaders have consistently revolved around two key themes: how we leverage AI to drive innovation and efficiency, and how we mitigate the inherent risks associated with AI. However, I’ve noticed a concerning gap – while enterprises are busy strategizing the adoption of AI to enhance productivity, reduce costs, and outpace competitors, very few are addressing how AI is being actively used today by their own teams.

Boosting your AWS monitoring ROI: Strategies that deliver

AWS gives you the power to scale, deploy, and innovate at speed. However, with that speed comes a good amount of complexity. Services multiply, resources balloon, and performance issues sneak in when you least expect them. That’s where monitoring comes in. But it isn’t about checking boxes on dashboards. It’s about getting the most value for every dollar you spend, or maximizing your return on investment (ROI) from AWS monitoring. So, how do you actually do that?

Ops Explained: AIOps vs. DevOps vs. MLOps vs. Agentic AIOps

There’s a common misconception in IT operations that mastering DevOps, AIOps, or MLOps means you’re “fully modern.” But these aren’t checkpoints on a single journey to automation. DevOps, MLOps, and AIOps solve different problems for different teams—and they operate on different layers of the technology stack. They’re not stages of maturity. They’re parallel areas that sometimes interact, but serve separate needs.

The 1st Successful Commercial Moon Landing | Firefly's Blue Ghost Mission 1 | Grafana Everywhere

Firefly’s Blue Ghost Mission One successfully landed on the moon with the help of Grafana. In this behind-the-scenes talk, learn how real-time dashboards powered critical decisions during descent, tracked payloads, and helped operators visualize everything from footpad sensors to lunar gravity. Footage and photos courtesy of Firefly Aerospace.

Elastic - The Search AI Company

You may not know it, but you probably use Elastic every day. By combining the transformative power of AI with our deep expertise in search and vector databases, we are changing what's possible with search. Our Search AI Platform empowers organizations to have a conversation with all their data, build powerful GenAI applications, immediately diagnose root causes in observability, and hunt for threats at enterprise scale.

Top Five Reasons Telemetry Pipelines Should Be on Every Engineer's Radar

You’ve probably felt the pain: data pouring in from every corner of your stack, tools choking on volume, dashboards lagging behind reality, alerts firing (or worse, not firing) without context. If that sounds familiar, it’s time to get serious about telemetry pipelines. Whether you're an SRE trying to stabilize a flapping service or a developer navigating multi-cloud chaos, a telemetry pipeline helps you take control of the data firehose.

Datadog + OpenAI: Codex CLI integration for AI-assisted DevOps

We are exploring how we can help on-call engineers troubleshoot incidents more effectively by providing the OpenAI Codex agent with access to real-time observability data in terminals. We've developed an integration and new tool visualizations that connect OpenAI's Codex CLI to the new Datadog MCP server. In this post, we'll share what we've been experimenting with: enabling an AI agent to retrieve production metrics, logs, and incidents from Datadog in real time and act on that context.
Sponsored Post

The Network-First Advantage: How Fabrix.ai Redefines Observability from the Ground Up

Modern enterprises today often find themselves in a peculiar predicament: they are drowning in a deluge of telemetry data—including logs, metrics, and traces—yet paradoxically remain blind to what truly matters. Despite making substantial investments in observability tools, teams frequently find themselves reacting to incidents rather than proactively preventing them, with alerts flooding dashboards often devoid of critical context.

Atlassian Confluence Monitoring on Microsoft SCOM

As part of a customer project, we developed a custom Confluence Management Pack for Microsoft System Center Operations Manager (SCOM). This tailored solution enables IT operations teams to monitor key performance and health metrics of Confluence environments, ensuring knowledge-sharing platforms remain available and performant.

5 Ways to Optimize Your OpenSearch Cluster

OpenSearch is a powerful, scalable search and analytics engine that can do amazing things for logging, observability, and full-text search. But like any distributed system, it only performs well if you keep it properly tuned and healthy. Ignore it, and you risk slower queries, higher costs, and even data loss. Here are five practical tips to keep your OpenSearch cluster running smoothly and efficiently.

Guided by Trust: ScienceLogic Earns TrustRadius Top Rated for the Sixth Year Running

In a world where IT complexity is accelerating, trust has never been more essential. At ScienceLogic, trust isn’t just a value—it’s our compass. It guides how we innovate, how we serve, and how we grow alongside our customers. That’s why we’re proud to share that ScienceLogic SL1 has once again been named a Top Rated product on TrustRadius—for the sixth consecutive year. This recognition is more than a milestone.

Is it the network... or the CDN?

When performance issues strike, the finger pointing begins. But here's the catch: CDNs aren't just "someone else's responsibility." They directly impact the user experience, and if they're misbehaving, your network team will be the first to get the call. That’s why CDN monitoring is essential. CDNs are dynamic and performance can vary dramatically across regions, ISPs, or even end users. When something goes wrong, it looks like a network issue, unless you have visibility into CDN behavior.

An Easy Guide to Getting Started with Elastic APM

Code in production will break. Maybe a request takes too long, maybe it fails quietly, or maybe it works fine one minute and falls over the next. Logs can help, sure—but they don’t always show the full picture, especially when performance issues are involved. Elastic APM gives you a clearer view. It traces what your application is doing from incoming requests to database queries and everything in between.

No Sandwich, No Security: What This Week's Lunch Taught Me About DNS Blind Spots

Like many shoppers in the UK this week, I found myself staring at half-empty shelves in my local grocery store. In a small but frustrating twist, my usual sandwich, chicken mayo on malted bread, was nowhere to be found. The disruption wasn’t just about lunchtime preferences; it was part of a broader impact from cyberattacks that hit major UK retailers, including Co-op and Marks & Spencer.

How to Configure Lightweight Browser Tracing for Debugging at Scale

Sentry’s auto-instrumentation, using BrowserTracing, is convenient. You can get interesting insights about your frontend application out-of-the-box, such as whether slow and failing API calls are hurting your user experience (summarized in Network Requests), or how your website stacks up against industry standards for performance (summarized in Web Vitals).

The Architecture Loop: How Early Can We Decide Speed, Stack and Scale?

In 2025, many companies are reckoning with the true cost of microservices, especially as cloud bills grow and engineering teams face coordination fatigue. The move back to monoliths is gaining traction, particularly for startups and mid-sized businesses who need: At Scout APM, we’ve been thinking about these shifts not just from a monitoring perspective, but from a broader architectural one.

The best of both worlds with the Splunk Cloud Platform

This video describes the value of migrating to the Splunk Cloud Platform, a comprehensive environment that offers everything from efficiency and sustainability to agility and security, plus lower costs. How can you be sure? With the Splunk Cloud Calculator we’ll show you the real dollar savings you could get from migrating to the Splunk Cloud Platform.

7 critical Active Directory metrics every IT admin should monitor

Across vast enterprise networks, Active Directory (AD) serves as the foundational layer for identity and access management. It's the critical service enabling user authentication, managing authorizations, and ensuring smooth operations across your network. Given its central role, any hiccup in AD can lead to widespread outages, security vulnerabilities, or frustrating user experiences.

Data points per minute in Grafana Cloud: What you need to know about DPM

If you’re working with metrics in Grafana Cloud, chances are you’ve come across DPM (data points per minute). It shows up in usage dashboards, invoice breakdowns, and occasionally pops up in Slack when your ingestion numbers start looking suspicious. DPM can also be seen in the Grafana Cloud billing and usage dashboard, which is available by default in every Grafana Cloud account. It helps you understand how much data you’re sending—and whether it’s more than you need.

Achieving Comprehensive Network Observability for VMware Cloud Foundation

Private cloud infrastructure adoption is accelerating rapidly. This move is driven by the ongoing “cloud reset” as leaders rethink their hybrid and multi-cloud strategies, seeking greater control, security, and flexibility for their IT workloads. As a matter of fact, leaders in 69% of organizations are considering repatriating workloads, and one-third already have.

Cisco and Splunk Strengthen Enterprise Digital Resilience in the AI Era

In an era where hybrid environments and AI-driven innovations redefine enterprise operations, organizations face increasing complexity, disruption, and vulnerability in their systems. To overcome this growing challenge, Cisco and Splunk are working together to harness the power of AI to help customers ensure that digital resilience is an inherent part of their systems.

Yes, Sentry has an MCP Server (...and it's pretty good)

Unless you’ve been living under a rock, “MCP” is probably a term you’ve heard thrown around in the AI space. Each of the editors and LLM providers has been racing to add and enhance their MCP support. Sentry was fortunate enough to be included in Anthropic’s release announcements for MCP.

Implementing Grafana Play privacy policies with Grafana k6: A behind-the-scenes look

Grafana Play is a free and publicly accessible sandbox environment that allows users to explore and learn Grafana without setting up their own instance. Grafana Play comes preloaded with ready-made sample dashboards, and showcases how to work with different data sources, create visualizations, and use advanced Grafana features.

Getting OpenTelemetry Data Into Graylog

OpenTelemetry is emerging as the common framework for collecting observability data, and for good reason. It’s vendor-neutral, open source, and designed to collect traces, metrics, and logs in a consistent way. But while most of the buzz is around tracing and metrics, let’s not forget: logs are still the backbone of investigation and response. That’s why Graylog now supports native collection of OpenTelemetry data over gRPC.

The truth you can't afford to miss: Listen as your logs spill the tea

When you hear “spill the tea,” you probably think of pop culture, not outages or anomalies. But the origin may surprise you: before it was slang for juicy gossip, ‘tea’ was actually ‘T,’ which represents truth. We know what you’re thinking: “Are you trying to say ‘spilling the tea’ is a good thing?” And yes, that’s exactly what we’re saying, especially when your logs are doing the talking.

Why companies keep migrating to Coralogix

As businesses scale, so do their observability needs, but many find themselves stuck with costly, inflexible platforms that no longer serve them. Despite mounting frustrations, the complexity of migration keeps companies from making a change. The risk of losing critical data, disrupting workflows, or rebuilding everything from scratch often outweighs the benefits of switching. Most vendors offer little to no migration support, forcing teams to manually reconfigure dashboards, alerts, and integrations.

Accelerate Oracle Cloud Infrastructure monitoring with Datadog OCI QuickStart

Datadog’s Oracle Cloud Infrastructure integration enables you to collect metrics and logs from your entire OCI stack and monitor them within a single platform alongside other third-party technologies. Datadog’s new OCI QuickStart is a fully managed, single-flow setup experience that helps you monitor your OCI infrastructure and applications in just a few clicks.

The Mindset Shift: IT Operations to Security - SolarWinds TechPod 099

In this episode, hosts Sean Sebring and Chrystal Taylor engage with actual rock star Chris Greer, a Security Engineering Manager at SolarWinds, to explore the multifaceted world of cybersecurity. Chris shares his unconventional journey from being a musician to entering the IT field, emphasizing the importance of certifications and the mindset shift required when transitioning from IT operations to security.

Integrations made easy with VictoriaMetrics Cloud

VictoriaMetrics Cloud continues to evolve as the most efficient, scalable and open platform in the observability landscape. In our last Q1 update blogpost, we shared new features such as seamless OpenTelemetry integrations, new Organizations support, and improvements in the Explore UI and APIs. This time we wanted to take a minute to showcase how we’re taking the interoperability journey very seriously.

CI/CD Observability with OpenTelemetry - A Step by Step Guide

In the fast-paced world of CI/CD, understanding the performance and behaviour of your pipelines is crucial. GitHub Actions has become a popular choice for automating builds and deployments, but anyone who's debugged a flaky workflow or long-running job knows how challenging it can be to get visibility into what's happening under the hood. We usually rely on build logs, timing data, or guesswork when something goes wrong.

Built for Impact: What Happens When LogicMonitor Edwin AI Meets Infosys AIOps Insights

Today’s IT environments span legacy infrastructure, multiple cloud platforms, and edge systems—each producing fragmented data, inconsistent signals, and hidden points of failure. This scale brings opportunity, but also operational strain: fragmented visibility, overwhelming alert noise, and slower time to resolution. With good reason, public and private sector organizations alike are moving beyond basic visibility, demanding hybrid observability that’s context-aware and action-oriented.

DASH by Datadog 2025 Keynote

Join us at the 2025 DASH Keynote and be the first to experience Datadog's latest product innovations. This year, we're unveiling next-generation observability features, innovative ways to secure your AI workloads, and powerful agentic AI capabilities throughout the Datadog platform. Discover the new ways your teams can observe, secure, and act in the age of AI.

Create and monitor LLM experiments with Datadog

To efficiently optimize your LLM application before pushing to production, you need a comprehensive testing and evaluation framework. By running experiments, you can optimize prompts, fine-tune temperature and other key parameters, test complex agent architectures, and understand how your application may respond to atypical, complex, or adversarial inputs. However, it can be difficult to manage your experiment runs and aggregate the results for meaningful analysis.

Introducing Bits AI SRE, your AI on-call teammate

Getting paged pulls engineers away from meaningful work, yet incident response in many organizations remains manual, reactive, and draining. An alert fires and teams scramble to find the root cause, relying on siloed knowledge, incomplete context, and a few on-call experts who are already stretched thin. The rise of AI coding agents has only intensified this challenge: As teams ship code faster with less human oversight, production systems grow increasingly complex and harder to understand.

How To: SLA Monitoring & Reporting: Are You Getting What You Paid For?

Are you tired of feeling like you're in the dark about the services you're paying for? Are you getting what you paid for? Many businesses are in the same boat when it comes to Service Level Agreement (SLA) monitoring and reporting. It’s great having an SLA for the provision of a service, but you need to go further to really understand if the standards specified in the SLA are actually being met. That’s where SLA monitoring and reporting comes in.
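As a rough illustration of what checking an SLA can mean in practice, the sketch below compares measured availability against a target. The function names, the 30-day window, and the 99.9% target are assumptions made for this example; they are not taken from the article.

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Return the share of the reporting window the service was up, in percent."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def meets_sla(total_minutes: float, downtime_minutes: float, target_percent: float) -> bool:
    """True when measured availability reaches the agreed target."""
    return availability_percent(total_minutes, downtime_minutes) >= target_percent


if __name__ == "__main__":
    window = 30 * 24 * 60  # hypothetical 30-day reporting window, in minutes
    downtime = 50          # hypothetical minutes of recorded downtime
    print(f"availability: {availability_percent(window, downtime):.3f}%")
    print("meets 99.9% target:", meets_sla(window, downtime, 99.9))
```

With these assumed numbers, a 99.9% target allows about 43 minutes of downtime over 30 days, so 50 minutes of downtime misses it.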

Moving from Relational to Time Series Databases

I’ve been building apps with SQL Server for years. Everything worked well until I started dealing with sensor data, stock trade volume, and IoT telemetry. As the volume of time-stamped records grew into the millions, I saw relational databases struggling with workloads they weren’t designed for. That’s when I explored time series databases. The performance improvements were significant, but what surprised me was the mental shift required.

Datadog MCP Server: Connect your AI agents to Datadog tools and context

As development teams adopt AI-powered tools and build services that make use of AI agents, they want to extend their AI capabilities to incorporate familiar tools and observability data. However, AI agents struggle with regular API endpoints and frequently fail when parsing complex nested JSON hierarchies or incorrectly handling errors. As a result, these agents often fail to retrieve relevant results.

Optimize and troubleshoot AI infrastructure with Datadog GPU Monitoring

As organizations bring more AI and LLM workloads into production, the underlying GPU infrastructure that supports these workloads becomes even more critical in ensuring these workloads remain fast, reliable, and scalable. Inefficient GPU resource usage, for instance, can lead to longer runtimes and reduced throughput, negatively impacting overall model performance. Additionally, idle and underutilized GPUs can quickly drive up costs and lead to needless spending.

How to Monitor Kafka Producer Metrics

Your Kafka producer pushed a million messages yesterday. Nice. But can you tell if they all made it? Or why latency spiked at 2 PM? Producer metrics help you determine that. They expose how long messages take to send, whether messages are getting stuck, and whether retries are piling up. Let’s go over which ones help while debugging and how to monitor them.

Automatically identify issues and generate fixes with Bits AI Dev

Developers lose hours each week to a familiar troubleshooting loop: chase down telemetry across dashboards, decipher vague errors, and juggle alerts to find the signal worth fixing. Production issues, performance regressions, and security vulnerabilities all demand attention, but they often come with little context for taking action.

Improve performance and reliability with Proactive App Recommendations

As your organization grows, you may operate in increasingly complex environments and manage more services and larger teams to maintain them. Evolution like this can lead to an explosion of telemetry data from across your stack, including metrics, traces, logs, and frontend interactions. The benefit of greater visibility is often outweighed by the challenge of acting on the data you collect, and you can easily fall behind on implementing the fixes your services require to operate reliably and efficiently.

Introducing Seer: Sentry's AI Debugging Agent

There's a lot more context to an error than the message blinking in red on your screen. Seer understands the context of your application and everything behind that error. Seer collects information from the Stack Trace, Logs, Traces and Spans, Profiles, and the code from your GitHub repo and uses it to understand what's causing your issues, and propose fixes.

Ensure trust across the entire data life cycle with Datadog Data Observability

As data systems grow more complex and data becomes even more business-critical, teams struggle to detect and resolve issues that impact data quality, reliability, and, ultimately, trust. Engineers have to rely on manual checks and ad hoc SQL queries to catch data quality issues—often after teams relying on the data have noticed something has gone wrong.

How IPM helped a top tech brand catch an OpenAI outage before it became a crisis

Today’s digital businesses are more interconnected than ever. Industry research shows that 74% of organizations now take an “API-first” approach, and the average application is powered by between 26 and 50 APIs. While this accelerates innovation, it also introduces new risks: when an external provider fails, the impact can be immediate and far-reaching.

An Autonomous Ship is Set to Circumnavigate the World Using Docker, Grafana, & Starlink: Project Bob

Join Andrew McCalip of Varda Space Industries as he builds Project Bob—a DIY, solar-powered, autonomous ship aiming to circumnavigate the globe using open source tools like Grafana, Raspberry Pi, and Starlink.

How to Collect .NET Application Logs with OpenTelemetry

Observability is essential for maintaining and scaling modern applications. With .NET 8, Microsoft has enhanced support for observability using OpenTelemetry. In this post, we explore how to monitor .NET 8 application logs with SigNoz, an open-source observability platform, using the OpenTelemetry Protocol (OTLP) exporter.

Top 15 Distributed Tracing Tools for Microservices in 2025

In one of our previous blogs, we discussed distributed tracing in depth. We examined why distributed tracing is critical and its components - spans and trace context. You can check the complete guide here: What is Distributed Tracing and How to Implement it with Open Source? Here, we'll look at some of the best distributed tracing tools. We'll see what each of them offers so that you can choose the right tool for your monitoring and observability requirements.

Top 13 Open Source APM Tools [2025 Guide]

Choosing the right APM tool is critical. How do you know which is the right one for you? Here are the top 13 open-source application performance monitoring (APM) tools that can solve your monitoring needs. Open-source APM tools have added benefits over their SaaS counterparts. They are more transparent, as you can verify their source code, and you can use them without going through the pains of obtaining approvals usually required for using a third-party vendor tool.

Auto-Instrument Everything with eBPF: Grafana Beyla + OpenTelemetry in Action | Homelabs

Grafana Beyla is a powerful eBPF-based auto-instrumentation tool for application and network observability. In this session, see how Beyla captures RED metrics and traces with zero code changes, and how it fits into the OpenTelemetry ecosystem. Perfect session for SREs, devs, and home labbers alike.

You Can Build Your Own AI Agent for ITOps-But Should You?

Most internal AI projects for IT operations never exit pilot. Budgets stretch, priorities shift, key hires fall through, and what started as a strategic initiative turns into a maintenance burden, or worse, shelfware. Not because the teams lacked vision. But because building a production-grade AI agent is an open-ended commitment. It’s not just model tuning or pipeline orchestration. It’s everything: architecture, integrations, testing frameworks, feedback loops, governance, compliance.

Smarter Telemetry Pipelines: The Key to Cutting Datadog Costs and Observability Chaos

Log volume is exploding, costs are rising, and most teams are stuck duct-taping together short-term fixes. During our webinar, "Optimizing Log Management in Datadog: Cut Costs Without Losing Insights," we discuss how DevOps and engineering leaders are navigating the growing pains of observability, especially in environments where tools like Datadog are mission-critical but challenging to manage. Here’s a recap of the key takeaways.

Migrate historical logs from Splunk and Elasticsearch using Observability Pipelines

Migrating to a new logging platform can be a complex operation, especially when it involves both active and historical logs. Observability Pipelines offers dual-shipping capability, making it easy to route active logs to your new platform without disrupting your log management workflows. But migrating years worth of historical logs—which are critical for investigating security incidents and demonstrating compliance with applicable laws—requires a different approach.

It's The End Of Observability As We Know It (And I Feel Fine)

In a really broad sense, the history of observability tools over the past couple of decades has been about a pretty simple concept: how do we make terabytes of heterogeneous telemetry data comprehensible to human beings? New Relic did this for the Rails revolution, Datadog did it for the rise of AWS, and Honeycomb led the way for OpenTelemetry.

Mastering NodeJS Performance Monitoring - A Practical Guide using Open Source Tools

Node.js powers some of the fastest-growing web applications, but its single-threaded nature makes it vulnerable to memory leaks and CPU spikes. To keep your app running smoothly, especially in production, you need more than just web server logs — you need complete visibility across the entire stack.

How to Integrate OpenTelemetry Collector with Prometheus

Pulling observability data together is rarely clean. Metrics come from everywhere, formats vary, and making sense of it takes some work. OpenTelemetry Collector and Prometheus fit perfectly here. The Collector handles ingestion and processing from different sources, while Prometheus stores and queries the data. Simple, effective, and no vendor lock-in. In this blog, we cover how to integrate the Collector with Prometheus, common pitfalls, and ways to control costs.

A Complete Guide to Linux Log File Locations and Their Usage

Linux log files are text-based records that capture system events, application activities, and user actions. They're stored primarily in the /var/log directory and provide essential information for debugging issues, monitoring system health, and maintaining security. This guide covers the most important Linux log files and a few detailed techniques for reading and analyzing them.

Site24x7: Synthetic monitoring vs. Real user monitoring

Want to know the difference between synthetic monitoring and real user monitoring (RUM)? You're not alone. In this video, we break down both monitoring types, show how they work, and explain when to use each, so you can build a monitoring strategy that gives you full visibility into your website or application performance. Whether you're a DevOps engineer, SRE, or IT admin, this video will help you make smarter monitoring decisions.

Lunar-level observability: How Firefly Aerospace used Grafana to monitor its historic moon landing

On March 2, 2025, Firefly Aerospace made history. The company — a space services firm that offers safe, reliable, and economical access to space — completed the first fully successful lunar landing by a commercial provider with its Blue Ghost Mission 1. But behind the headlines and highlight reels was a team of dedicated engineers, years of preparation, and a mission control center outfitted with Grafana dashboards.

Successful Launch - Then Came The Problems

About 15 years ago, I worked at a company building network security appliances (with ARM-based network processors) and was responsible for the development of custom Linux firmware. The product launch was successful; we shipped and managed a large fleet of devices in the field. After a few firmware releases, we received alerts from the device management system telling us that there were intermittent problems. We remoted into the appliances but could not reproduce the error.

Top 5 Open Source Log Management Tools (and How to Choose the Right One)

Managing logs at scale is no longer just about storing text—it’s about gaining insights fast, keeping systems healthy, and troubleshooting in real time. With cloud-native architectures becoming the norm, the pressure is on for modern teams to adopt log management tools that are fast, scalable, and easy to use. But with so many options, how do you choose the right one?

The One Where We Show You Copilot Editor

Copilot Editor is like an AI-powered Rosetta Stone for telemetry. It helps Cribl users take raw, messy telemetry data and turn it into standardized, analytics-ready formats. The most important piece? It puts YOU in control. Our human-in-the-loop design means that users have full control over and visibility into what’s happening with their critical data, preventing AI-induced mistakes. Watch this fun demo with the AI product team to show Copilot Editor's true value to the average Cribl user!

Fluentd vs Logstash: In-Depth Comparison of Two Popular Log Collectors 2025

In modern observability stacks, log collection is a critical component. Among the most widely adopted log collectors are Fluentd and Logstash. Both tools are designed to collect, process, and forward logs to various destinations like Elasticsearch, Kafka, and cloud services. However, Fluentd and Logstash differ significantly in their design, performance, plugin ecosystems, and user experiences.

MCP = Observability + Code, a Real-life Example

Our bot is hitting an error. We can see it in the distributed trace. Here, see what happened when we noticed it: Austin fired up Claude Code (hooked up to Honeycomb with its MCP tool) and got it to find the error, fix it, deploy, and check that the fix worked. It got a little overconfident at first, but the ending is happy. IRL this took 22 minutes; the video speeds up the AI agent interactions and cuts out waiting. This video includes Austin Parker, Jessica Kerr, and Ken Rimple.

DX Operational Observability: Five New, Powerful Capabilities

DX Operational Observability (DX O2), our next-gen AIOps and Observability product, continues to provide new features and enhancements for practitioners across IT. DX O2 delivers a host of enhancements designed to empower IT operations, DevOps, and SRE teams. In this post, I introduce five powerful enhancements, outline steps to get started, and describe some of the benefits, which include deeper insights, improved efficiencies, and a more unified observability experience. Here are the five enhancements.

Create rich, up-to-date visualizations of your AWS infrastructure with Cloudcraft in Datadog

As your cloud environment grows more complex and dynamic, it becomes more difficult to maintain up-to-date reference diagrams that visualize its components and are available to all teams. As a result, teams often end up lacking the visibility they need to understand, manage, and troubleshoot their cloud infrastructure and applications.

Database observability: How OpenTelemetry semantic conventions improve consistency across signals

Databases are a crucial part of modern systems, which means database observability is incredibly important, too. However, gathering information on them can be complex, variable, and tricky to instrument in a consistent way. OpenTelemetry is helping to change that, and one of the most important aspects in making it work is a set of shared rules called semantic conventions.

Top Features of Splunk Observability Cloud for Engineers

In this video we’ll walk you through a demonstration of Splunk Observability Cloud’s key capabilities. You’ll see how you can monitor Kubernetes cluster health in Infrastructure Monitoring, and alert on your services’ health using AutoDetect Detectors and Alerts. We’ll then take a look at traces and metrics in APM, and use Related Content to find correlated log entries of error traces. Then we’ll use AlwaysOn Profiling to troubleshoot long duration traces for our service.

Monitoring for Financial Services: Reducing Costs, Ensuring Reliability

Fintech has reshaped financial services, using technologies like machine learning and blockchain to deliver faster, smarter, more user-friendly experiences. Challenger banks, open banking apps, digital payments, and investment apps have set a new standard—leaving traditional institutions racing to keep up. But staying competitive isn’t just about building digital products—it’s about making them reliable.

Easy Method for Monitoring MinIO Performance Using Telegraf

MinIO is a high-performance, S3-compatible object storage server built for cloud-native applications. It’s open-source, lightweight, and incredibly fast which makes it a solution for developers who need to store and serve unstructured data like images, logs, or backups. Whether you’re building a self-hosted alternative to Amazon S3 or running MinIO as part of a local development pipeline, it fits into modern containerized environments.

What's Inside InfluxDB 3.1

InfluxDB 3.1 is now available for both Core and Enterprise editions, bringing significant improvements that make managing high-volume, high-velocity time series data even easier, faster, and more secure. InfluxDB 3 Core is the free, open source edition of InfluxDB 3—a high-speed, recent-data engine licensed under MIT and Apache 2. InfluxDB 3 Enterprise is the commercial version of Core, adding support for longer-term historical queries, high availability, enhanced security, and more.

The Brain Behind the Pings: Understanding the Pingmesh Control Plane

In today’s interconnected world, a fundamental question plagues every network administrator and SRE: “Is my network running well?” The answer, often elusive, is precisely what Pingmesh aims to provide. By deploying a vast fleet of specialized probe agents, Pingmesh continuously monitors critical network health metrics, including latency, packet loss, jitter, and custom reachability checks, providing an unparalleled view into your network’s performance.

Monitoring ECS Metrics: A Guide for Developers and Operations Teams

For anyone leveraging cloud computing, Amazon Elastic Container Service (ECS) continues to provide a seamless solution for managing containerized applications. AWS Fargate takes this cloud-native architecture a step further by allowing you to run containers without servers or clusters. As a serverless offering for ECS, Fargate provisions compute capacity and scales it based on demand.

From Downtime to Uptime: Monitoring Tools and Techniques for Systems, Websites, APIs, and More

Recently, while visiting a friend in a local hospital, I found myself facing a frustrating distraction: trying to pay parking fees using USSD (a mobile text-based system for quick transactions). The service was either painfully slow or not working at all. I wasn’t alone. Other visitors were just as exasperated, and parking attendants stood idle, their handheld devices frozen in endless loading loops.

AI + Dark Mode: Introducing AI-Powered Insights and The Long Awaited Dark Mode

Join the live stream at 11 am ET, here. Launch Week’s Friday drop delivers two of the most-requested upgrades we’ve ever shipped: AI-powered insights and dark mode. Together, they turn Bindplane into a cooler, and smarter, place to manage observability and SecOps telemetry. A full suite of extensive AI features will be rolling out over the coming weeks. This is just the beginning!

3 Reasons Why You Should Use Custom Playwright Fixtures

In this video, Stefan Judis, Playwright ambassador, explains the power of Playwright fixtures while running tests in JavaScript or TypeScript. Learn how to streamline your test setup, remove repeated code, and leverage custom fixtures for cleaner and more efficient end-to-end tests. By the end of this video, you'll have a clear understanding of why you should use Playwright's native architecture to structure your testing project.

Working with GPUs on Kubernetes and making them observable

GPUs are everywhere, powering LLM inference, model training, video processing, and more. Kubernetes is often where these workloads run. But using GPUs in Kubernetes isn’t as simple as using CPUs. You need the right setup. You need efficient scheduling. And most importantly, you need visibility. This post walks through how to run GPU workloads on Kubernetes, how to virtualize them efficiently, and how Coroot helps you monitor everything with zero instrumentation or config.

Why You Need Real User Monitoring to Really Understand Your Web Performance

Great Lighthouse scores, but your site is still slow. Sound familiar? You’ve run PageSpeed Insights, Request Metrics, and every other synthetic test you can find. Your scores look great. But your analytics shows users bouncing, conversions dropping, and complaints about “slow pages.” What’s going on? The answer is simple: synthetic testing only tells you how your site performs in a test, not how it performs for real users in the real world.

Why Cribl Copilot Editor is Built for the Human, First and Foremost

I’m genuinely excited about what we're rolling out with Copilot Editor, an update to our AI that’s truly packed with new capabilities designed to help you automate pipeline development. You can read about these capabilities here. I wanted to take a moment to share our thinking on a core principle that guides how we build, especially regarding the impactful, and sometimes daunting, world of generative AI.

Blueprints: Ready-Made Processor Bundles For Your Telemetry Pipelines

We’ve noticed a lot of our customers spend countless hours building and configuring processors. Parsing JSON, standardizing log formats, normalizing timestamps, masking PII, de-duplicating logs: the list never ends. Most work revolves around recreating the same processor bundles in multiple processor nodes. Bindplane’s new Blueprints solves that boring, repetitive work by providing pre-built processor bundles you can drop into any pipeline with a single click.

How to Configure and Optimize Prometheus Data Retention

Prometheus can be lightweight to start with, but once it’s in production, storage usage tends to grow faster than expected. Managing how long data is kept becomes critical, especially when you're working with limited disk space or tight budgets. This guide outlines the key concepts behind Prometheus data retention, how to configure it effectively, and what to watch out for.
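To make the trade-off between retention and disk space concrete, here is a minimal capacity-planning sketch. It uses the rule of thumb from the Prometheus storage documentation (retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression). The function name and the example figures are assumptions for illustration, and real usage depends on your series churn and compression.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB size estimate: retention seconds * ingest rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical workload: 15 days of retention at 100,000 samples per second.
    needed = estimate_disk_bytes(retention_days=15, samples_per_second=100_000)
    print(f"estimated disk: {needed / 1e9:.1f} GB")
```

An estimate like this can inform the values passed to the `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size` flags, which set the age and size limits respectively.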

Shift-Left Monitoring for GitHub and Vercel Workflows

A recent LinkedIn poll by Peter Zaitsev asked: “What is the most common preventable cause of downtime in your environment?” Guess what most respondents said it was? Surprise, surprise – the top answer is Deploying Broken Code, with 57% of respondents selecting it. This reinforces how critical it is to catch issues before they hit production.

How to Monitor Frontend Memory Usage

First of all, by frontend memory usage I mean the amount of memory that a user’s browser needs when using your website or webapp. Secondly, do you have any idea how much browser memory your website or webapp requires? Or do you know if or how much the memory footprint of your website/webapp has changed over the last few months? Or after the recent changes or releases you made? I’m guessing you don’t. Yet, this is important to monitor to avoid a bad user experience.

Announcing Go tracer v2.0.0

Datadog has long supported the monitoring of instrumented Go applications through our Go tracer v1. As the Go ecosystem has continued to mature, we’ve been hard at work collecting feedback and improving upon the tracer’s capabilities and usability features. We are now thrilled to announce the release of our Go tracer v2.0.0. This major update includes better security and stability, and a new and simplified API.

Beyond Shift Left: Engineering Leaders Increase Speed and Resilience With Observability

We recently had the privilege of hosting several industry experts and technology executives across platform strategy, SRE, and engineering enablement for breakfast at our Observability Day in London. We noted that they’re all facing the same fundamental tension: deliver faster, scale smarter, stay resilient, and somehow get ahead of what’s coming next. But how do you move fast without breaking things? And how do you prove the value of the things you don’t break?

Solve your MTTR mysteries faster with Sumo Logic

Picture this: a crime scene where the evidence is scattered across five different rooms. There’s a footprint in one, a shattered window in another, a stray shoe on the stairs, and a witness across the street, who only saw part of what happened. Each clue matters in solving the case, but none of them tells the full story on their own.
Sponsored Post

Smarter alerts using P75 for more signal and less noise

We've rolled out a new feature in Raygun Alerting that gives you more control over how you track and respond to performance regressions. Starting today, you can use the 75th percentile (P75) as a filter option for page performance data in Real User Monitoring, such as Core Web Vitals and page load time, right alongside the default 'Average'. This option is available under the "Page/XHR performance change" condition and supports all the Web Vitals metrics we track. Let's break down why this matters, when you should use P75, and how it gives you better, faster insights into how real users are experiencing your site or app.
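A small sketch can show why a percentile may be a steadier signal than an average. The sample load times below are made up, and the nearest-rank method is an assumption chosen for simplicity; the post does not say how Raygun computes P75.

```python
import math
from statistics import mean


def percentile_nearest_rank(values: list[float], p: float) -> float:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical page load times in seconds; one slow outlier skews the average.
    load_times = [1.1, 1.2, 1.3, 1.2, 1.4, 1.3, 1.2, 9.8]
    print("average:", mean(load_times))
    print("P75:", percentile_nearest_rank(load_times, 75))
```

In this made-up sample the single 9.8-second page load pulls the average above 2 seconds, while P75 stays at 1.3 seconds, which is closer to what most of these sessions experienced.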

Scaling Observability: How We Designed Bindplane to Manage 1,000,000 OpenTelemetry Collectors

Join the live stream at 11 am ET, here. Platform teams tend to start with just one, or in some cases a handful, of OpenTelemetry (OTel) Collectors, usually running in gateway mode. They then embrace the benefit of a vendor-neutral, standardized telemetry collector for unified logs, metrics, and traces.

Why Does Your Network Get Blamed When Trouble Lies Beyond the Firewall?

The familiar scene unfolds: Critical applications are sluggish, user complaints are mounting, and the IT war room is buzzing. Eyes quickly dart towards the network team. It’s an almost instinctual reaction. But what happens when the problem isn't within the corporate LAN or even the data center? What if the real culprit lurks somewhere in the vast, untamed wilderness of the internet, a cloud provider's backbone, or a third-party SaaS application’s infrastructure?

The 3 smart updates to our Jira plugin

The Jira plugin is one of our most-used integrations and for good reason. Teams rely on it daily to stay on top of work, manage issues, and ship on time. As more people leaned on it, we saw a chance to make the experience even smoother. So, we gave it an upgrade. We’ve refreshed the out-of-the-box dashboards, simplified the data streams, and improved the overall experience. So, let’s take a closer look at what’s changed.

Splunk on SGTech - Tech Transforms Life

With the explosion of data across endless environments, devices and applications, organisations and government agencies are faced with a pressing challenge of getting their data house in order to achieve efficiency, transparency, security and governance. Learn how Splunk helps businesses like Singapore Airlines, LG Electronics and DANA fintech group transform complex data into valuable business outcomes and strengthen digital resilience.

How to Improve Uptime and Achieve Root Cause Analysis (with Open Source!)

Observability doesn’t begin and end at telemetry or your ELK stack: most open source or vendor tools require configuration and dashboard customization, and may not actually pinpoint the data you need to mitigate system risks. Coroot was designed to solve the problem of time-consuming root cause analysis: it handles the full observability journey, from collecting telemetry to turning it into actionable insights. We also strongly believe that simple observability should be an innovation everyone can afford to benefit from, which is why our software is open source.

A Developer's Framework for Selecting the Right Tracing Vendor

Distributed tracing tracks requests as they flow through microservices, revealing bottlenecks, failures, and performance patterns. Without proper tracing, debugging production issues becomes guesswork—especially in complex architectures with dozens of services. Modern applications generate millions of traces daily. The right vendor helps you extract actionable insights without drowning in data or breaking your budget.

Why Datadog Falls Short for Log Management and What to Do Instead

Datadog may be the default choice for all-in-one observability, but its logging experience takes a back seat to the broader platform. Logs are primarily designed to feed into metrics and traces, which leads to tradeoffs such as slower search, complex workflows, and a UI that isn’t optimized for log investigations. As a result, Datadog doesn’t align with how developers actually troubleshoot.

How to Log Into a Docker Container

When your Docker container isn't behaving the way you expect, you need to get inside and see what's going on. Maybe your app is throwing errors, a service won't start, or you just need to check some configuration files. Getting into a running Docker container is simpler than you might think, but there are several ways to do it depending on your situation. This guide shows you exactly how to log into Docker containers, troubleshoot common issues, and debug your applications effectively.

Map, Transform, Filter: How Copilot Editor Helps Teams (and Their Pipelines) Have It All

Ever spent a week wrangling log pipelines just to get your SIEM to stop screaming about missing fields? Wasted way too much time stripping out noisy events and reformatting data for analytics? You’re not the only one. If you work in Security or ITOps, you know the pain: every new data source means another round of schema headaches, more manual mapping, endless field transformations, and a quick prayer that you didn’t break something critical (or let in a flood of junk events).

Elastic achieves AWS Education ISV Partner Competency, strengthening education solutions portfolio

Advancing digital transformation in education through Search AI and cloud innovation. We’re thrilled to share that Elastic has achieved the AWS Education ISV Partner Competency. This prestigious designation recognizes Elastic as an Amazon Web Services (AWS) partner that has proven expertise in delivering high-quality solutions that help education institutions support successful student outcomes while protecting security and privacy.

Upgrade Readiness: Unlocking Success with the Splunk Health Assistant Add-On

Splunk recently announced exciting updates and significant modernizations for the upcoming releases of Splunk Enterprise and Splunk Cloud Platform. This blog is the first in a series to help prepare your organization for these changes by exploring upgrade readiness best practices. This first installment will highlight the Splunk Health Assistant Add-On, a vital tool that supplements the Splunk Enterprise Monitoring Console, designed to streamline your transition to the next version of Splunk Enterprise.

5 Tips for Managing Client Sites With Oh Dear

Managing dozens (or hundreds) of client sites can quickly become chaotic without the right tools. Whether you're running an agency, internal platform team or dev shop, visibility and control are everything. That's where Oh Dear comes in. Oh Dear is an all-in-one monitoring service that gives you a unified dashboard for uptime checks, performance monitoring, broken link detection, SSL and domain expiry alerts, scheduled task validation and more.

SentinelOne Outage: Why Early Detection and Independent Monitoring Matter

When SentinelOne, a leader in cybersecurity and endpoint protection, experienced a major outage last week, thousands of organizations were suddenly left in the dark. With SentinelOne down for hours, IT and security teams scrambled for information and updates. But there was a critical missing piece: SentinelOne has no public status page. This gap left customers frustrated, searching for answers on social media, Reddit, and unofficial channels.

Real-Time Observability with ClickHouse, Coroot, and GlassFlow

Coroot is excited to feature an editorial from GlassFlow for our first Open Source Spotlight. We hope to improve the workflow of our global community of SREs and DevOps professionals by sharing exciting projects like Glassflow, which make innovation accessible for everyone through the freedom of open source. If you have an open source or open core project you’d like to see on our blog next, send us a message!

Comparison of the Best and Most Popular NoSQL Databases

Traditional databases store data in structured tables, whereas NoSQL (non-SQL) databases use more flexible, non-tabular storage methods. NoSQL databases support a wider range of data models, including document stores, wide-column stores, key-value stores, and graphs. These databases first emerged in the late 2000s to support massive horizontal scaling and high-throughput workloads for web applications.

Inside the Wins: Real Stories of Transforming Azure Observability into Business Value

Azure environments are growing fast, and so are the challenges of monitoring them at scale. In this blog, part of our Azure Monitoring series, we look at how real ITOps and CloudOps teams are moving beyond Azure Monitor to achieve hybrid visibility, faster troubleshooting, and better business outcomes. These real-life customer stories show what’s possible when observability becomes operational. Want the full picture? Explore the rest of the series.

Best practices for end-to-end custom metrics governance

Custom metrics enable you to track what matters to your distinct business and services and correlate it with the rest of your telemetry data. As your organization grows by adding more teams, services, and environments, your volume of custom metrics can grow with it. To ensure critical visibility while maintaining cost efficiency, organizations need an end-to-end approach to custom metrics governance.

Java License Monitoring - Why you need to monitor your Java licenses and how to do so

Java license monitoring has now become an essential requirement for many organizations as Oracle’s recent licensing changes have made compliance mandatory, with increased risks of audits and higher Java licensing compliance costs. Once a free programming platform, Java now requires navigating a complex licensing framework, including employee-based models that tie costs to the size of a workforce. These changes significantly increase the risk of unbudgeted expenses for licensing violations.

Monitor OpenTelemetry-native metrics with Datadog

OpenTelemetry (OTel) is emerging as the industry standard for collecting and transmitting observability data. Datadog supports several ways to send and accept OTel-native data, while also continuing to support its own native telemetry format. To provide a consistent monitoring experience, Datadog now supports using OTel-native metrics alongside Datadog-native metrics across dashboards, queries, and core visualizations in the Datadog platform.

Operational Resilience in 2025: Meeting New Standards, Mitigating New Risks

In a world of constant disruption, operational resilience is now mission critical. From cyberattacks and misconfigurations to vendor outages and natural disasters, today’s enterprises are navigating risks that move faster and hit harder than ever before. As we enter 2025, operational resilience has evolved from a best practice to a board-level imperative.

IETF Decreased Mean Response Time by 90% with Scout APM!

The Internet Engineering Task Force (IETF) is the premier Internet standards body, developing open standards through open processes. The IETF is a large open international community of network designers, operators, vendors, and researchers concerned with the evolution of Internet architecture and the smooth operation of the Internet. The IETF standards-setting process is open to any individual interested in providing technical contributions.

Unlocking Real-Time Collaboration: Why Your Network Is the Key to Vibe Working

Lately, there has been a growing buzz around the concept of “Vibe Working,” where teams are leveraging AI to dynamically share, develop, test, and transform “fuzzy” ideas into something useful in real-time. I view this approach as one of the next significant evolutions in our professional and technological landscape. Reflecting on my own journey in technology, I’ve observed how the pace of innovation and collaboration continually reshapes our daily workflows.

Introducing our improved uptime check

For the past few months, we’ve been working on improving our uptime check. We’re proud to announce that this improved check is now available for all users. You don’t have to do anything to get it (unless you are not subscribed to Oh Dear, in which case you should subscribe to Oh Dear); all our users now have it enabled by default. In this blog post, I’d like to give an overview of the changes and some background on why we changed some things.

Optimizing the end-user experience: How to perform a browser check in Grafana Cloud Synthetic Monitoring

Synthetic monitoring is a vital practice to proactively track the health and performance of web applications. Instead of waiting for users to report problems, synthetic monitoring helps developers catch issues before they impact real users. One powerful type of synthetic monitoring is the browser check. These checks go beyond basic ping checks, simulating how a user would actually interact with your website’s interface.

How to send alerts from Grafana OSS to Grafana Cloud IRM

In March, we announced that Grafana OnCall (OSS) had entered maintenance mode. However, OnCall’s development continues in Grafana Cloud as Grafana Cloud IRM, combining on-call management and incident response into one integrated solution. Many users told us they still want to self-host Grafana and rely on Grafana Alerting to detect potential issues early—but they also need to escalate and manage incidents using an incident response management (IRM) solution.

How to send alerts from self-hosted Grafana to Grafana Cloud IRM

Learn how to send alerts from Grafana OSS or Grafana Enterprise to Grafana Cloud IRM. In this quick demo, we'll show you how to set up the integration between your self-hosted instance and our managed solution for consolidating, customizing, and automating incident response and management. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more.

Edge Data Replication: Contributions and Status Updates for InfluxDB 3

If you’ve ever stood up multiple edge InfluxDB instances in remote locations and wished you could consolidate their data into a centralized instance for analysis, you’re not alone. That’s exactly why we designed Edge Data Replication (EDR) in InfluxDB v2. Now, with InfluxDB 3 Core and 3 Enterprise, we’re seeing new ways to handle replication using the brand-new Python Processing Engine.

Your Collector, Your Rules: Introducing BYOC and the OpenTelemetry Distribution Builder

Join the live stream at 11 am ET. OpenTelemetry’s superpower has always been choice. Yet, most observability vendors still insist you run their collector. Today we’re removing that last point of friction. With Bring Your Own Collector (BYOC), Bindplane now accepts any upstream-compatible build, recognizes exactly which receivers, processors, and exporters it contains, and adapts the UI and configuration workflow on the fly.

Bindplane Launch Week 1 [June 2-6] - Day 2 - Custom OTel Collectors

The point of OpenTelemetry has been to give you a choice. Yet, most observability vendors still insist you run their collector. We’re removing that last point of friction. With Bring Your Own Collector (BYOC), Bindplane now accepts any upstream-compatible build, recognizes exactly which receivers, processors, and exporters it contains, and adapts the UI and configuration workflow on the fly. No forks, no vendor stamp—just the collector you already trust, fully managed by Bindplane.

Agentic AI: Powerful But Fragile - What You Need to Know

Just when you’d finally wrapped your head around AI, here comes its autonomous cousin, Agentic AI. Think of it as AI that doesn’t just assist, but acts. It makes decisions, handles tasks, and communicates with other systems on its own. While it’s revolutionizing supply chains and customer experiences, there’s a catch. These autonomous agents rely on a plethora of third-party services, and when one fails, everything stops.

Identifying Idle Paths in a Data Center Leaf-Spine Fabric

In leaf-spine data center networks, traffic often becomes imbalanced, leaving some uplinks idle and resulting in wasted bandwidth. Kentik helps engineers identify underutilized paths, diagnose the causes, and take corrective action using enriched telemetry, visual topology maps, and intelligent alerts, turning hidden inefficiencies into actionable insights.

Peacetime Observability: Spotting Risks Before They Become Incidents

Most of the time, nothing’s broken. Traffic’s flowing, alerts are quiet, and everything seems fine. That’s peacetime, when no one’s getting paged. Coroot helps in both peacetime and wartime. When things go wrong, it guides you to the root cause fast. But during peacetime, it helps you spot risks early, clean up inefficiencies, and prevent those incidents from happening in the first place.

Graylog vs ELK: Which Log Management Solution Fits Your Stack?

Your app logs start simple—maybe a few print() or logging.info() calls. But in production, things get noisy. Thousands of log lines per minute, scattered across services, and it’s hard to know what matters. This is when tools like Graylog and the ELK stack help. They let you collect, search, and make sense of logs, but they do it in different ways. This guide breaks down how each one handles setup, scale, and day-to-day use.

How to Fix Latency Spikes in WAN and LAN Networks

Even a few seconds of delay in your network can be the difference between closing a deal on a video call, or watching it buffer into oblivion. These delays, known as latency spikes, are unpredictable surges in the time it takes for data to travel across your network. Whether you're running a cloud-based CRM, managing VoIP calls across offices, or supporting remote teams on Microsoft Teams or Zoom, latency spikes can disrupt productivity, hinder performance, and lead to a flood of support tickets.
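
One rough way to make "latency spike" concrete is to flag any sample that sits far above a rolling baseline. The sketch below is an illustration only and is not taken from the article: the function name `find_latency_spikes`, the window of 5 samples, and the 3x threshold are all assumptions chosen for the example.

```python
from statistics import median


def find_latency_spikes(samples_ms, window=5, factor=3.0):
    """Return indices of samples that exceed `factor` times the median
    of the preceding `window` samples (a hypothetical spike rule)."""
    spikes = []
    for i in range(window, len(samples_ms)):
        # Baseline is the median of the previous `window` measurements,
        # so a single earlier spike does not skew it much.
        baseline = median(samples_ms[i - window:i])
        if baseline > 0 and samples_ms[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Round-trip times in milliseconds, with one obvious surge.
rtt_ms = [20, 22, 21, 19, 20, 250, 21, 20]
print(find_latency_spikes(rtt_ms))
```

A median baseline is used here instead of a mean so that the spike itself does not inflate the threshold for the samples that follow it.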

How to Monitor and Manage Grafana Memory

It’s late, you get an alert, and Grafana is down. The reason? It ran out of memory. If you’ve ever watched Grafana slowly eat up RAM until it just stops responding, you know how frustrating that can be. Memory can spike quickly, especially with complex dashboards and multiple data sources. This guide will help you understand what’s going on and how to keep Grafana running without surprises.

How to Set Up Tracing for Elixir Apps Using AppSignal

Over time, web applications have evolved from simple request/response-based systems into complex, distributed ones with lots of moving parts. If something goes wrong (and you can be sure it will), finding the cause can be nearly impossible. But this need not be the case: enter tracing. Tracing refers to the process of collecting detailed information about the execution of requests within an application, including function calls, execution time, and other relevant data.
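
To illustrate the idea of collecting function calls and execution time, here is a minimal, language-agnostic sketch written in Python rather than Elixir. It does not use AppSignal's API; the `traced` decorator, the in-memory `SPANS` list, and `handle_request` are hypothetical names introduced only for this example.

```python
import time
from functools import wraps

# Hypothetical in-memory store; a real tracer would export spans elsewhere.
SPANS = []


def traced(func):
    """Record the name and wall-clock duration of each call to `func`."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Runs whether the call returned or raised, so failures are traced too.
            SPANS.append({
                "name": func.__name__,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


@traced
def handle_request(n):
    # Stand-in for real request handling work.
    return sum(range(n))


result = handle_request(10)
print(result, SPANS[-1]["name"])
```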

Top five metrics to monitor in IIS Logs

When managing and troubleshooting IIS (Internet Information Services) web server performance, logs are a critical resource. They capture detailed information about every request and response so your team can detect issues quickly. Let’s walk through the main IIS log formats, explore a sample log file, and break down five key types of IIS metrics you should monitor.

NiCE DB2 Management Pack 5.40

NiCE is proud to announce the availability of the NiCE DB2 Management Pack 5.40, a new milestone in advanced monitoring and management for IBM DB2 environments. Version 5.40 introduces powerful enhancements that improve efficiency, compatibility, and ease of use. Cluster Synchronization Improvements: more accurate and efficient configuration sync across clustered deployments.

Service Level Objectives -- Customer Brown Bag -- May 29th, 2025

This technical session on Service Level Objectives (SLOs) will cover the fundamentals of SLOs, SLIs, and SLAs, along with how to define, monitor, and optimize them for system reliability. Through hands-on demonstrations, you'll learn to set up SLOs in Sumo Logic, track performance using logs, metrics, and tracing, and configure proactive alerts for incident response. By the end, you’ll have the skills to implement and manage SLOs effectively, ensuring your services meet reliability goals while balancing performance and cost.
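
As a small worked example of the arithmetic behind SLIs and SLOs, the sketch below computes an availability SLI and the share of error budget remaining. It is a generic illustration, not Sumo Logic's implementation; the function names and the treatment of zero traffic are assumptions.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # Assumption: no traffic counts as fully compliant.
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target (or zero traffic) leaves no budget to spend.
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# 10,000 requests, 5 failures, against a 99.9% availability SLO.
print(sli(9995, 10000))
print(round(error_budget_remaining(9995, 10000, 0.999), 3))
```

With a 99.9% target over 10,000 requests, about 10 failures are allowed, so 5 failures consume roughly half of the budget.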

Introducing RUM without Limits: Capture everything, keep what matters

Real User Monitoring (RUM) helps teams understand exactly how their users experience their web and mobile applications—from load times to crashes and frustration signals. But traditional RUM models come with tough trade-offs: capture all sessions and overspend, or sample data and miss what matters. Fixed sampling rates may help manage volume, but they leave dangerous blind spots.

Unify telemetry, own your pipeline: New integrations for Windows, Network Telemetry, and Cloud Storage

Today, we're expanding on the integrations front, and launching new integrations for Windows events, network telemetry, and cloud storage. Here's a quick tour of what's new and why it matters.

What Are The Top Website Monitoring Services in 2025?

Every business owner understands the importance of website monitoring. It is essential to avoid website performance and availability issues. A great start would be to examine every aspect of your web infrastructure. That's where website monitoring tools come into the picture. With website monitoring services, you can continuously observe your website's performance and uptime. These tools make you aware of any server downtime or connection issues.

Monitoring Backstage with OpenTelemetry: Closing the observability blind spot

‘One small step for a man, but a huge leap for developers’ — me, when I realised how to observe my Backstage with OpenTelemetry. Backstage is often the “portal” through which we manage all our other systems, but who watches the watcher? Recently, we gave a KubeCon Talk, highlighting that monitoring Backstage itself is critical. When Backstage isn’t observable, it becomes a blind spot in your infrastructure.

OnlineOrNot updates from May 2025

As OnlineOrNot has grown, I've been building features quickly to get them into your hands as fast as possible. However, this meant I ended up with multiple versions of similar pages that looked and worked differently from each other. This month, I focused on putting systems in place to create a consistent experience across all parts of the dashboard, making everything look and feel unified.

Hybrid IT Infrastructure Management

Today’s IT environments are rarely confined to a single data center or a single cloud provider. Enterprises are embracing a mix of cloud platforms, virtual machines, and on-premises hardware to stay agile and competitive. This blended environment is known as hybrid IT infrastructure, and managing it effectively is key to keeping systems healthy, secure, and performing at their best.

Simple cloud cost management: Grafana Labs integrates open standard FOCUS specification for cloud billing data

At Grafana Labs, we’ve always believed that observability should be open and accessible, and that belief extends beyond metrics, logs, and traces to the costs associated with managing observability at scale. That’s why we’re excited to share that we’ve adopted the FinOps Open Cost and Usage Specification (FOCUS), a community-driven, open standard for cloud billing data.

Sigma Specification 2.0: What You Need to Know

Sigma rules have become the security team equivalent of LEGO bricks and systems. With LEGO, people can build whatever they can imagine by connecting different types of bricks. With Sigma Specification 2.0 rules, security teams can create vendor-agnostic detections without being limited by proprietary log formats. In response to the Sigma rules’ popularity, the team that built them updated them in August 2024, giving security teams new capabilities.

Jaeger vs Zipkin: Which is Right for Your Distributed Tracing

When requests slow down across your microservices, tracing helps you understand where time is spent. Jaeger and Zipkin are two popular tools for distributed tracing, built to answer a simple question: where did the request go? If you're choosing between them or just exploring options, this guide breaks down the differences and when each one might be a better fit.

Prometheus Alerting Examples for Developers

Everything looks fine—dashboards are green, logs are quiet. But users start reporting slow response times. No errors, no traffic spikes. Just a general slowdown. It’s a common situation. Not all problems show up as crashes or clear failures. Sometimes, performance degrades quietly, and standard metrics don’t catch it early. But that's where Prometheus alerting can help, if you're monitoring the right signals.

Why Resilience, Not Just Visibility, Is the New Mandate

We’ve been in the war rooms. We’ve watched revenue, reputation, and trust erode in real time—not because we lacked telemetry, but because we lacked architecture. Modern enterprise systems fail because their data doesn’t think. Their tooling doesn’t remember. And their automation doesn’t know when to act—or when to stop. The answer is not more monitoring. It’s not dashboards with AI labels.