Monitoring OAuth 2.0 Client Credentials Flows in Web APIs

OAuth 2.0 client credentials flows are a core mechanism for machine-to-machine API authentication. They enable background jobs, microservices, and system integrations to securely access APIs without user interaction. However, while most teams spend time configuring these flows, far fewer ensure they are continuously monitored in production. This creates a critical blind spot: OAuth failures often surface only after dependent services begin failing.

Why High-Cardinality Metrics Break Everything

High-cardinality metrics are one of those ideas that sound obviously right - until you try to use them in production. In theory, they promise precision. Instead of averages and rollups, you get specificity: per-request, per-userid, per-container, per-feature insights. The kind of detail we all immediately want when something is on fire. And then things start breaking. Not immediately. Not loudly. But quietly.

The Python Backend Framework Decision Guide for 2026

Three frameworks dominate Python backend development in 2026: Django, FastAPI, and Flask. This guide helps you choose between them (plus specialized alternatives like Falcon, Tornado, and Litestar) using a simple decision tree. Answer three questions about your project, understand each framework's strengths, and pick the right tool for your needs.

Beep boop: How to visualize Grafana Cloud IRM alerts in the real world

You know the situation: You're in a meeting and your alerts start to go off, but no one on the other side of the camera knows why you have to abruptly drop from the call. What if, instead, you had a robot in the background of your Zoom meeting that started to blink when those same alerts went off? You could just point to it, type in the chat "I have to drop," and off you'd go.

AI-generated media: What's the point?

If you have even a minor social media presence, you've probably been unfortunate enough to come upon the wonderfully disturbing world of AI slop content. We're talking wrestling matches featuring controversial mustached historical figures and Formula One-style races featuring Stephen Hawking in his wheelchair (if you have no idea what I'm talking about, I genuinely envy you).

Zero code tracing: Kubernetes observability with Logz.io and eBPF

Distributed tracing is a core tool for operating modern microservices platforms. For SREs and DevOps teams, it is often the fastest way to understand latency issues, service dependencies, and unexpected failure modes. But achieving comprehensive tracing coverage is resource-intensive and time-consuming. It usually requires application changes, language-specific instrumentation, agent lifecycle management, and ongoing coordination with development teams.

Normalize any logs for Cloud SIEM with Datadog's OCSF processor

Security teams need visibility across every system they defend, including cloud platforms, SaaS applications, security controls, identity providers, and custom services. But those systems all produce logs in different formats, with inconsistent field names and structures. That lack of standardization makes it harder to correlate events, write reusable detections, and investigate incidents quickly.

Shorten your 'inner loop' as a new hire and get past imposter syndrome with Grafana Assistant

Let's talk about being new. Four months ago, I joined Grafana Labs as a senior solutions engineer. It wasn’t just a new company, it was a new industry. I came from the visual workspace provider Miro, where I was comfortable doing discovery and talking about visual collaboration and innovation. But stepping into observability? I was in the deep end. And let me tell you, the imposter syndrome was real. Everyone around me was fluent in this language of metrics, logs, and traces.

Episode 4 - 2025 AI Retrospective and What's Next for 2026

In this special holiday episode of The Intelligent Enterprise, host Tom Stoneman takes a step back from the day-to-day pace of enterprise life to look at where AI has been in 2025 and where it might be heading next. To do it, he sits down with his colleague VS Joshi, Global Head of Product Marketing at Digitate, for a year-end retrospective and a 2026 outlook.

5 Reasons Why Website Design is Now an Operational Concern

There was a time when website design lived entirely in the marketing department: it was all about how your brand looked, how long visitors stayed, and how credible you seemed. A beautiful site meant trust, and a bad one meant lost sales. Simple as that. But that version of "web design" doesn't exist anymore. With the rise of JavaScript-heavy frameworks, cloud infrastructure, and performance-driven SEO, design has become an operational concern.

GitHub Outage Tracker: 5 Real-Time Monitoring Methods

When GitHub goes down, everything stops. Your developers can't push code. CI/CD pipelines hang indefinitely. Pull requests pile up. Deployments freeze. And if you're like most engineering teams, you find out about it when your Slack channel explodes with "Is GitHub down for everyone?" The average GitHub outage could cost teams 2-4 hours of developer productivity. For a 50-person engineering org, that's 100-200 hours of lost work — assuming you catch the outage immediately. Most teams don't.

Top server monitoring tools for 2026: A comprehensive comparison guide

IT infrastructure is now hyper-distributed. We are in a scale-in-seconds era, which means a typical IT landscape is spread across on-premises data centers, public clouds (AWS, Azure, GCP), containerized environments, and edge locations. With more components come more points of failure. A single server outage can cascade into customer-facing incidents, SLA violations, and revenue loss measured in thousands per minute.

Unified Observability: What It Is and Why It Matters for Large Enterprises

Modern enterprises operate within a digital ecosystem of staggering complexity - spanning on-premises systems, private and public clouds, APIs, containers and SaaS platforms. Business-critical services often rely on a mix of legacy infrastructure and modern applications, each producing huge volumes of metrics, log messages, traces and events.

High Bandwidth Usage Detected - Causes, Impact, and Response

You log into your network monitoring dashboard and see the alert: “High bandwidth usage detected.” This is not just a routine message; it’s a sign that something is putting pressure on your network. Bandwidth is the backbone of modern connectivity, and when usage spikes unexpectedly, the consequences can be severe. Applications slow down, cloud costs rise, and in some cases, spikes may point to a security threat.

From Firefighting to Foresight: Bright Beginnings for a New Year of IT Confidence

When I was invited to join one of our customer’s end-of-year team wrap-up sessions, it came as no surprise when the meeting opened with a familiar refrain: “Next year will be different. Next year, we’ll get ahead of the noise. Next year, tickets won’t pile up while we’re still triaging yesterday’s issues.

Smarter Slack Alerts with Rollbar + Zapier AI

For many engineering teams, Slack is the nerve center of daily work. It’s where incidents are discussed, decisions are made, and collaboration happens in real time. But when it comes to error alerts, Slack can quickly turn from helpful to overwhelming with noisy, context-poor notifications that developers learn to ignore. By integrating Rollbar with Zapier AI, teams can transform raw error data into clear, actionable, and meaningful Slack messages, resulting in faster triage, less alert fatigue, and smoother developer workflows.

Authorization Code Flow & redirect_uri_mismatch Errors: Monitoring & Fixing

If you’ve implemented OAuth 2.0 using the Authorization Code Flow, chances are you’ve encountered the redirect_uri_mismatch error at least once. It’s one of the most common (and most misunderstood) OAuth failures teams face when integrating authentication into web applications. On paper, the error is simple. The authorization server compares the redirect URI sent in the request with the redirect URIs registered for the application.
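
The comparison described above can be sketched in a few lines. This is only an illustration that assumes exact, case-sensitive string matching with no normalization; the function name `redirect_uri_matches` and the registered URI are hypothetical and do not come from any particular authorization server.

```python
def redirect_uri_matches(requested: str, registered: list[str]) -> bool:
    """Return True only if the requested URI is exactly one of the registered URIs."""
    # Assumption: exact, case-sensitive comparison. Trailing slashes, scheme,
    # and port differences are NOT normalized away in this sketch.
    return requested in registered


# Hypothetical registration, for illustration only.
REGISTERED = ["https://app.example.com/callback"]

print(redirect_uri_matches("https://app.example.com/callback", REGISTERED))   # exact match
print(redirect_uri_matches("https://app.example.com/callback/", REGISTERED))  # trailing slash
print(redirect_uri_matches("http://app.example.com/callback", REGISTERED))    # different scheme
```

Under this assumption, only the first call passes. A trailing slash or a scheme change is enough to produce a mismatch, which is why the error is so easy to trigger when moving between environments.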

JSONPath & JSON Validation for Web API Monitoring Assertions

Most API monitoring setups still rely on a narrow definition of success: Did the endpoint respond, and did it return a 200 status code? While availability is essential, it’s no longer enough for modern, API-driven systems. In real production environments, APIs frequently return successful HTTP responses with incorrect or incomplete payloads. Authentication endpoints may issue tokens missing required fields. Business-critical APIs may return empty objects instead of valid data.
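
As a rough illustration of a payload-level assertion, the sketch below checks a token response body rather than just its status code. It uses only the standard `json` module instead of a JSONPath library. The function name `check_token_response` is hypothetical, and the required field names are an assumption based on the usual OAuth 2.0 token response, so adjust them to the API being monitored.

```python
import json

# Assumed field names, following the common OAuth 2.0 token response shape.
REQUIRED_FIELDS = ("access_token", "token_type", "expires_in")


def check_token_response(status_code: int, body: str) -> list[str]:
    """Return a list of problems found; an empty list means the check passed."""
    if status_code != 200:
        return [f"unexpected status {status_code}"]
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return ["body is not a JSON object"]
    # A field counts as missing when it is absent, null, or an empty string.
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if payload.get(name) in (None, "")
    ]


good = '{"access_token": "abc", "token_type": "Bearer", "expires_in": 3600}'
print(check_token_response(200, good))  # no problems reported
print(check_token_response(200, "{}"))  # 200 OK, but every required field is missing
```

The second call is the case a status-only monitor would miss: the endpoint answers with 200, yet the payload is unusable.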

Application Monitoring 101: How to Correlate Average Response Time With Other Metrics

Average response time has become the default metric on many dashboards. It's easy to compute, easy to explain, and provides a single number to track over time. Of all the metrics available in application monitoring, this one feels closest to the actual user experience. But this simplicity can create a trap if you treat the average as a complete picture of system health. In fact, it’s really the starting point for a deeper investigation.
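
A small worked example shows why the average alone can mislead. The response times below are made up for illustration: most requests are fast and a few are very slow. The sketch compares the mean with the median and the 99th percentile using Python's standard `statistics` module.

```python
import statistics

# Hypothetical response times in milliseconds: 95 fast requests and 5 slow ones.
samples = [100] * 95 + [3000] * 5

mean_ms = statistics.mean(samples)
median_ms = statistics.median(samples)
# quantiles(n=100) returns the 99 percentile cut points; index 98 is the 99th.
p99_ms = statistics.quantiles(samples, n=100)[98]

print(f"mean={mean_ms} ms, median={median_ms} ms, p99={p99_ms} ms")
```

With this data the mean is 245 ms, a value no request actually experienced: the typical request took 100 ms, while the slowest 5% took 3000 ms. Reading the average next to the median and a high percentile makes that split visible.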

Observability for Feature Flags

Some of your users are having a party; dancing away, having a great time. But a couple of users are stuck outside in the rain, knocking on the door, trying to get in. Unfortunately, you can’t hear them because of all the noise happening inside. That’s what it feels like when you gradually roll out new features across your user base without the right monitoring.

The Year in Making - Fabrix.ai 2025: From CloudFabrix to Agentic AI Leadership

Just as NASA’s Artemis II mission represents humanity’s return to the Moon after more than 50 years, marking a pivotal moment in space exploration, Fabrix.ai has embarked on its own transformative journey in 2025. Artemis II, targeted for launch in February 2026, completed its crucial countdown demonstration test in December 2025, symbolizing humanity’s readiness to venture beyond Earth for deep space exploration and eventually return to the lunar surface.
Sponsored Post

2026: The Year Agentic AI Disrupts Observability, Security, and Enterprise SaaS

The enterprise technology market is at an inflection point. 2026 will be the year agentic AI fundamentally disrupts how organizations approach observability, security, and IT automation. The traditional SaaS model—with its sprawling ecosystem of disconnected point solutions—is collapsing under complexity. What’s replacing it is a consolidated platform layer powered by autonomous agents that operate across systems, consolidate data, and execute workflows autonomously.

Online HTTP Clients vs Web API Monitoring: When Each Makes Sense

When teams talk about online HTTP clients, they’re usually referring to quick, browser-based ways to send requests, especially HTTP POST requests, without standing up local tooling or infrastructure. These tools are popular for good reason. They make it easy to submit payloads, test headers, and inspect responses in real time. For developers, QA engineers, and DevOps teams, they’re often the fastest way to answer a simple question: Does this request work?

Monitoring JWT Tokens & OAuth Token Endpoints: How to Catch Authentication Failures Before APIs Break

Modern APIs rarely fail because the application logic is down. More often, they fail because authentication breaks upstream, silently. OAuth token endpoints and JWT-based authentication sit at the front of nearly every protected API. When they degrade, misconfigure, or stop issuing valid tokens, every dependent API call fails, even if the API itself is healthy.

Monitoring OAuth 2.0 & Secure Web API Authentication Flows

OAuth 2.0 is often treated as a solved security problem; configured once, then forgotten. In reality, OAuth-based authentication is one of the most fragile dependencies in modern API ecosystems. When OAuth breaks, APIs don’t just degrade gracefully; they often fail completely. For DevOps and engineering teams, OAuth 2.0 authentication sits before application logic, before business rules, and before observability inside the service itself.

Introducing Dossinth AI (formerly Silicon Sage)

The AI saga continues. I feel like every business felt like they had to add some AI into their product to stay relevant, but reality shows us that's not the case. Same thing at Monitive. Not looking to add AI to our service until it proves its usefulness. However, using it as an internal tool is all doable. I am still not 100% convinced of its usefulness, but I am hoping to radically improve my relationship with our customers, and - thus - the service itself.

API Testing vs Web API Monitoring: Postman, Online Tools, and WebView

APIs sit at the core of modern applications. They power mobile apps, connect microservices, and enable third-party integrations, making them critical to performance, reliability, and revenue. That’s why most teams invest heavily in API testing tools like Postman, automated test suites, and online API testers. And yet, production outages still happen. This disconnect (“our APIs were tested, so why did they fail?”) is where confusion between API testing and Web API monitoring begins.

VictoriaMetrics Anomaly Detection: 2025 Roadmap & Features (vmanomaly)

Discover the latest advancements in AI-driven monitoring with VictoriaMetrics. Fred Navruzov, Lead of the Anomaly Detection team, presents a comprehensive year-in-review for vmanomaly (part of the VictoriaMetrics Enterprise suite). This session dives into how we are making machine learning more accessible for SREs through new interactive tools and protocol integrations. Key Highlights: 2025 Recap: A look back at the major releases and improvements in vmanomaly. Interactive Playgrounds: A demo of our new environment for testing anomaly detection models before deployment. MCP Server Integration.

From Postman Collections to 24/7 Web API Monitoring (Step-by-Step)

Postman API test automation is a critical part of modern API development. Teams rely on Postman collections, scripts, and automated tests to validate endpoints, catch functional issues early, and ensure APIs behave correctly during development and CI/CD pipelines. But as APIs move into production, test automation alone leaves important gaps.

A 2026 essential features checklist for choosing the best synthetic monitoring tools

Advanced synthetic monitoring has become essential for any strong digital plan, not simply a nice-to-have feature. The reason is that users expect websites and apps to work well all the time. As we approach 2026, the top synthetic monitoring solutions have greatly improved, going from just checking if a service is online to ensuring a complete and positive digital experience.

Engineering robust monitoring scripts using advanced synthetic monitoring software

Synthetic monitoring has evolved from simple uptime checks into a complex technical field in modern digital operations. The real challenge for organizations that use synthetic monitoring software isn’t implementing the monitoring; it’s writing scripts that stay accurate, simple to maintain, and resistant to changes in the application.

What is high cardinality, and is it as scary as people make it out to be?

Dawid Dębowski is a software engineer at G2A.COM and a Grafana Champion. Holding an MS of Computer Science, Dawid’s main fields of interest related to observability are PromQL and data visualizations using Grafana. If you’ve ever worked with custom metrics in a Prometheus environment, you've probably heard about something called "high cardinality"—or at least I hope you have.

How Browser Hijackers Impact Enterprise Observability and Monitoring Tools

The browser is an essential component for enterprise execution. Given the browser's importance, observability relies on accurate, trustworthy telemetry. Browser hijackers are a dangerous threat because they operate below the radar and introduce operational risks that undermine monitoring reliability, degrade signal quality, and affect decision-making and telemetry across an enterprise's ecosystem.

Top Tips for Building a Knowledge-Sharing Culture

Top Tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we’re tackling a critical challenge for modern organizations: creating a culture where knowledge flows freely. Let’s be honest—a team that doesn’t share knowledge eventually hits a wall. In today’s fast-paced environment, keeping information siloed only leads to slow decisions, repeated errors, and missed chances to improve.

How Advanced Synthetic Monitoring Ensures Compliance and 24/7 Uptime in Financial Services

Advanced synthetic monitoring has gone from being a technical convenience to a regulatory and operational necessity in today’s dynamic financial services ecosystem, where there is no scope for error. Traditional uptime testing just checks to determine if systems are accessible.

Site24x7 wrapped 2025: A year of growing together

The business world doesn’t expect its best gifts to be wrapped in ribbons. Gifts come in the form of quiet moments when everything you’ve invested in and count on comes through. For example, when you leave the office knowing your systems are in good hands. That's what 2025 was about for us at Site24x7. We evolved towards building the kind of observability that doesn’t just feel like surveillance, but more like consistent support by your side.

Simplify hybrid network monitoring with OpManager Plus

Enterprise networks have evolved from simple on-premises networks into sprawling ecosystems of on-premises infrastructure, cloud platforms, virtual machines, and containerized applications. Organizations are embracing hybrid and multi-cloud strategies to gain agility, scalability, and resilience. But with that evolution comes greater complexity and a new set of challenges for IT teams in maintaining the performance of hybrid networks.

Top Synthetic Monitoring Solutions for Enterprise DevOps Teams

Legacy monitoring creates dangerous visibility gaps in the accelerated enterprise DevOps landscape, where release cycles count in hours, not weeks. For teams managing hundreds of microservices, complex cloud-native architectures, and global user bases, basic synthetic monitoring tools simply cannot scale. The top synthetic monitoring solutions for enterprise DevOps must function not as mere observability tools, but as proactive, integrated safety nets engineered for scale, security, and precision.

How to perform HTTP checks in Grafana Cloud Synthetic Monitoring

Your users should not be the first to know when your application goes down. When HTTP endpoints fail or respond sluggishly, users experience timeouts, connection errors, and degraded performance — often without clear indication of the root cause. This is where HTTP checks in Grafana Cloud Synthetic Monitoring come in, allowing you to proactively monitor your endpoints, verify they're online, measure response times, and ensure they're returning the correct status codes.

Heartbeat behind the metrics | The people behind Site24x7

The best products aren't built in isolation—they're built in conversation. This video brings together voices from across the Site24x7 leadership team for an honest conversation about what it takes to build an observability platform that teams rely on every single day. You'll hear from our product managers and support team heads who've spent years thinking about one persistent question: how do you turn data into clarity when systems continue to become more complex?

Why Use a Purpose-Built Time Series Database

A time series database has a straightforward definition: it’s a database purpose-built for efficiently ingesting, storing, and querying time series data. Time series data is any data with a timestamp, collected regularly or periodically, that you’ll often visualize on graphs where the X-axis is time. This definition doesn’t quite tell you what sets it apart from other types of databases, though.

2025: The year of the global cloud outage

StatusGator has been monitoring the world’s cloud services for more than 10 years now. We’ve seen outages, big and small, affect companies of all sizes for more than a decade. Yet as we close out 2025, it feels like the last 12 months brought us some of the biggest outages in the history of the internet. In fact, by our data, this is true! Never before in history have so many huge outages taken down so much of the internet, in such a short time.

Blameless Postmortem: Foundation of Site Reliability

When systems fail, the instinct to find someone to blame runs deep. But what if assigning fault actually makes your systems less reliable? A blameless postmortem culture transforms how teams learn from incidents, creating stronger systems and more effective incident response processes.
Sponsored Post

Avantra 25.2: Enhancing Security and Reducing Complexity in Hybrid SAP Landscapes

I am pleased to announce the release of Avantra 25.2! While 25.2 is a service release focused on software stability, it introduces several powerful new features designed to streamline SAP automation and improve operational resilience. Let's break down the key deliverables and benefits for Avantra users in this release.

Is Northern Virginia Still the Least Reliable AWS Region in 2025? We Analyzed the Data

This updated analysis is based on StatusGator outage data collected from January 1 to December 9, 2025. We decided to review our AWS analysis of outages in 2022 due to several new AWS incidents, especially another widely discussed AWS outage in us-east-1 (N. Virginia) that occurred on October 20, 2025. We’ve expanded the report with fresh 2025 regional data as well as a new breakdown of affected AWS services.

Why DPDP compliance must include network configuration governance

India’s Digital Personal Data Protection (DPDP) Act places accountability on how organizations collect, process, and store personal data to help organizations stay steps ahead of threat actors. Forrester’s CIO roadmap highlights a clear shift: compliance is no longer limited to policies and consent workflows. CIOs must extend governance deeper into the technology stack, including infrastructure that directly impacts data security.

Empowering IT teams: Site24x7's mobile app updates in 2025

Present-day IT structures require flexibility, speed, and accessibility. This year marked a significant milestone for Site24x7 as we launched our enhanced mobile application, transforming how IT teams manage their infrastructure while on the go. Whether you're responding to critical alerts during your commute or reviewing performance metrics between meetings, Site24x7's mobile app puts its entire suite of monitoring capabilities directly in your hands.

Powering modern IT with a smarter observability platform

Since its inception, the Site24x7 platform has been the central pillar of monitoring. In 2025, it evolved beyond monitoring to become a comprehensive decision-making layer for modern IT operations. With a strong focus on usability, intelligence, governance, and scalability, this year’s enhancements were designed to help teams see clearly, act decisively, and plan confidently for the future.

Component statuses: Now in the API

The StatusGator API continues to expand with new endpoints to help support the wide variety of use cases our customers have. We just released two new APIs. In case you missed it, component filtering is one of StatusGator’s most important features, allowing you to filter your service monitor to just the specific products, regions, or features you use. It’s an essential setup step that helps minimize noise.

Driving AI ROI: How Datadog connects cost, performance, and infrastructure so you can scale responsibly

AI innovation has accelerated faster than most organizations’ ability to monitor and manage it. The shift from experimentation to production-scale workloads has driven a new class of operational challenges: rising GPU costs, opaque model performance, and the difficulty of linking spend to business value. As AI investments grow, executives need a unified way to measure efficiency and return without slowing down innovation.

How to Integrate App Synthetic Monitoring into Your CI/CD Pipeline for Flawless Deployments

In today’s age of continuous delivery, a failed deployment or a drop in performance can affect thousands of users in just a few minutes. Traditional testing happens before deployment, but what about after the code is live? This is where app synthetic monitoring becomes a critical part of your CI/CD pipeline. Integrating synthetic monitoring into CI/CD transforms your pipeline from a simple delivery mechanism into a proactive quality and performance gatekeeper.

2026 Observability Predictions: What Lies Ahead?

What remains of the 2025 AI hype? After a year of “AI will fix everything” promises, engineering teams in 2025 hit a wall of reality: AI is a tool, not a magic bullet. We’re now seeing a more practical approach: identifying broken workflows and tasks where AI can help and leveraging AI strengths like data analysis at speed and scale to derive meaningful, valuable insights. Looking ahead, 2026 will reward organizations that combine AI innovation with a practical approach.

Part 3: What If IT Stopped Reacting to Incidents and Started Predicting Them?

Enterprises are experiencing a turning point. Systems scale faster than teams can, AI is rewriting the rhythms of operations, and the cost of downtime grows heavier every quarter. In this new landscape, reacting is no longer enough. Teams need foresight. They need to get ahead of the issue. They need a different model entirely. This third installment centers on a simple but transformative idea. What if IT operations could finally step out of reaction mode and move into anticipation?

Detect, diagnose, and resolve network issues easily with CNM Network Health

In many organizations, developers, SREs, network engineers, and security teams work in specialized domains, which can make it hard to establish a shared view of network health. As a result, engineers often struggle to determine when a network problem that originates outside of their domain of expertise is the root cause of an incident. This lack of visibility slows investigations and delays remediation.

Sampled analysis of 10 billion spans with Coralogix highlight comparison

The CNCF reported that between 39% and 56% of organizations surveyed are now ingesting traces as part of their observability strategy. Tracing has become a cornerstone of any modern observability operation. Customers are regularly handling tens of billions of spans every day, but with billions of spans, how can teams quickly figure out what is changing, what’s breaking, or what’s slowing down?

Introducing Real-Time Conversations with Netdata AI

Over the past few months, we’ve seen incredible adoption of our AI Investigations and Insights reports. Teams are using them to automate the deep, thoughtful analysis required for complex post-mortems, capacity planning, and performance optimization. These comprehensive reports are fantastic when you need a well-researched, shareable document. But what about the moments during an investigation?

Grafana community dashboards: Memorable use cases of 2025

Every year, Grafana dashboards surface in new corners of the world. And this year, they even reached beyond this world—helping one team land on the moon and another monitor the planet’s health with orbiting satellites. Meanwhile, back here on Earth, the community used Grafana to track everything from wind turbines and wastewater to March Madness and Taylor Swift’s worldwide tour. Here’s a look back at some of the most memorable Grafana community dashboards of 2025.

Drive business outcomes with Unit Economics in Datadog Cloud Cost Management

See how Datadog turns cloud usage and performance data into actionable business insights by helping teams calculate unit economics to measure and optimize the efficiency of every service. Datadog bridges the gap between cloud costs and business value, helping organizations get the most value out of their cloud investment.

From Waste to Asset: Transforming Inefficient Systems into Strategic Business Power

Is your technology working for you or against you? For many business leaders, the answer feels obvious. You see the symptoms every day: frequent downtime, slow performance that grinds productivity to a halt, and a constant stream of frustrating disruptions that pull your team away from their real work. These aren't just minor annoyances; they are significant financial liabilities.

CloudSpend in 2025: Making cloud cost management easier at scale

In 2025, cloud environments became more distributed, and cloud costs followed suit. Managing spend across multiple providers, teams, and business units required a more deliberate, governed approach, because visibility alone was no longer enough. Organizations needed clearer ownership, better structure, and tools that could scale alongside their cloud usage.
Sponsored Post

Cloud Outages Are Rising: How Early Signals Help IT Teams Respond Faster in 2026

Cloud outages used to be rare, headline-making events. Today, they're part of the daily reality of running digital operations. Whether triggered by a configuration error, network routing issue, API failure, or global infrastructure disruption, cloud incidents now occur frequently, propagate quickly, and affect more services than ever before. In 2025, one trend has become undeniable: Teams that detect cloud outages early experience less downtime, respond faster to incidents, and avoid unnecessary internal chaos.

Digital Risk Analyzer 2025: Digital security fortified

As digital adoption accelerates, the attack surface expands just as rapidly. Digital Risk Analyzer consistently evolves to secure your digital frontiers, and 2025 is no exception. This year-end recap highlights the key enhancements introduced in past months and how we continue to deliver tangible value to end users.

Why OTT platforms crash and what it teaches us about traffic surges

Minutes after the newest episodes of a beloved series dropped, a well-known streaming OTT (over-the-top) platform crashed. The impact was instant: streams wouldn’t load, logins failed, and users across regions started refreshing their screens, wondering if the issue was on their end. Outages like this don’t often happen, especially for an engineered and distributed platform—which is precisely why this incident caught attention.

Faster resolution, better outcomes: Site24x7's digital experience monitoring innovation recap

Another year, another leap forward in digital experience monitoring. As we wrap up 2025, we're thrilled to reflect on the transformative capabilities we've brought to Site24x7—innovations designed around one core mission: helping you deliver a flawless user journey worldwide. This year, we focused on closing visibility gaps, eliminating blind spots, and putting effective insights directly into your hands.

New Option: Preserve URL Casing

Most web servers treat URLs as case-insensitive. A request to /About-Us lands on the same page as /about-us or /ABOUT-US. So when Request Metrics captures your traffic, we normalize all URLs to lowercase to prevent these duplicates from cluttering your reports. But not every system works that way. Some web frameworks (looking at you, Node and Python) treat URL casing as meaningful. /User/Profile and /user/profile might be completely different routes.
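
The effect of the two behaviors can be sketched as follows. This is not Request Metrics' implementation; the function `count_paths` and its `preserve_case` flag are hypothetical names used only to show how lowercasing merges some paths that a case-sensitive framework would treat as distinct routes.

```python
from collections import Counter


def count_paths(paths: list[str], preserve_case: bool = False) -> Counter:
    """Count hits per URL path, lowercasing first unless preserve_case is set."""
    return Counter(p if preserve_case else p.lower() for p in paths)


# Example traffic using the paths mentioned above.
hits = ["/About-Us", "/about-us", "/ABOUT-US", "/User/Profile", "/user/profile"]

print(count_paths(hits))                      # two entries: casing variants merged
print(count_paths(hits, preserve_case=True))  # five entries: every variant kept
```

With normalization, the three About-Us variants collapse into a single row, which is the desired deduplication. The two profile paths collapse as well, which is wrong if the framework routes them differently, and that is the case the preserve-casing option addresses.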

Get started with Grafana Alerting: Link alerts to visualizations

In this tutorial you will learn how to link alert rules to time series panels for better visualization. Don't miss the rest of the "Get started with Grafana Alerting" series! Each part dives into a different feature to help you get the most out of alerting in Grafana.

How microservice architectures have shaped the usage of database technologies

In the late 2000s, the big question in database design was SQL or NoSQL. While relational databases had long held their ground, document and key-value stores were emerging as serious alternatives. Many predicted a zero-sum, winner-take-all outcome. But when we look at how organizations are using database technologies today, no single tool or category has dominated the landscape.

Securing customer logins with breach intelligence

Account takeovers (ATOs) are one of the most common threats facing online platforms. Attackers buy leaked usernames and passwords on underground markets then test them at scale across websites, hoping that password reuse will give them easy access. Today, ATOs have grown so sophisticated and fast-moving that manual incident response often can’t keep pace, requiring intelligent defense systems for detecting compromised credentials and preventing misuse at scale.

The Ultimate Blueprint for Successful Synthetic Monitoring Implementation

In today’s digital world, the performance of websites and apps has a direct effect on sales, customer satisfaction, and brand reputation. Synthetic performance monitoring provides the proactive intelligence needed to ensure your application is always performing optimally. By simulating real user interactions from global locations before issues affect actual users, you transform from reactive problem-solving to proactive performance excellence.

Top 3 Trends Defining Network Observability in 2026

As we enter 2026, the dust has settled on the initial explosion of hybrid work and cloud adoption. The "new normal" is no longer new; it is simply operations as usual. However, the tools we use to manage this ecosystem are undergoing a massive correction. The fragmented, tool-sprawl approach of the early 2020s is proving unsustainable in the face of growing network complexity. Network operations teams are no longer looking for more data; they are looking for better answers.

Reducing OpenTelemetry Bundle Size in Browser Frontend

When I was building applications, I always relied on the DevTools console of my web browser to examine logs in the frontend. But UI log messages are only accessible within your browser rather than forwarded to a file somewhere, which is the common pattern with backend services, so losing visibility of them when triaging user issues was a real dilemma.

Simplifying Microsoft Sentinel Integration: VirtualMetric DataStream Connectors in Content Hub

Microsoft Sentinel adoption often introduces unexpected complexity. While the platform delivers powerful SIEM and XDR capabilities, organizations frequently struggle with manual DCR configuration, inconsistent data quality, rising ingestion costs, and security risks associated with credential-based integrations. VirtualMetric DataStream is now available in the Microsoft Sentinel Content Hub, reducing the effort required to deploy normalized and cost-optimized data ingestion.

IoT Sensor Data into Graylog: A Lab Guide

Graylog has always been associated with log management, metrics, SIEM and security monitoring—but it’s also a great tool for creative, low-cost experiments in a home lab. I wanted to use it for real-world sensor data, so I built a DIY temperature and humidity monitor using an ESP-WROOM-32 development board and a DHT22 sensor.

How to Audit AI-Written Pull Requests Without Burning Out

If it feels like your GitHub notifications are a targeted DDoS attack on your brain, you aren't imagining it. Data from GitHub's Octoverse 2025 report shows an average of 43.2 million pull requests merged every month, a 23% jump from just a year ago. This surge in activity coincides with the widespread adoption of AI tools to write code. The temptation to just click "Approve" on a well-formatted AI-written pull request is higher than ever.

Calm Under Pressure: Ending the Year Without the Fire Drills

From the outside looking in, I have seen that year end in financial services is not for the faint-hearted. Markets tighten, trading volumes swell, payment systems hit their annual peak, and regulatory reporting deadlines stack up like dominoes. In this environment, even a few seconds of lag can mean missed trades, delayed transactions, frustrated clients, or worse, financial loss and reputational damage. This is precisely when IT needs to be at its calmest.

Configuration as Intelligence: The New Operating System of Resilience

Modern IT operations live in constant flux. New tools appear, workloads shift to the cloud, architectures fragment, and every device, application, and user brings its own update rhythm. In this state of constant motion, reliability isn’t a static condition; it’s a dynamic discipline. For years, organizations have relied on observability and monitoring to keep systems running. But those tools only tell half the story.

5 Best Proxies for Geo-Targeted Campaign Monitoring (2026)

Geo-targeted campaign monitoring is only as accurate as the vantage point you test from. If your checks always come from the same IP range, country, or data center footprint, you'll miss the exact issues you're trying to catch: region-specific ad delivery problems, localized SERP differences, mismatched landing pages, price inconsistencies, affiliate compliance issues, and unnecessary blocks triggered by anti-bot systems.

A Deep Dive into Synthetic API Monitoring

Consider this scenario: Your mobile app shows “Network Error” to 30% of users. Your dashboard shows that all of your servers are green. Your support team is overloaded. After four hours of feverish searching, you discover the issue: one of your 47 microservices is responding with a 200 OK status but returning malformed JSON that crashes client applications.
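The failure in this scenario passes a status-code check but fails a body check. The sketch below shows the kind of assertion a synthetic API check could make, assuming a JSON object response; `check_response` and the required key `id` are hypothetical names used only for illustration.

```python
import json


def check_response(status: int, body: str, required_keys=("id",)) -> list[str]:
    """Return a list of problems found in an API response (empty means healthy)."""
    problems = []
    if status != 200:
        problems.append(f"unexpected status {status}")
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        # A 200 OK with an unparseable body is still a failure for clients.
        problems.append(f"malformed JSON: {exc.msg}")
        return problems
    if not isinstance(payload, dict):
        problems.append("payload is not a JSON object")
        return problems
    for key in required_keys:
        if key not in payload:
            problems.append(f"missing key {key!r}")
    return problems


healthy = check_response(200, '{"id": 1}')
truncated = check_response(200, '{"id": 1')
incomplete = check_response(200, '{"name": "x"}')
```

A monitor that only looks at the status code would report all three responses as green; this check flags the second and third.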

How to Monitor SSL Certificate Expiration - Complete 2025 Guide

Keeping your website secure is essential. One of the simplest yet most overlooked ways to protect it is monitoring SSL certificate expiration. Many website owners do not realise how quickly an SSL certificate can expire, and an expired certificate can be damaging for the website. Imagine waking up in the morning to learn that a red “Not SECURE” warning is being shown to your visitors. That creates a bad impression on your audience.
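For readers who want to see what such a check involves, here is a minimal sketch using only Python's standard library. The helper names `fetch_not_after` and `days_until_expiry` are hypothetical. The first needs network access and is shown for context only; the second works on the certificate's `notAfter` string.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to the host and return its certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the certificate expiry."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


# Offline example with a fixed clock and a made-up expiry date.
now = datetime(2025, 12, 1, tzinfo=timezone.utc)
remaining = days_until_expiry("Dec 31 00:00:00 2025 GMT", now)
```

A scheduled job could call `fetch_not_after` for each domain and raise an alert when the remaining days drop below a chosen threshold; the threshold itself is a policy decision the excerpt does not specify.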

The Ultimate Guide to Kafka Monitoring Best Practices, Metrics, and Tools

If you’re operating modern, data-driven applications—which, let’s face it, you likely are—Kafka serves as the central streaming platform, delivering data in real-time. It’s impressive, extremely fast, and exceptionally powerful for achieving high throughput and scalability. But here’s the catch: with significant power comes the need for vigilant oversight. Neglecting your Kafka environment is like driving a racecar with your eyes closed. It’s bound to end badly.

Microsoft Teams outage on December 19, 2025

On December 19, 2025, Microsoft Teams experienced a performance degradation that affected communication for various users. Despite a significant volume of reports from the community, official health dashboards remained in a normal status throughout the event. This incident serves as a case study for why IT teams benefit from secondary monitoring sources.

Looking back at 2025: Innovations that shaped DevOps and observability

The year 2025 has been exciting for Site24x7, packed with innovations designed to make monitoring smarter, faster, and more intuitive. From enhanced APM insights and deeper database observability to a more powerful log management experience and AI-driven plugin enhancements, we’ve focused on giving teams the tools they need to troubleshoot faster, gain clearer insights, and manage complex environments with ease. Let’s rewind and see our 2025 highlights.

ManageEngine recognized in the 2025 Gartner® Magic Quadrant for Digital Experience Monitoring!

ManageEngine has been recognized in the Gartner® Magic Quadrant™ for Digital Experience Monitoring (DEM), affirming our commitment to delivering superior enterprise service and user experience.

Five Worthy Reads: Greener IT Starts Here: How AI Is Transforming Operational Efficiency

Five worthy reads is a regular column on five noteworthy items we’ve discovered while researching trending and timeless topics. This week, we learn more about how AI transforms sustainability in the enterprise. As enterprises accelerate their digital journeys, IT teams are under pressure to deliver faster, smarter, and more sustainable operations.
Sponsored Post

Free Versus Paid Monitoring Tools

Choosing the right monitoring strategy is critical in today's hybrid IT environments. This whitepaper explores open-source, commercial, and hybrid approaches through real-world scenarios, highlighting trade-offs in cost, flexibility, compliance, and operational efficiency. Learn how organizations of all sizes optimize observability, integrate legacy and cloud-native systems, and scale monitoring with confidence.

Elastic and Google Cloud's powerful partnership in 2025

In 2025, Elastic and Google Cloud created a powerhouse of AI-driven insights, providing an end-to-end search, observability, and security journey for our joint customers. We continue to partner on many opportunities for success and have made even further progress this year to empower all our users, especially around generative AI (GenAI). This blog highlights our collaboration with Google Cloud to help you harness the power of data at scale as well as our top moments from Google Cloud Next ‘25.

Synthetic Monitoring & WooCommerce: Detecting Hidden Failures

WooCommerce powers a massive portion of the internet’s commerce layer, largely because it looks simple. Install a plugin, connect Stripe, choose a theme, and suddenly WordPress becomes a store. That perceived simplicity is also what makes WooCommerce fragile in production. WooCommerce stores are not single systems.

Accelerating Sentinel data lake deployment | Webinar | VirtualMetric & Microsoft

Microsoft Sentinel data lake is becoming a core component of modern security architectures. In this on-demand webinar, Microsoft and VirtualMetric discuss how security teams can approach Sentinel data lake adoption to improve visibility, control cost, and prepare their data for AI-driven security workflows.

A FinOps engineer's guide to governing custom metrics

This guest blog post is authored by Dieter Matzion, a seasoned cloud practitioner who has operated exclusively in public cloud environments since 2013, with experience at leading technology companies including Google, Netflix, Intuit, and Roku. Custom metrics play a crucial role in enabling teams to monitor their applications and businesses. The flexibility of these metrics allows engineers to measure what matters most to their domain.

Turning errors into product insight: How early-stage teams can connect engineering data to user impact

Early-stage engineering teams ship fast and learn in production. While speed is a competitive advantage, it can also lead to a high volume of noisy signals, like stack traces, timeouts, and dashboards full of red. Some of those problems can affect your users and revenue, but many don’t.

Why You Need "Always-On" Website Tracking This Holiday Season

Holiday shoppers are notoriously impatient, and in 2025 they have even less patience for slow websites. Keywords like “website downtime tracking” and “ecommerce site reliability” are often trending because businesses are realizing that slow is the new down. This holiday season, the goal is to safeguard your website against business-critical slowdowns without adding “manual monitoring” to your already busy plate.

How to share and analyze survey data (or other business metrics) in Grafana

Our annual Observability Survey provides some great insights on the state of the industry and all things observability. And for the third edition of the survey, published last March, we wanted to bring the results into a Grafana dashboard, not just because we could, but because it was quite a nice way to interact with the data. After all, Grafana isn't just for IT observability. You can use it to monitor everything from BI data to lunar landings to pet pythons, and now survey data.

Site24x7 2025 wrapped: Full-stack monitoring scale

2025 was Site24x7’s year of scale and intelligence. We stood by IT teams everywhere, keeping their world running with visibility that never misses a beat. We’ve evolved monitoring into a heartbeat, one that pulses with the rhythm of your business. AI-powered. Built for the future. Security built IN, not AROUND. Full-stack visibility that scales with you. At Site24x7, we’re committed to redefining what observability means for modern IT operations — unifying metrics, traces, logs, and real-user insights across clouds, containers, networks, and applications.

VictoriaMetrics 2025 Developer Experience: A Year in Review

2025 was a landmark year for VictoriaMetrics — defined not only by product improvements, new capabilities, and wider adoption, but by a strong and consistent presence across the global open-source and cloud-native ecosystem. Our mission has always been clear: to build open-source monitoring and observability solutions that are simple, reliable, and efficient for metrics, logs, and traces.

Spotify outage on December 17, 2025

On December 15, 2025, Spotify experienced a widespread outage that disrupted playback, logins, and app functionality for users around the world. While Spotify’s official status page remained silent throughout the incident, StatusGator detected the problem early using real user signals and issued an Early Warning Signal within minutes.

Reflecting on a year of smarter network monitoring: 2025

This year, the world leaned heavily on words like reimagine, rebuild, renew, reshape, and reinvent, and the same spirit defined our journey. As promised last year, we reimagined key capabilities, reshaped workflows, and restructured critical parts of our network monitoring tool to meet modern demands. At the same time, we reinforced the core foundation you've trusted for more than a decade: delivering reliable, usable features with uncompromising security.

Elevate your MSP operations: Key Site24x7 features you shouldn't miss in 2025

Managing multiple customer accounts as an MSP can be overwhelming. With the constant demands of configuring monitors, generating reports, and maintaining security across numerous customers, efficiency becomes critical. Throughout this year, we've focused on making your life easier with powerful new features that automate repetitive tasks, enhance security, and give you better visibility into customer health.

Notes from the Field: Migrating from VMware to XenServer

The customer was already using Citrix Provisioning Services (PVS) to deliver Virtual Delivery Agent (VDA) machines. Rather than attempting to migrate existing VMware-based VDAs, which often introduces driver conflicts and legacy dependencies, we followed a proven best-practice approach. We provisioned new VDA machines directly on XenServer using the PVS Virtual Desktop Setup Wizard. This ensured clean builds, free from VMware-specific components, and fully optimized for the XenServer platform.

Confessions of a software engineer who enjoyed being paged at 5am

It’s 5:14am, and I wake up to the squawking geese sound of my PagerDuty alert (anyone else have this sound? No?). I’m four months into working for my new team as a junior software engineer, and this is my first time being paged in the middle of the night. Most software engineers probably dread this moment, but I kind of love it. Agile ceremonies and Jira tickets suddenly don’t matter, and you’re fully focussed on stopping a customer-impacting fire.

Day 2 with Cilium: Small configurations that keep large clusters boring

Operating Cilium at a small scale is straightforward. You install the Helm chart, choose a routing mode, and apply a few network policies. Day 1 is about getting packets to flow. Day 2 is about keeping them boring. At Datadog, we run Cilium across hundreds of Kubernetes clusters, tens of thousands of nodes, and hundreds of thousands of pods in multiple clouds. When operating at this scale, small configuration choices stop being minor details and start becoming risk multipliers.

Elastic at AWS re:Invent: Concluding a year of partnership in agentic AI innovation

Highlights of another laudable year of customer-centric collaboration. The integration of Elastic’s capabilities, including vector databases and context engineering, with AWS services helps customers build intelligent, scalable, and secure applications faster and with greater flexibility. Our ongoing collaboration has resulted in another year of notable innovation with AWS. This blog highlights our continued collaboration with AWS throughout 2025 to help you capitalize on the power of AI.

Python memory profiling: Common pitfalls and how to avoid them

Continuous profiling has established itself as a core observability practice, so much so that we’ve referred to it as the fourth pillar of observability. But despite the capabilities and growing adoption of continuous profiling, it can still be confusing to approach profiling as a newcomer and correctly apply it to different troubleshooting scenarios.

Application Monitoring 101: Queue Time Can Alert Before a Breakdown

Regular monitoring practices can emphasize application response time, but queue time is also often an early and important warning sign. If it rises, you’ll quickly see downstream effects: tail latency, timeouts, and error spikes. This means that this metric can give you a head start tackling app issues before they become user problems. In this post, we’ll discuss queue time, how things can go off track, and practical steps to turn it around.
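One common way to measure queue time is to have the load balancer or proxy stamp each request with its arrival time and subtract that from the moment the application starts handling it. The sketch below assumes a header value of the form `t=<epoch seconds>`, a convention often used with an `X-Request-Start` header; the function name and header format are assumptions for illustration, not something the post specifies.

```python
def queue_time_ms(request_start_header: str, app_start_epoch_s: float) -> float:
    """Milliseconds a request waited between the proxy and the application.

    `request_start_header` is assumed to look like "t=1700000000.123",
    i.e. the proxy's arrival timestamp in epoch seconds.
    """
    proxy_epoch_s = float(request_start_header.removeprefix("t="))
    # Clamp at zero so small clock skew between hosts never yields a negative wait.
    return max(0.0, (app_start_epoch_s - proxy_epoch_s) * 1000.0)


waited = queue_time_ms("t=1700000000.000", 1700000000.250)
skewed = queue_time_ms("t=1700000000.500", 1700000000.250)
```

Recording this value per request and alerting on a sustained rise gives the early warning described above, before response times and error rates start to move.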

Gartner I&O and Cloud Strategies Conference 2025: From Observability to Outcome-Driven Operations

This year’s Gartner IT Infrastructure, Operations and Cloud Strategies Conference made one thing abundantly clear: the industry is moving beyond reactive monitoring and isolated dashboards toward autonomous, outcome-driven IT operations. While AI and agentic automation dominated keynotes and vendor messaging, conversations on the show floor reflected a more grounded reality.

Centrally set up and scale monitoring of your infrastructure and apps with Datadog Fleet Automation

Setting up and scaling observability across large, distributed environments often requires platform and SRE teams to coordinate access to infrastructure hosts and switch between configuration management tools and product-specific documentation. These tasks increase setup time and create delays in establishing visibility of critical services in Datadog. As teams expand their infrastructure, they need to coordinate Datadog configuration changes in a consistent and auditable way.

Text-to-Alert: Generating Netdata Alerts from Natural Language

Netdata has an incredibly powerful alerting engine. But this can sometimes be a double-edged sword: the flexibility to build incredibly specific, intelligent alerts is immense, but mastering its syntax can feel like learning a new language. We’ve heard this from so many of you. You tell us that configuring alerts is often the steepest part of the learning curve, a task that falls to the one “Netdata expert” on the team who has spent the time digging through the documentation.

A Year in Internet Analysis: 2025

This year-end wrap-up covers topics from BGP security (including ASPA and excessive AS-SETs) and the geopolitical (Ukraine’s IPv4 exodus, the Iran internet shutdown, and Red Sea cable cuts) to the year’s most significant outages (TikTok, the Spain/Portugal blackout, and cloud failures at AWS, Azure, and Cloudflare). Plus, we explore Starlink’s new Community Gateways, and revisit the evolving landscape of AS ranking and OTT service tracking.

Tail sampling vs. head sampling in distributed tracing

In this video, Grafana Labs' Robin Gustafsson (CEO for K6 + VP, Product) and Sean Porter (Distinguished Engineer) discuss the differences between head sampling and tail sampling approaches in distributed tracing. They explore why head sampling often amounts to sampling randomly and hoping for the best, while tail sampling — the approach used by Adaptive Traces in Grafana Cloud — allows you to intelligently capture the traces that actually matter to you.
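The difference between the two approaches comes down to when the keep-or-drop decision is made. The following is a simplified sketch, not the Adaptive Traces implementation: the function names, the span dictionary shape (`duration_ms`, `error`), and the latency threshold are all assumptions made for the example.

```python
import random


def head_sample(rate: float, rng: random.Random) -> bool:
    """Decide when the trace starts, before its outcome is known."""
    return rng.random() < rate


def tail_sample(spans: list[dict], latency_threshold_ms: float = 1000.0) -> bool:
    """Decide after the trace completes, using what actually happened."""
    has_error = any(span.get("error", False) for span in spans)
    is_slow = any(
        span.get("duration_ms", 0.0) > latency_threshold_ms for span in spans
    )
    return has_error or is_slow


rng = random.Random(42)
failed_trace = [{"duration_ms": 20.0, "error": True}]
slow_trace = [{"duration_ms": 1500.0, "error": False}]
routine_trace = [{"duration_ms": 20.0, "error": False}]

kept = [tail_sample(t) for t in (failed_trace, slow_trace, routine_trace)]
```

Head sampling keeps or drops the failed trace purely by chance, while the tail-based rule always keeps the failed and slow traces and drops the routine one. The cost is that complete traces must be buffered until the decision can be made.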

Logging Best Practices (Grafana OpenTelemetry Community Call)

We’re back with a new Grafana OpenTelemetry Community Call episode, and this time we’re diving into logging with OpenTelemetry and Grafana Loki! Even better, we’re joined by two fantastic guests: Jack Berg, OTel logging expert, and Ed Welch, Loki guru. Getting both of them in one conversation makes for an amazing deep-dive into all things logging. Logs come in every shape and size, from simple CLI output to massive distributed systems generating petabytes of structured data. In this episode, we’ll talk about.

Building a Code Review system that uses prod data to predict bugs

This post takes a closer look at how Sentry’s AI Code Review actually works. As part of Seer, Sentry’s AI debugger, it uses Sentry context to accurately predict bugs. It runs automatically or on-demand, pointing out issues and suggesting fixes before you ship. We know AI tools can be noisy, so this system focuses on finding real bugs in your actual changes—not spamming you with false positives and unhelpful style tips.

Site24x7's Kubernetes monitoring | Proactive, scalable, AI-powered

Kubernetes drives modern cloud-native applications, but its distributed nature creates visibility and performance challenges at scale. In this video, discover how Site24x7 provides real-time monitoring, AI-powered anomaly detection, and scalability for Kubernetes environments, helping you to proactively manage resources and resolve issues faster. Key features of Site24x7 Kubernetes Monitoring: Whether you're running a single Kubernetes cluster or managing multiple environments, Site24x7 helps you ensure peak performance and faster decision-making with minimal manual intervention.

What is DEX? And Why DEX is Important

Digital Employee Experience (DEX) refers to how employees interact with the digital tools, systems, and technologies they use at work, and how those interactions affect their productivity, satisfaction, and overall work experience. DEX encompasses the quality of the digital interactions and services that employees encounter while using workplace technologies. It includes various factors such as application performance, network connectivity, device usability, and overall user satisfaction.

Debug Faster with Chrome + Rollbar Debugging Assistant

Context switching is one of the biggest hidden productivity killers in debugging. Jumping between multiple open browser tabs slows momentum and increases cognitive load, especially when you’re trying to diagnose an issue under pressure. Google Chrome's new split screen feature, paired with Rollbar Debugging Assistant, enables a faster, more focused way to troubleshoot errors without constantly losing your place.

From Zero to Open Source Contributor

Never contributed to open source and feeling intimidated? Same. Before joining Datadog, Alessandro had zero open source experience. Now he's a regular contributor to Apache Iceberg. Here's exactly how he got started. Step 1: Join the Slack community and answer user questions. Step 2: Look for "good first issue" tags in the repo. Step 3: Remember that opening bug reports and doing code reviews count as contributions too.

The Observability Stack is Collapsing: Why Context-First Data is the Only Path to AI-Powered Root Cause Analysis

By Bill Balnave, VP of Customer Success at Mezmo. The core promise of modern observability is simple: cut Mean Time To Resolution (MTTR). Yet, despite a boom in tooling and investment over the last four years, the data tells a sobering story: our industry is actually getting worse at finding and resolving issues. Dashboards, once our trusted guide, have become the starting point for a chaotic "dashboard hunt" that rarely leads to the definitive root cause.

What's New in InfluxDB 3.8: Linux Service Management, Kubernetes Helm Chart, and Smarter Ask AI

InfluxDB 3.8 is now available for both Core and Enterprise, alongside the 1.6 release of the InfluxDB 3 Explorer UI. This release is focused on operational maturity and making InfluxDB easier to deploy, manage, and run reliably in production. InfluxDB 3 Core remains free and open source under MIT and Apache 2 licenses, optimized for recent data. InfluxDB 3 Enterprise builds on that foundation with long-range querying, clustering, security, and full operational tooling.

How To Connect Your Prometheus Server to a Grafana Datasource

Prometheus is one of the most popular open-source monitoring systems in the world. It’s lightweight, easy to deploy, and pairs beautifully with Grafana for dashboards and alerting. If you're running applications or infrastructure on Linux, Prometheus plus one of many Exporters (Redis, NVIDIA GPU, Nginx, etc.) gives you deep visibility into service performance - quickly and reliably.

Why OpenTelemetry instrumentation needs both eBPF and SDKs

As a vendor-neutral open standard, OpenTelemetry has become the default choice for application instrumentation. However, it’s important to remember that OpenTelemetry isn’t a single technology — it’s an ecosystem. Under the hood, it provides multiple options for instrumenting your applications. In this blog post, we explore two instrumentation approaches: OpenTelemetry eBPF Instrumentation and runtime-specific OpenTelemetry SDKs, like the OpenTelemetry Java agent.

Episode 3 - Where AI Meets Legacy Systems

In this episode of The Intelligent Enterprise, host Tom Stoneman gets inside a challenge many enterprises are facing right now: how to integrate AI with complex legacy systems without breaking what already works. This week, Tom sits down with Yael Gómez, Fractional Chief Technology Officer and Chief Information Officer at Pet Madness, and former technology leader at Walgreens Boots Alliance.

Capture high-value traces without managing a pipeline: Tail sampling with Adaptive Traces

Tracing is the richest observability signal in common use today. In distributed systems, it reveals how requests flow across multiple services, allowing you to uncover and address performance bottlenecks. Teams often scale back or abandon tracing altogether, however, because most successful requests produce redundant data that’s noisy and expensive to store.

7 Strategies for IT Ops Teams to Monitor and Optimize Real-Time Commodity Pricing Systems for Financial Reliability

Real-time commodity pricing systems have become mission-critical infrastructure for financial institutions, trading desks, and enterprise resource planning operations. As of December 2025, with 72% of trading firms migrating to cloud-native CTRM and ETRM platforms, IT Ops teams face mounting pressure to maintain pricing accuracy, minimize latency, and ensure system resilience during volatile market conditions.

Top SaaS Vendors DevOps Teams Should Monitor in 2025

Modern applications rely on dozens of third-party services to function properly. When these services fail, your application fails too. DevOps teams need to identify and monitor the top SaaS vendors that could impact their infrastructure and user experience. This guide covers the essential SaaS vendors DevOps teams should monitor, organized by category and criticality. We'll explore why each vendor matters and what specific aspects require monitoring.

IT infrastructure monitoring: Leaner, stronger, more intelligent, and a huge progression

IT infrastructure as a technology has leapfrogged in 2025 and Site24x7 is no exception. CTOs, SREs, sysadmins, and other IT personnel wanted more from server monitoring and observability tools—and we stepped up. We listened to you, the industry leaders in your respective spaces, and re-envisioned our product platforms. The result?

Cloud observability in focus: How Site24x7 strengthened cloud monitoring in 2025

Cloud monitoring became more important as enterprises scaled distributed systems, multi-region deployments, and hybrid environments. Teams needed better cloud performance insights, clearer resource usage visibility, and stronger automation to prevent outages and control costs. This year, Site24x7 delivered a rich set of cloud monitoring updates across AWS, Azure, GCP, and OCI, helping teams stay ahead of issues and optimize their cloud footprint.

The Hidden Costs and Concerns of Iceberg Maintenance

Everyone talks about how great Apache Iceberg is, but nobody warns you about this: without proper maintenance, your tables will bloat, queries will slow down, and your catalog will run out of memory. Here are the 4 critical operations you MUST run regularly. Expiring snapshots prevents metadata bloat (Datadog learned this the hard way with catalog memory pressure). Deleting orphan files cleans up failed writes. Compacting data files keeps streaming workloads fast. Compacting manifests optimizes query planning.

Improve log utilization with Datadog log exclusion filters | Datadog Tips & Tricks

Want to make your logs easier to work with? Excluding unneeded logs from indexing reduces noise and may reduce log management costs. In this video, you’ll learn how to: See for yourself how to improve log utilization with Datadog Log Patterns and log exclusion filters. Then set up an alert to track ingestion spikes.

Creating the IPM Category: Catchpoint's Journey to Leadership and the LogicMonitor Era

On December 15, 2022, Catchpoint launched Internet Performance Monitoring (IPM) as a new category for monitoring solutions with our foundational article, “What is Internet Performance Monitoring and How is it Different from APM?” In it, we said: How prophetic those words turned out to be.

The 2026 VMUG Report: Why Network Observability is the Heart of the New VCF Era

The cloud landscape is no longer just about "getting to the cloud"—it is about mastering the complexity once you are there. For organizations using VMware Cloud Foundation (VCF), the stakes have never been higher. As infrastructure converges, the margin for error shrinks, and the need for precision grows. To understand how the industry is navigating these changes, we dive into the VMUG Cloud Operations and VCF User Experience Report 2026.

Why 2025 Shattered the Old Rules of Network Management

December has arrived. The change freeze is looming, and the holiday requests are likely piling up in your inbox right now. It is the natural time for you to look back at the last twelve months, not just to measure your team's performance, but to consider how much the game itself has changed. If you look at the trajectory of your industry this year, a clear pattern emerges. You didn't just face new technical challenges; you faced a genuine shift in what it means to manage a network. The old metrics broke.

Migrating from SolarWinds to WhatsUp Gold: The Ultimate Guide

Looking for a reliable SolarWinds alternative? If you’ve been comparing SolarWinds and WhatsUp Gold, you’re not alone. Many IT teams are evaluating which network monitoring solution offers better performance, flexibility and cost efficiency. In this guide, we’ll walk you through a step-by-step migration from SolarWinds to WhatsUp Gold, highlighting key differences, benefits and best practices to maintain a smooth transition.

About us - Sumo Logic

Security teams are flooded with thousands, or even millions, of signals every day. Sumo Logic’s entity-based SIEM and Dojo AI agents automate the manual work of detection, triage, and remediation so you can act faster on the alerts that matter. Discover how Sumo Logic simplifies security operations, helping you cut through the noise and protect your digital world.

OpenTelemetry Agents - The Complete Beginner's Guide (2025)

If you search for “OpenTelemetry Agent”, you will likely encounter two completely different definitions. This ambiguity often leads to confusion between infrastructure teams and application developers. SREs and DevOps engineers would describe it as a component deployed as a sidecar, whereas application developers would understand it as a language-specific library. Let’s break it down in the next section.

.NET Web API Monitoring: REST, ASP.NET & WCF Compared

Modern .NET applications rely on three primary Web API architectures: lightweight REST APIs, middleware-driven ASP.NET Core Web APIs, and contract-heavy WCF SOAP services. Each exposes functionality over HTTP, but each behaves very differently in production. More importantly, each architecture fails in different ways, which means teams must monitor them differently to maintain reliability, uptime, and predictable performance.

Easy-to-Use SSL Certificate Management Tool: The Complete Guide

Managing SSL certificates can feel complicated for many teams. It often involves remembering renewal dates, keeping track of certificate details, ensuring that websites stay secure, and avoiding costly downtime caused by expired or invalid certificates. While many people call these solutions “SSL certificate management tools,” most organizations do not need a full PKI automation platform.

Synthetic Application Monitoring: Proactive Strategy to Prevent Downtime

Imagine this: It’s three in the morning on Black Friday. Your phone lights up with alerts: your online store’s checkout isn’t functioning properly. Your team is in a panic, sales are dropping by the minute, and social media is full of complaints from your customers. By the time you determine that the problem is an expired third-party payment gateway, you’ve lost hours of sales and your customers’ trust.

Mobile App Synthetic Monitoring enables proactive testing across devices and networks

In the mobile-first digital economy, your application’s performance is your brand’s frontline. Your backend is fast. Your APIs respond in milliseconds. Yet, somewhere on a slow network in a bustling city center, a user is staring at a frozen login screen. This scenario highlights a critical truth. App synthetic monitoring is the proactive discipline of simulating real user interactions—like app launches, logins, searches, and checkouts—from real devices and networks worldwide.

How to Connect Your MySQL Instance to a Grafana Datasource

Grafana’s MySQL datasource makes it easy to turn raw database rows into clean, interactive dashboards. Whether you're testing out a new monitoring setup or experimenting with time-series data, MySQL + Grafana gives you a powerful foundation for building visualizations quickly.

Gisual Enters the Stack: Power, AI, and the Next Phase of Observability

ScienceLogic recently partnered with Gisual, a leader in AI power intelligence, to bring real-time power insight directly into the ScienceLogic AI Platform. On the surface, that might sound like a straightforward integration story. In reality, it signals something much bigger: observability continues to expand well beyond the digital stack, and operators now treat power as a first-class operational signal.

[Workshop] Building and Monitoring AI Agents and MCP servers

See how Agent Monitoring gives you a better look at all things model usage, call duration, prompting, and more. Go under the hood with MCP Monitoring and learn how to debug client connection issues, tool call performance, transports, and all things MCP. When things start breaking, use Seer, Sentry's AI Debugging Agent, to troubleshoot those vague issues that are crashing, and get help from a team of robots using Sentry’s AI PR Review.

Monitor and reduce your mobile app size with Size Analysis (beta)

Note: This blog post was originally published for the Early Access of Size Analysis. If you're already familiar with Size Analysis in Sentry, go to the section titled What's new in the beta. If you're not familiar with Size Analysis, start at the section titled The curious case of man.jpg.

Instrumentation Hub: a guided, scalable way to roll out observability coverage without losing control

Getting started with observability in a modern, fast-moving environment is harder than it should be. Open-standards-based observability promises flexibility and vendor neutrality, but in practice it often introduces significant complexity and delays meaningful coverage by months or even years. Each layer of the stack requires its own instrumentation approach, and every technology, runtime, and library version comes with unique setup steps, tradeoffs, and rough edges.

The year in AI at Grafana Labs

2025 was the year we at Grafana Labs went all-in on AI—and boy, what a year it was. Not only did we establish and start to execute our overarching strategy (build actually useful AI), we also took one of our most exciting new features (Grafana Assistant) from idea to general availability in just nine months! Yes, there's no shortage of articles singing the praises of AI these days, but let's dispense with the hyperbole and focus on some actually useful content.

Nexthink Recognized as a Customers' Choice in Gartner Peer Insights Voice of the Customer for Digital Employee Experience Management Tools

We’re thrilled to share the exciting news that Nexthink has been recognized as a Customers’ Choice in the inaugural 2025 Gartner Peer Insights Voice of the Customer for DEX Tools. In our view, what makes this recognition truly special is that it comes directly from the people who know our platform best – the IT leaders who use Nexthink every single day. In addition, we have been recognized as a Leader in the Gartner Magic Quadrant for DEX Management Tools for the second consecutive year.

Spotify's performance & control across large monitoring environments with VictoriaMetrics

When your active time series is in the billions and the total number of data points you need to monitor runs into the tens of trillions, you need a high-performance observability solution with operational simplicity. Streaming behemoth Spotify is one such case. Their observability team chose VictoriaMetrics as the fastest monitoring and observability solution on the market.

Tech Talk - Splunk Observability for AI

In this Tech Talk, we’ll show you how Splunk’s agentic, AI observability delivers end-to-end visibility of the entire AI stack, from agents and large language models (LLMs) to the underlying infrastructure. You’ll see how AI Infrastructure Monitoring provides teams with data-dense dashboards and detectors for surfacing trends, patterns, and outliers to correlate application health with underlying AI infrastructure performance.

Tech Talk - Take action automatically on Splunk alerts with Red Hat Ansible Automation Platform

As digital and AI applications become more prevalent, the need for fast, efficient, and consistent management of IT operations is critical. This session will show you how to automate responses to Splunk Observability Platform alerts using Red Hat Ansible Automation Platform's Event-Driven Ansible.

Setting up OpenTelemetry Demo in Kubernetes with Splunk Observability Cloud

Are you looking to explore the power of OpenTelemetry and Splunk Observability Cloud in a Kubernetes environment? This video provides a comprehensive, step-by-step walkthrough on how to deploy the OpenTelemetry Demo application in Kubernetes and seamlessly integrate it with Splunk Observability Cloud for metrics, traces, and logs! In this tutorial, you'll learn.

Building visibility and resilience across Kubernetes

Kubernetes has transformed how modern applications are deployed and scaled. Its flexibility and automation power innovation but also expand the attack surface. From control plane access to runtime drift, Kubernetes introduces layers of complexity that can obscure visibility if not properly monitored. For security leaders, Kubernetes is both an opportunity and a risk. While it enables agility, it also decentralizes security responsibility across teams, tools, and cloud layers.

Heroku vs. Kubernetes

If you are deciding where to deploy a web app, you will almost always run into a choice between a platform like Heroku and running on Kubernetes. This article compares these two popular platforms for deploying and managing applications, breaking down the key differences in architecture, use cases, complexity, cost, and scalability to help engineers choose the right platform for their needs.

Introducing the Databricks Destination: Powering governed, scalable analytics from day one

Modern enterprises are generating more high-volume observability and security data than ever, which means the cost and complexity of getting analytics-ready data into Databricks are only growing. With the new Databricks Destination for Cribl Stream, organizations finally have a governed, scalable, and cost-efficient way to take full control of their data pipelines, accelerate AI-driven analytics, and unlock real business value from their Databricks investment.

ServiceNow and Grafana: How to receive Grafana alert payloads via ServiceNow's scripted REST API

When you integrate Grafana-managed alert rules with ServiceNow, you can automatically capture and process alerts in ServiceNow’s events table—a common entry point for incident workflows, escalations, and ticket creation. And if you configure ServiceNow to receive Grafana Alerting payloads using ServiceNow’s scripted REST API, you can parse Grafana’s JSON alert payloads and insert them into a ServiceNow table.
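The parsing step described above can be sketched as follows. This is a minimal illustration in Python rather than the server-side JavaScript a ServiceNow scripted REST resource would actually run. The incoming fields (`alerts`, `labels`, `annotations`, `startsAt`, `fingerprint`) follow the general shape of a Grafana Alerting webhook payload, while the sample values, the `alerts_to_event_rows` helper, and the output column names are assumptions made for illustration and are not ServiceNow's event table schema.

```python
import json

# Hypothetical sample shaped like a Grafana Alerting webhook payload.
SAMPLE_PAYLOAD = json.dumps({
    "receiver": "servicenow",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "HighCPU",
                "instance": "web-01",
                "severity": "critical",
            },
            "annotations": {"summary": "CPU above 90% for 5 minutes"},
            "startsAt": "2025-12-10T02:52:00Z",
            "fingerprint": "abc123",
        }
    ],
})


def alerts_to_event_rows(raw_body):
    """Flatten each alert in the payload into one event-style row."""
    payload = json.loads(raw_body)
    rows = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        rows.append({
            # Column names are illustrative, not ServiceNow's schema.
            "source": "Grafana",
            "node": labels.get("instance", "unknown"),
            "type": labels.get("alertname", "unknown"),
            "severity": labels.get("severity", "unknown"),
            "state": alert.get("status", payload.get("status", "unknown")),
            "description": annotations.get("summary", ""),
            "message_key": alert.get("fingerprint", ""),
            "time_of_event": alert.get("startsAt", ""),
        })
    return rows


rows = alerts_to_event_rows(SAMPLE_PAYLOAD)
print(rows[0]["type"], rows[0]["state"])
```

Using the alert fingerprint as a message key is one possible way to let repeated notifications for the same alert update a single record instead of creating duplicates; whether that suits a given workflow depends on how the target table is configured.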

GEOFF WRIGHT RETURNS: 2025 EOY SPECIAL EPISODE!

In our traditional end-of-year DEX Show special episode, Mondelez’s Geoff Wright returns to unpack a wild 2025 for IT, AI and employee experience. Tim, Tom and Geoff riff on AI agents that shop, plan travel and work across your browser tabs, the coming street fight between Windows and Chromebooks, and why younger workers just want a browser and to be left alone. Geoff explores shadow AI, culture and the human resistance to change, plus his Q1 predictions: Google’s big enterprise push, soaring laptop costs, and why experience, empathy and a good laugh still matter more than any shiny new model.

Synthetic End User Monitoring simulates complex user journeys across global environments

While traditional monitoring solutions provide valuable infrastructure metrics, they fundamentally lack the capability to understand what users actually experience. There is a significant technical gap between server-side metrics and client-side experience. Research shows that traditional monitoring fails to detect 52–68% of user-facing errors since they happen outside of the server infrastructure.

Setup and Explore OpenTelemetry Demo Application (with Examples)

“Everyone knows that debugging is twice as hard as writing a program in the first place. So, if you’re as clever as you can be when you write it, how will you ever debug it?” (Brian W. Kernighan and P. J. Plauger, The Elements of Programming Style, 2nd ed.) Maybe you can let SigNoz do some heavy lifting for you!

Best Certificate Monitoring Solutions With Slack/Teams Integration: The Complete Guide

SSL certificates expire silently. When they do, websites instantly break. Users see warnings. Traffic drops. Security trust is damaged. This is why businesses now rely on certificate monitoring solutions that send alerts before a certificate expires. A growing number of teams want these alerts directly inside Slack or Microsoft Teams, because that’s where their operations already work every day.

Training Foundation Models on a Trillion Data Points with Apache Iceberg

Training an AI foundation model on over a trillion data points sounds impossible without hitting your production systems. Here's how Datadog did it with Apache Iceberg for their time series forecasting model TOTO. The key challenge: extracting massive historical observability data (metrics spanning years) and running incremental preprocessing pipelines without overwhelming production services. Iceberg solved this by providing schema governance, consistency guarantees, and seamless integration with ML tools like Ray and PyTorch.

Why Monitoring the Physical Environment Matters: From Data Centers to Factory Floors

Physical environment monitoring is the practice of measuring and tracking environmental conditions that directly affect equipment, people, and operational continuity. While digital systems dominate modern operations, physical conditions still determine whether those systems perform reliably or fail unexpectedly. A single temperature spike, humidity imbalance, or power fluctuation can undo layers of software redundancy.

What broke during the Trello outage on December 12

In the early hours of December 12, 2025, Trello experienced a disruption that affected teams around the world. Users began reporting that boards would not load, workspaces were inaccessible, and error messages appeared without warning. For a period of time, Trello’s official status page continued to show normal operations, even as real world usage indicated otherwise.

Save the logs, save the planet: How to make your observability stack greener

If data centres were a country, they’d rank fifth in electricity consumption by 2026. Over the past few years, the resulting carbon footprint of the technology industry has sparked the fast-growing green software movement, led by the Green Software Foundation. How can we continue to innovate software in a way that also minimises its impact on the environment? This has been a fascinating problem I’ve been exploring for a few years now.

Bright Ideas: Measuring the ROI of AI Adoption in Financial Services

If there is one truth I have learned working with financial services firms in 2025, it is this: AI is no longer optional, it is operational. From risk modeling to customer experience, algorithmic trading to automated compliance checks, AI is now embedded into the fabric of modern finance. But there is a second, quieter truth. AI only creates value when it is used responsibly, measurably, and at scale.

VictoriaMetrics Achieves Red Hat OpenShift Operator Certification

VictoriaMetrics has achieved Red Hat OpenShift Certification, awarded to Red Hat partners who meet requirements for delivering a scalable, supported, and secure operator designed for enterprise cloud deployments. VictoriaMetrics is available on the Red Hat OpenShift OperatorHub. The program certified VictoriaMetrics as a solution that allows for portability and operational efficiency across hybrid and multi-cloud environments.

OpenTelemetry Metrics with 5 Practical Examples

Picture this: your observability tool already nails the basics like request rates, latency and memory usage, but you need more insight. Think user churn rates, engagement spikes, or even how many carts get abandoned mid-checkout. That’s where OpenTelemetry steps in, providing a way to track those critical custom metrics with ease.

How Inkeep Monitors Their AI Agent Framework with SigNoz

AI agents are fundamentally different beasts to monitor compared to traditional applications. A single user request can trigger a cascade of 10+ internal operations: sub-agent transfers, tool executions, LLM calls, API requests, each with unpredictable latency and failure modes. When something goes wrong (and with LLMs, things go wrong in creative ways), you need to see the entire execution flow to debug effectively.

What Broken Checkouts Really Cost: Why Transaction Monitoring Pays For Itself

Broken checkouts lead to lost transactions, drain revenue, undermine customer trust, and damage brand credibility. Unfortunately, most companies don't realize their checkout is failing until sales drop or customers start complaining. According to statistics, technical issues cause checkout abandonment in at least 17% of cases. This means nearly one-fifth of lost conversions are preventable. For any online business, even a small checkout failure can result in significant revenue loss.

How to Use MCP to Optimize Your Graylog Security Detections

Security teams face a critical question: “What logs should we collect, and what detections should we enable to protect against threats targeting our industry?” For a bank in the northeast, this isn’t academic. Threat groups like FIN7, Lazarus Group, and Carbanak specifically target financial institutions with sophisticated attacks ranging from SWIFT compromise to ransomware.

Overcoming ClickHouse's JSON constraints to build a high-performance JSON log store

Customer log data is always messy. Being (and building!) an observability platform, we get to see all the beautiful, creative ways it can be messy, every single day. And yet, our customers expect, quite fairly, I might add, perfect query results and peak performance. SigNoz is an open-source observability platform that can be your one-stop solution for logs, metrics and traces.

How to Track Cloud Costs in Real-Time Instead of Waiting Days

Tired of waiting days to see your AWS bill spike? Datadog solved this problem using Apache Iceberg to deliver real-time cloud cost visibility - updating every 15 minutes instead of waiting for billing data. Here's how it works: They sync real-time resource inventory (EC2 instances, Kubernetes pods) into Iceberg tables, then use Trino to join those snapshots with unit pricing data. The result? FinOps teams can catch cost anomalies before they become budget disasters.
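The join described above can be illustrated with a small in-memory sketch. It does not use Iceberg or Trino and is not Datadog's implementation. It only shows the idea of multiplying each resource's running time in a snapshot by a unit price and summing per team. The resource IDs, team names, prices, and the `estimate_costs` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory snapshot covering one 15-minute window:
# (resource_id, instance_type, team, hours_running)
inventory_snapshot = [
    ("i-001", "m5.large", "checkout", 0.25),
    ("i-002", "m5.large", "search", 0.25),
    ("i-003", "c5.xlarge", "search", 0.25),
]

# Hypothetical unit prices in USD per hour; these are not real AWS prices.
unit_prices = {"m5.large": 0.10, "c5.xlarge": 0.20}


def estimate_costs(snapshot, prices):
    """Join inventory rows with unit prices and sum estimated cost per team."""
    totals = defaultdict(float)
    for resource_id, instance_type, team, hours in snapshot:
        price = prices.get(instance_type)
        if price is None:
            continue  # no known price: skip the row rather than guess
        totals[team] += hours * price
    return dict(totals)


costs = estimate_costs(inventory_snapshot, unit_prices)
for team, cost in sorted(costs.items()):
    print(f"{team}: ${cost:.3f}")
```

Because the estimate depends only on inventory and a price list, it can be recomputed for every snapshot without waiting for billing data, which is what makes the short refresh interval possible in principle.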

AI Observability in 2026: Why the data layer means everything

If there was ever a year for AI observability, it was 2025. Vendors released assistants to cover a variety of use cases. Coralogix released the first agent (distinct from assistants!), Olly, an autonomous, multi-agent observability platform. The direction of travel is clear, but many vendors and users are about to run into some significant problems with their data layer.

Complete Web Performance Strategy with Web Synthetic Monitoring

You’ve optimized your code, implemented caching strategies, and configured your CDN perfectly. Your analytics dashboard shows respectable load times, and your development team reports everything is running smoothly. Yet, conversion rates remain stagnant, bounce rates climb during peak hours, and your competitors consistently outperform you in user experience metrics. What’s missing?

Top OpenTelemetry Backends for Storage & Visualization

OpenTelemetry backends provide storage, analysis, and visualization for telemetry data (traces, metrics, logs). This guide lists available OpenTelemetry-compliant backend options, categorized by use case: APM platforms, storage backends, visualization tools, and distributed tracing systems. For detailed comparison, see OpenTelemetry Backend Comparison.

Accelerating IT Transformation with Agentic AI

As enterprises face increasing pressure to manage vast and complex IT environments, the demand for faster and more efficient IT management is rising. Traditional operating methods are proving insufficient, making the adoption of Agentic AI essential for organizations aiming to achieve truly autonomous IT operations. This innovative technology enhances decision-making and enables businesses to remain agile in a rapidly evolving digital landscape.

From performance to impact: Bridging frontend teams through shared context

Connecting day-to-day development work to real user outcomes can be challenging. As a result, engineers and product teams often struggle to effectively prioritize projects together. While the goal of improving user experience (UX) is the same, each team relies heavily on different—and often siloed—forms of monitoring to understand their app, creating a disconnect in metrics and visualizations that can be hard to communicate.

Monitor your Kubernetes operators to keep applications running smoothly

The performance of your Kubernetes operators often influences the behavior of the applications they manage. Operators automate the day-to-day management of your applications by executing critical activities, which may include scaling replicas, performing upgrades, and recovering from failures. For example, a PostgreSQL operator can ensure that standby servers are always deployed, that the database’s failover is correctly configured, and that data is backed up on schedule.

Beyond the Dashboard: Integrating Network Monitoring with Your IT Ecosystem

Discover how Progress WhatsUp Gold network monitoring can be extended with built-in and community-driven integrations by joining us for our webinar, Beyond the Dashboard: Integrating Network Monitoring with Your IT Ecosystem. Our product experts will showcase: NetBox-WUG Sync for automated asset management; the WhatsUp Gold PS PowerShell module for scripting with the REST API; and native and custom integrations with ServiceNow, Microsoft Teams and Slack.

Reporting Exceptions to Honeycomb with Frontend Observability

So you've built a client application and you've started sending telemetry. The information sent back by this client is vital to you, and one of the first things you care about is capturing and reporting errors. There are at least two ways to report error details in OpenTelemetry. Web applications generally place exceptions in trace spans as span events, and mobile applications send exceptions as log messages instead.

Lean Operations for a Fragmented Middleware World: Why Efficiency, Resilience and Compliance Now Depend on a New Model

Fragmented middleware estates create hidden costs, operational drag, and growing compliance risk. Learn why lean operations, unified visibility, and built-in auditability are now essential for modern messaging and streaming environments.

Graylog Guided Demo

Have a sneak peek at Graylog V7.0. Graylog V7.0 introduces a major step forward in speed, usability, and visibility across your entire security and operations workflow. In this demo, we walk through the newest capabilities designed to help teams detect, investigate, and respond faster than ever. You’ll see how the updated interface streamlines daily tasks, how the enhanced search and pipeline tools simplify complex data handling, and how powerful additions like built-in correlation and modernized dashboards give you clearer insight with less effort.

Scrapers Take Down GitHub: December 11 Outage Timeline

On December 11, 2025, GitHub experienced intermittent disruptions that frustrated users across the globe. Developers everywhere started seeing random errors, 503s, unicorns, and CI pipeline failures. Very quickly it became clear something was wrong, even though GitHub’s status page still said ALL SYSTEMS OPERATIONAL. After the incident was over, GitHub published a postmortem that revealed the cause: scrapers. Automated tools hit GitHub with enough traffic to overwhelm key backend systems.

The Impact of Network Downtime on Enterprise Productivity - and How Monitoring Helps

Enterprise IT teams operate under relentless pressure to maintain seamless connectivity, yet many business leaders underestimate the financial gravity of Network Downtime. Studies consistently show that even a brief outage can cost enterprises hundreds of thousands of dollars per hour, positioning downtime as one of the most disruptive threats to business continuity.

Grafana Tempo: Upcoming 2.10/3.0 Releases (Community Call December 2025)

On the agenda: the upcoming 2.10/3.0 releases and a new maintainer, Oleg. Have questions? Please bring them! Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, traces, and profiles. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more.

Obkio 2025 Year in Review

2025 was big! This year, we stopped talking about what Obkio could be and started showing what it is: a full network observability platform built for the networks you actually run. We released features that solve real problems. We showed up where network pros gather. And we proved that a Canadian-built tool can compete with anyone. Here's what happened.

Using AI + Rollbar's Session Replay to Understand Complex Errors

Front‑end bugs are notoriously hard to reproduce. By the time an error shows up in your monitoring tool, the most important context is already gone: what the user actually did. Session replay helps, but only if someone has the time and patience to scrub through recordings, correlate events, and form a hypothesis. That’s where Rollbar’s MCP server, paired with an AI agent like GitHub Copilot, changes the game.

OTel Updates: OpenTelemetry Proposes Changes to Stability, Releases, and Semantic Conventions

Over the past year, the Governance Committee ran user interviews and surveys with organizations deploying OpenTelemetry at scale. A few patterns came up consistently: Stability levels aren't always obvious. When you install an OTel distribution, some components might be experimental or alpha without clear markers. This makes it harder to evaluate what's production-ready. Instrumentation libraries sometimes wait on semantic conventions.

How to Handle Cloud Monitoring Overload?

Reduce alert noise by 70% through intelligent aggregation, clear ownership boundaries, and filtering metrics that don't map to user-facing issues. Monitoring starts with a straightforward goal: understand your system's health and identify issues before users notice them. You set up metrics, create dashboards, and configure some alerts. At first, it works well. Over time, your stack gets bigger and more complicated. New services get added.

Let's Encrypt 45-Day Certificate Expiration: Monitoring & More

The move by Let’s Encrypt from 90-day certificates to 45-day certificates is more than a policy shift. It changes how teams must manage renewals, detect failures, and validate that certificates are deployed consistently across distributed systems. A shorter lifecycle compresses the margin of error. Automation that previously limped along unnoticed now breaks on a far tighter schedule. And every misconfiguration hits users faster.

How AI Agents automate incident response #ai #cybersecurity #telemetry

Clint Sharp demonstrates how Cribl Search leverages AI to streamline incident investigation. Starting from a Slack channel, the AI builds an interactive notebook, analyzes order processing logs, and identifies suspicious traffic spikes. It connects high CPU usage to a recent Jenkins deployment, hypothesizing a supply chain attack, and ultimately recommends a rollback. This isn't a far off concept. It is the future of operations arriving right now.

Why AI agents need a common data model #ai #telemetry

Clint Sharp explains why a common model like OCSF is critical for the future of AI. Agents need standardized data to analyze information effectively on your behalf. He contrasts the traditional manual workflow of checking Slack, tickets, and wikis while asking colleagues with a future where AI fuses this human context with machine data. Instead of just search results, AI agents will hand you examined hypotheses so you know exactly where to take your investigation.

How to use AI to analyze and visualize CAN data with Grafana Assistant

Note: A version of this post originally appeared on the CSS Electronics blog. Martin Falch, co-owner and head of sales and marketing at CSS Electronics, is an expert on CAN bus data. Martin works closely with end users, typically OEM engineers, across diverse industries, including automotive, maritime, and industrial. He is passionate about data visualization and AI—and he’s been working extensively with Grafana Assistant.

Using AI + Rollbar's Session Replay to Understand Complex Errors

Front‑end bugs are notoriously hard to reproduce. By the time an error shows up in your monitoring tool, the most important context is already gone: what the user actually did. By letting an AI agent like Copilot analyze Rollbar's session replay data directly, teams can move from “something broke” to “here’s exactly why it broke” in minutes, not hours.

How Aerospace Companies Use InfluxDB

Over the past two decades, we’ve witnessed the instrumentation of virtually everything in the aerospace industry, from manufacturing floors to satellites orbiting Earth. And it’s no longer just NASA and other government organizations leading the charge. The commercial space industry has grown exponentially, with private companies developing everything from GPS satellites to electric VTOL aircraft.

Elastic and Microsoft partnership achievements in 2025

Highlights of another successful year of customer-centric collaboration. Once again, our partnership delivered an impressive year of innovation with Microsoft Azure, Azure AI Foundry, and Azure OpenAI. This blog highlights our continued collaboration with Microsoft to better serve customers throughout 2025 and our key moments at Microsoft Ignite.

Major Cloud Outages of 2025

Cloud outages in 2025 ranged from minor ones affecting some sections of users to major ones affecting hundreds or thousands of users. Services like Cloudflare and AWS, on which many other services depend, experienced outages whose effects cascaded to many of those dependents. Let's look at some of the major cloud outages in 2025.

Google SecOps Forwarder Deprecation: Migrate to Bindplane and OpenTelemetry

Google Cloud Security Operations is deprecating the legacy SecOps Forwarder, and OpenTelemetry with Bindplane is the official telemetry ingestion method. In this workshop, you’ll learn how to migrate from the SecOps Forwarder to Bindplane and OpenTelemetry Collectors, the officially supported ingestion model for Google SecOps going forward. We walk through the why, the what, and the how — with practical guidance you can apply immediately.

Agentic AI demands a new data architecture #ai #telemetry

Clint Sharp explains why traditional schema-on-read systems cannot handle the query loads of the future. Agentic telemetry requires a 360-degree view, but structuring data only when you read it is too slow for AI-driven workloads. The solution is using LLMs to drive the cost of building parsers to near zero. Tools like Copilot Editor allow teams to map data to OCSF instantly, effectively building factories of parsers to handle the scale of agentic AI.

Microsoft Teams outage - December 10th, 2025

On the morning of December 10, 2025, Microsoft Teams experienced a service disruption affecting users across Australia. Although Microsoft 365 users reported issues across several apps, the hardest-hit service was Microsoft Teams, which became completely unusable for many organizations. While Microsoft did not acknowledge the incident until 03:46 UTC, StatusGator identified the issue at 02:52 UTC through incoming outage reports and delivered an Early Warning Signal at 03:01 UTC.

AI-Powered Observability: From Reactive to Predictive

If there’s one thing clear from our AI-powered observability webinar, it’s that observability has officially graduated from a “nice-to-have” to a business-critical discipline, and AI is helping lead that charge. Our webinar brought together guest speaker Stephen Elliott, Group VP at IDC, and Ranbir Chawla, former SVP of Engineering at RB Global, for an hour of insights that mixed data, experience, and hard-won lessons from the trenches.

Home Assistant Hardware: Requirements and Recommendations

Choosing the proper Home Assistant hardware can be overwhelming. Whether you’re new to home automation or a seasoned pro, the hardware you select can make or break your experience. This comprehensive guide will demystify the requirements, delve into the various options, and help you make an informed decision. From the compact Raspberry Pi to the powerful Intel NUC, we’ve got you covered. So, strap in, and let’s dive into the world of Home Assistant hardware!

How to Build a Clear AI Implementation Strategy

Organizations see AI’s transformative potential, but success requires more than technology – it demands a clear strategy led by IT. A structured AI implementation roadmap aligns initiatives with business goals, establishes governance, and enables measurable ROI, while improving employee and customer experiences. Yet while 66% of organizations view AI as critical, only 38% report meaningful competitive advantage, highlighting the need for disciplined adoption.

Why Web Synthetic Monitoring Is Essential for Modern Web Performance

Your analytics dashboard is green, which indicates that your application is up 99.9% of the time, pages load in under three seconds on average, and conversion rates are stable. But here’s the uncomfortable reality: you’re probably missing 40% to 60% of the actual performance problems that impact real customers every day.

Bindplane Community Call in December 2025

Join us live on Wednesday, December 10th at 11am EDT for the December Community Call. We’ll cover hands-on demos of the new Bindplane features you’ve been asking for, recaps of KubeCon+CloudNativeCon NA in Atlanta, and new Bindplane feature guides and blog posts. As always, we’ll wrap with an interactive Q&A, so bring your questions!

Application Monitoring 101: Decoding Throughput: Understanding the Signals Between Spikes and Drops

Throughput is one of the most foundational metrics in application performance monitoring. It tells you how many requests your app is handling over time and offers a direct look at system load, responsiveness, and scalability. But throughput rarely speaks for itself. The key is knowing how to interpret it, and when to act. In this post, we’ll look at how throughput works in the real world: what healthy looks like, what broken looks like, and what lives in between.

Secure SSL Monitoring Software: A Complete Guide to Safe & Automated Certificate Management

Secure SSL monitoring software has become essential for every business that depends on websites, web applications, APIs, or cloud services. With increasing security threats, expired certificates, and hidden configuration mistakes, companies need reliable tools to ensure their SSL certificates stay valid, updated, and fully compliant. The right monitoring solution helps avoid service outages, failed transactions, and data breaches caused by unmanaged or forgotten certificates.

HTTP API vs REST API vs Web API: Architectures & How to Monitor Them

APIs power everything. From login flows to checkout systems to internal microservice communication. But as teams scale, so does the confusion around the terminology: HTTP API vs REST API vs Web API. Many articles treat these as interchangeable, but the differences are real, and they affect reliability, performance, caching behavior, authentication flows, and ultimately how you monitor your endpoints.

Grafana Labs: Top 10 moments of 2025

For Grafana Labs, 2025 was a year defined by innovation, growth, and the power of our community. We celebrated the release of Grafana 12 at our 10th annual GrafanaCON event, and marked major milestones across open source projects, including Mimir, k6, Beyla, Faro, and Alloy. It was also a year of taking bold steps forward in how teams interact with their systems and data.

How to Monitor VPN Performance for Remote Users

Remote workers depend on VPNs to access corporate resources. When VPN performance tanks, productivity stops. The problem? Most IT teams troubleshoot blindly. They can't tell if slow performance is caused by VPN encryption overhead, ISP issues, or corporate infrastructure problems. Here's the reality: Your remote workers are calling the help desk, saying "the VPN is slow", but you have no visibility into what's actually happening on their end. You're guessing. Maybe you ask them to restart their router.

A better way to monitor your AI agents in .NET apps

We launched agent monitoring earlier this year, allowing our users to instrument LLM usage and tool calls in their applications. However, we only had Agent Monitoring support for Python and JavaScript. We’ve been working on creating an Agent Monitoring SDK for .NET, specifically for Microsoft.Extensions.AI.Abstractions.

This Month in Datadog - December 2025

For our last episode of 2025, we’re focusing on Datadog releases announced at AWS re:Invent. Join Jeremy to see how you can manage logs at petabyte scale in your infrastructure, eliminate unneeded costs in Amazon S3 buckets, build agentic workflows, and detect credential leaks. Later in the episode, Scott spotlights how you can connect your AI agents to Datadog tools and context with our MCP Server.

Highlights from AWS re:Invent 2025: Making sense of applied AI, trust, and going faster

After four days of AWS re:Invent—a 65,000-step marathon that included 60,000 attendees spread across five Las Vegas campuses—and navigating the latest installment of this 13-year-old cloud pilgrimage, we’re all a little dehydrated but significantly wiser. The volume of announcements felt less like a single flood and more like a river branching into three powerful currents. Making sense of this massive technological convergence requires zooming out.

Planning a Smooth Cutover When You Change Critical Business Tools

A cutover is when you change from an old IT system to a new one. With technology advancing at a rapid rate, more and more businesses are learning about and implementing cutovers. If now's the time for you to execute a cutover, it's important that you plan everything effectively so that it goes smoothly. Planning a smooth cutover is easier said than done, however. There are some important things you need to know first. Until you conduct extensive online research, you're never going to be able to effectively plan a cutover.

3 Questions I Expect You to Ask Me

As a product specialist, I’ve had countless conversations about network observability. I’ve seen the good, the bad, and the downright confusing. The market is flooded with vendors, all claiming to have the magic bullet for your network woes. Everywhere I go, the story is the same. The neat and tidy world of the on-premises data center is gone, replaced by a sprawling environment that stretches across multiple clouds, your own facilities, and out to the edge.

Automate Weekly Rollbar Reports with Zapier + Google Sheets

In this video, we cover how you can use Rollbar, Zapier AI, and Google Sheets to create a completely automated reporting pipeline—one that generates weekly reports of Rollbar occurrences, organizes them in Sheets, and arms PMs with insights they can use to guide roadmap decisions, reduce risk, and improve user experience.

Docker Logs Command Reference: tail, follow, since Options

Managing Docker container logs is essential for debugging and monitoring application performance. Tailing Docker logs allows for real-time insights, quick issue resolution, and optimized performance. This guide focuses on efficient methods for tailing Docker logs, with clear examples and command options to streamline log management.

Observability trends for 2026: Maturity, cost control, and driving business value

The observability landscape has undergone a fundamental transformation over the past several years. For a recent report, The Landscape of Observability in 2026: Balancing Cost and Innovation, conducted by Dimensional Research and sponsored by Elastic, over 500 IT decision-makers were surveyed. It revealed that observability has definitively transitioned from an optional capability to a mission-critical business function.

Lightrun 'Runtime Context' Empowers AI Coding Agents to Build Software That Works in the Real World

Safe, Direct Access to Runtime Code Across Staging, Pre-prod and Production via MCP Enables Fundamental Step Forward in Autonomous Software Delivery and Reliability for Enterprises NEW YORK, December 10, 2025 – Lightrun, a leader in software reliability, today launched its new Model Context Protocol (MCP) solution, enabling the industry’s first fully integrated Runtime Context for AI coding agents.

Monitoring Node.js Express Application Performance with AppSignal

As your application scales to serve hundreds, thousands, or even millions of users, understanding its performance becomes essential. Performance monitoring helps you make informed decisions based on data instead of guesswork or user complaints. Imagine users reporting that your app feels "slow". Without proper instrumentation and monitoring, you're left troubleshooting blindly.

SSL Certificate Management: A Complete Guide to Monitoring SSL Expiry, Validity & Certificate Health

Managing SSL certificates is essential for maintaining trust, security, and uptime across any website or online service. While many people think SSL certificate management refers to renewing or issuing certificates, one of the most critical aspects, often overlooked, is monitoring certificates for expiry, validity, and unexpected changes. That’s the area where monitoring platforms provide their highest value.

Automate Weekly Rollbar Reports with Zapier + Google Sheets

Product Managers thrive on clarity. But when it comes to understanding application errors and trends, Rollbar’s rich occurrence data can sometimes feel overwhelming. With AI by Zapier + Google Sheets, you can turn this into a completely automated reporting pipeline—one that generates weekly reports of Rollbar occurrences, organizes them in Sheets, and arms PMs with insights they can use to guide roadmap decisions, reduce risk, and improve user experience.

Sage AI: Dashboard, events, knowledge base

It's starting to take shape. We have a dashboard, we're collecting some metrics, and I'm getting a daily briefing every morning. Also, I have an event log where all the events are going into (the spine of the system), and there's a knowledge base which consists of a GitHub repository which is vectorized and indexed. Its first use is adding context to Herald, the agent that sends me the morning briefing. More details to come.

Prioritizing Bugs with Sentry Logs

Learn how to use Sentry Logs to measure how often a bug occurs and which users it impacts. In this example, a React Native app with an Express.js backend crashes when the diet value becomes undefined. After identifying the root cause, we use Explore Logs to count how many times users switch their diet to “none,” filter the related log messages, and group results by user type to understand the impact.

Kentik in Motion: How AI Transforms Network Chaos to Clarity

Learn how artificial intelligence is transforming network operations through Kentik's AI Advisor platform. Philip Gervasi and Sean McGinley discuss the evolution from traditional network visibility to network intelligence, emphasizing that AI should augment, rather than replace, network engineers. They demonstrate how Kentik's AI Advisor uses natural language interfaces to perform automated root cause analysis, troubleshooting, and cost optimization.

Runtime Context for AI Agents with Lightrun MCP

Introducing Runtime Context for AI agents: the next evolution in autonomous software development. The Lightrun MCP connects IDEs and AI assistants to real runtime data, giving agents and developers the context they need to write, validate, and debug code with confidence. Reliable, AI-accelerated engineering starts here.

Agentic AI by Design: Evolving Our Principles for the Next Chapter of Responsible AI

Join SolarWinds CISO Tim Brown and CTO Sai Krishna for the SolarWinds Day Closing Keynote, where they share how SolarWinds is evolving from Secure by Design to AI by Design—a bold next step in building trusted, intelligent, and future-ready IT operations. As organizations adopt AI-driven systems, embedding trust, transparency, and accountability into product development becomes essential. In this forward-looking discussion, Tim and Sai reveal how the AI by Design framework ensures responsible AI adoption while enhancing performance, reliability, and security.

Become a 10x investigator with Cribl Notebooks

Cribl Notebooks aims to streamline the investigation process by bringing everything into a single interactive interface. It functions as a virtual war room where teams can collaborate in real time. You can view AI queries and code alongside charts without switching between scattered tabs or workstations. This persistence makes it easier to document the root cause and share the story behind the data.

How Datadog Manages 50,000 Apache Iceberg Tables at Scale

Think managing a few database tables is hard? Try 50,000 production Iceberg tables storing petabytes of data with 8 million scans per day. In this clip, Datadog's platform team reveals the architecture choices behind their managed Iceberg implementation that serves hundreds of internal engineering teams.

Datadog at AWS re:Invent, Bits AI SRE, MCP Server, CloudPrem, and more | This Month in Datadog

Get a closer look at features we announced at AWS re:Invent in the latest episode of This Month in Datadog. Tune in for spotlights of Bits AI SRE, now generally available, and Datadog’s MCP Server, which connects AI agents to our platform by ingesting prompts and mapping them to Datadog resources and data. This Month in Datadog brings you the latest updates on our newest product features, announcements, resources, and events.

Fixing Performance Issues Fast with Logs & Tracing

Learn how to quickly track down performance bottlenecks using Sentry Logs and Tracing. In this video, we walk through identifying a slow screen, jumping into the connected trace, and pinpointing slow backend steps, database calls, and AI/LLM operations. See how logs, issues, and traces work together to show the full picture of what happened in a single session.

Expose Hidden State Bugs with Sentry Logs

See how Sentry Logs can surface hidden state bugs that stack traces alone can’t explain. In this walkthrough, we debug a React Native app with an Express.js backend where a missing diet value causes a crash. We inspect the issue, pull in the connected logs, and confirm whether the problem comes from an initial render or from real backend data. By combining issues, traces, and logs from the same session, you get the full story—and a faster path to the fix.

Introducing Workspace: Where DEX Work Happens

Today marks another milestone for Nexthink as we introduce a powerful evolution of our platform, one that will meaningfully expand how customers derive value and empower many more teams across IT, HR, and the business to use Infinity. Welcome to Workspace: a new destination where the future of DEX and IT work comes together.

Building a Stronger Defense with Network Observability and Real-Time Monitoring

In today's rapidly evolving digital landscape, the importance of network security and performance has never been more pronounced. Businesses are increasingly relying on their network infrastructure to support a wide array of critical applications, services, and user activities. As cyber threats become more sophisticated and network architectures more complex, maintaining visibility into network performance and security is essential. This is where a network observability platform becomes indispensable.

FinOps Insights for IT Leaders

FinOps insights for IT leaders often focus on cloud spend, but IT leaders know that real cost drivers extend across hybrid environments. Achieving clarity requires more than budget reports. It requires understanding how workloads behave over time, how performance and capacity shift, and where visibility gaps hide operational and financial risk. To support those efforts, we sat down with Tim Conley, creator of Galileo, to explore practical FinOps insights for IT leaders.

How to Track Down the Real Cause of Sudden Latency Spikes

Start with distributed tracing to find which service is slow, then use continuous profiling to see why the code is slow, and finally apply high-cardinality analysis to identify which users or conditions trigger the problem. It's 2 AM. Your phone buzzes. Users are reporting timeouts. The metrics dashboard shows p99 latency spiking from 200ms to 4 seconds, but everything looks normal—CPU at 60%, memory stable, no error spikes. A quick pod restart helps briefly, then latency climbs right back up.

Elastic's move to free on-demand training

Students can now learn what they need within the Elastic stack anytime. The Elastic Training team has shifted its on-demand training strategy from paid to free! Yes, you heard that right — complimentary on-demand training is now readily available to everyone. The Elastic Training team is continuously developing and releasing bite-sized training modules designed to align with Elastic solutions and highlight key features.

Bindplane in 12 Minutes: A Complete Overview of the Telemetry Pipeline for OpenTelemetry at Scale

Bindplane is a unified telemetry pipeline that helps teams cut observability spend by 50% or more. In this overview, you will learn how to route telemetry from any source to any destination, manage large fleets of OpenTelemetry Collectors, and gain real visibility into collector health, state, throughput, and routing behavior. 

How to Check SSL Certificate Expiration Date: Complete Guide to SSL Monitoring

SSL certificates are critical for securing websites, web applications, and APIs. They encrypt data in transit, verify server authenticity, and build user trust. However, SSL certificates have a limited lifespan, typically ranging from 90 days to one year. When a certificate expires, visitors encounter security warnings, some services stop working, and it can affect search engine rankings. Monitoring SSL certificate expiration is essential to maintain secure and uninterrupted online services.

Ultimate Guide to DevOps API Monitoring for Modern SaaS Teams

APIs form the operational backbone of SaaS platforms. They authenticate users, deliver application data, process transactions, and connect multiple services into a cohesive ecosystem. When an API slows down or fails, the impact is immediate: login delays, frozen dashboards, broken customer workflows, and degraded user experience. For DevOps teams, this means monitoring must go far beyond checking status codes.

Part 2: What If Automation Didn't Just Execute Tasks but Earned Our Trust While It Worked?

Every leap forward in technology begins with a question that feels almost human in its curiosity. In this series, we’re examining those questions, the ones that reveal where intelligence meets intention. If data was the foundation of understanding in our first conversation, automation is where that understanding begins to act.

Datadog on Apache Iceberg

Historically, Datadog has relied on technologies like Snowflake and Apache Spark on raw parquet files (lacking consistent table structure) to power internal analytics and data science at scale. As usage grew across product teams, more features depended on data science teams, and our datasets grew to include more telemetry data, these systems became complex to manage and govern both technically and financially. The need for a more flexible and scalable solution led Datadog to adopt Apache Iceberg, an open source table format for data lakes that brings reliability and performance while remaining SQL-friendly.

Configuring the Alerting Plugin in InfluxDB 3

Monitoring starts with data, but action depends on timely alerts. When an alerting workflow relies on scheduled queries or external checks, engineers miss short windows where values shift and conditions form. The alerting plugin closes that gap by evaluating alert rules inside InfluxDB 3 as new values arrive, enabling faster detection and more responsive monitoring.

Bindplane | Notifications

Real-time alerts for your telemetry pipelines are here. In this quick overview, you’ll learn about the new Notifications panel in Bindplane. This update gives you real-time visibility into key changes across your configurations, fleets, and agents so nothing slips through the cracks. This new feature centralizes alerts you’d otherwise miss, making Bindplane easier to operate at scale. Email, Slack, and webhook notifications are also on the way.

Keep service ownership up to date with Datadog Teams' GitHub integration

Engineering organizations depend on clear team ownership to maintain reliable services and move quickly. But as codebases expand and teams shift, answering basic questions—Who owns this service? Who should be paged in an incident? Are teams meeting operational standards?—becomes harder.

Web API Sample Endpoints to Practice Monitoring & Testing

APIs rarely fail in isolation. They fail under load, during token refresh, when a dependent service slows down, or when a multi-step workflow breaks halfway through. And yet most engineers still test and monitor APIs using mock endpoints that behave nothing like the real thing.

Monitor One Icinga 2 Cluster From Another

Icinga is designed to be a highly dynamic monitoring software that can monitor your setup, regardless of its architecture. While most setups are hierarchical and fit well into the master, satellites, and agents scheme with different zones, it is sometimes impractical or impossible to create one large Icinga 2 cluster. Imagine that you are responsible for only some hosts within another organization.

HPE OpsRamp Software Named a Major Player in the IDC MarketScape for Worldwide Observability Platforms 2025

Observability platforms help IT teams continuously monitor service health and performance, driving superior service quality and customer experience. Access to deeper diagnostics and actionable insights from observability tools lets IT operators drive scalability, resilience, and service reliability across complex, distributed environments.

M-Dashes, the Cookie Monster & DEX: The BIG Reality Bites 2025 Finale

It’s our favorite Reality Bites tradition: the end-of-year panel! Tom and Tim bring the whole crew together—Megan, Ariana, Sean, and Dina—for a joyful, honest, and insight-packed reflection on 2025. From global travel and AI breakthroughs to personal milestones, hard-won lessons, and the music that carried us through the year, the team shares what defined a transformative moment for DEX, for Nexthink, and for each of us. Expect candid takes on AI balance, ambition, slop, mediation, vibe-coding, human connection—and a full round of “song of the year” picks from the whole panel. A warm, funny, heartfelt wrap to a huge year.

Ep 22: re:Invent recap

In this episode of Masters of Data, we're breaking down AWS re:Invent 2025 through David's eyes (and probably a few cups of conference coffee). We dive into the massive crowds, killer customer conversations, and product demos that actually worked—because we're all about building real tech, not smoke-and-mirrors clickbait. David geeks out over Mobot, our AI tool that's making workflows smoother (not just another chatbot in disguise), and how attendees couldn't get enough of the live demos. We also throw some shade at the AI-washing epidemic and dig into why practical AI applications in security and observability actually matter.

Introducing MetrixInsight for XenServer SCOM Management Pack

Citrix XenServer is increasingly becoming the strategic hypervisor of choice for organizations running Citrix VAD and DaaS workloads. With XenServer Premium Edition now included in Citrix subscriptions, it offers a more aligned, predictable, and cost-effective platform, without compromising on stability, performance, or capabilities. A critical part of enabling that transition is delivering the right level of monitoring and operational control.

Unified network performance monitoring reports for compliance

Compliance audits can be stressful when your performance data and configuration logs live in separate tools. Site24x7 brings everything together in a single view, helping you track every device, configuration, and compliance status in one place. Unified reports make it easy to trace what changed, when it changed, and who changed it—giving you a clear line of sight for every audit and investigation.

Why FedRAMP In Process Matters for Federal Customers

Chris Ebley from Blackwood explains why FedRAMP In Process is a major milestone. It gives federal teams confidence that the product can handle sensitive data, meets strict security controls, and comes from a company committed to operating at the maturity level the government expects. This opens new go-to-market opportunities and makes it easier for agencies to move forward with Cribl.

Coralogix in G2 Winter 2026: Momentum, Progress, and 192 Badges

As we wrap up 2025 and slowly come down from the re:Invent high, we’ve got one more reason to keep the celebration going. Coralogix has earned 192 badges in the G2 Winter 2026 reports and secured a position in the Momentum Grid Report for Observability Software. It is a strong finish to the year and a clear reflection of the steady progress the platform has been making.

Why should you demand OpAMP support from your vendor?

Fleet management is the practice of monitoring and configuring your fleet of agents and collectors. Fleet management is the hallmark of an organisation that has realised the great importance of a healthy telemetry pipeline, and has taken steps to ensure that collectors & agents are every bit as robust as the production architecture for which they are responsible.

Why Cribl Lake Delivers the Best Price Performance for AI Workloads #ai #telemetry

CMO Abby Strong explains how Cribl Lake is built for the real demands of modern AI. You get fast storage for high performance workloads and efficient architecture that scales without blowing up your budget. A smarter foundation for the AI era.

Seeing Everything: Shedding Light on Shadow IT and AI Usage

I still remember working with a leading insurance provider on an internal review of their IT estate and discovering a team quietly using an unapproved SaaS tool to speed up their reporting. It wasn’t malicious; they were trying to solve a problem faster. But as we stared at the dashboard, I could see the CIO’s mind racing: What data had they uploaded? Was it encrypted? Were they still compliant?

Bindplane in 200 Seconds: Windows Event Logs & Google SecOps

Learn how to configure Bindplane to collect and route Windows Event Logs from a Windows VM into Google SecOps. In this 200-second onboarding walkthrough, Chelsea shows how to build and configure a full SecOps-ready pipeline in just a few minutes. You’ll see how to: create a configuration; add the Windows Event Log source; configure the Google SecOps destination; roll out the configuration to an agent running on a Windows VM; and start receiving security telemetry inside SecOps.

Using Traces, Metrics, and Logs All in One Place, as Demonstrated by Pipeline Builder

When troubleshooting complex software, it’s important to be able to gain insight via its telemetry quickly and precisely. No one wants to waste time switching between tools or worrying about how to interact with different types of data. At Honeycomb, all your data is available in one place, accessible via our fast query engine. But what does that look like in practice?

Meet Web Vitals Performance Issues

We’ve introduced a new type of Performance Issue: Web Vitals Performance Issues. These issues will be opened for the highest opportunity pages in your application if your Web Vitals metrics drop into our meh or poor thresholds for performance. We’ve built these issues with Seer Issue Fix specifically in mind. Our goal is not just to alert you about low vitals scores; we want to give you actionable steps you can take to improve your scores and, when possible, fix the problem for you.

Bindplane Onboarding | Install Your First OTel Collector & Send Windows Events to Google SecOps

In this 10-minute step-by-step walkthrough, Chelsea from the Bindplane Customer Success team shows you how to install your first Bindplane OpenTelemetry Collector and start sending Windows Event telemetry from a Windows VM directly into Google SecOps.

Solve bandwidth issues quickly with NetFlow reports

Gain complete visibility into your bandwidth usage with network traffic monitoring reports in Site24x7. In this video, we walk you through the key reports that turn raw traffic data into actionable insights, helping you troubleshoot issues faster, optimize bandwidth, and strengthen security. With these reports, you’ll always know what’s happening on your network and how to respond before minor issues escalate.

AI-Driven Database Monitoring for Modern IT Teams | Site24x7

Databases power every business, but keeping them fast, reliable, and scalable is a daily challenge for IT teams. Discover how intelligent database monitoring helps you uncover performance bottlenecks, optimize queries, and maintain database health effortlessly. Whether you manage SQL or NoSQL systems, gain actionable insights across your infrastructure before issues affect your applications or users.

Transaction Check Basics in less than 3 minutes

In this video, we explore the basics of Transaction Checks on Uptime.com, an advanced multi-step monitoring tool for website elements. Learn how to create customized scripts to mimic user actions such as visiting a site, filling out forms, and clicking buttons. We walk through a step-by-step guide on setting up a Transaction Check to monitor a login process, including navigating to a URL, validating HTTP status codes, and using browser developer tools to configure field entries. Discover different monitoring intervals and tips for organizing your checks with tags and location settings.

Cloudflare was down again: Here's what happened.

On December 5, 2025, the internet faced another major disruption – the second significant Cloudflare-related outage in just a few weeks. A similar widespread incident occurred on November 18, which we covered in detail in our post The internet broke again – StatusGator can help. Today’s outage reinforces how quickly issues within core internet infrastructure can ripple outward and impact thousands of services simultaneously.

What Services Are Not Downdetector Alternatives - And Why StatusGator Actually Is

Search for Downdetector alternatives on Google, ask ChatGPT or any AI assistant, and you’ll usually get a list of tools like Datadog, Site24x7, New Relic, Atera, and other monitoring platforms. There’s just one problem: The AI-generated answers continue to lump these monitoring tools together, creating confusion for IT teams and muddying the category. This article exists to set the record straight.

Towards a more resilient StatusGator

Between October 20 and December 5, 2025, a rapid succession of major outages across multiple cloud providers disrupted large portions of the internet. Each of these events affected StatusGator in different ways. After each incident, we implemented improvements to strengthen our reliability. This post summarizes the impact of each outage, the changes made, and the architectural work now underway to ensure StatusGator remains available during the moments when it is needed most.

Which Observability Tool Helps with Visibility Without Overspend

If you’re trying to control observability spend without cutting visibility, the platforms that usually offer the best cost balance at enterprise scale are Last9, Grafana Cloud, Elastic, and Chronosphere — depending on the shape of your telemetry and the level of operational ownership you want.

Rollbar + Zapier AI: Automatically Generate Clear, Actionable Jira Tickets

How do you turn raw error payloads into clean, meaningful ticket summaries without touching a line of code? Engineering teams rely on fast, accurate error context to resolve issues efficiently. Rollbar does a great job capturing rich payload data at the moment an error occurs, but getting that data into your issue-tracking workflow can still require manual triage—especially if you want clean, human-readable summaries in Jira.

AI Agents Need Structured Telemetry. Are You Preparing? #telemetry #ai

Clint Sharp breaks down the shift from traditional observability to AI ready telemetry. Agents need well formed fields, consistent schemas, and predictable data models. If your environment is full of unstructured logs, agents will give inconsistent answers. The work starts now so your AI future can actually deliver value later.

Browser Monitoring Software: A Complete Buyer's Guide for Modern Web Applications

Modern web applications rely on complex front-end frameworks, APIs, and third-party services to deliver seamless user experiences. Even minor performance issues—slow load times, broken workflows, or browser-specific errors—can lead to lost conversions, frustrated users, and reputational damage. Browser monitoring software provides IT teams, developers, and business stakeholders with visibility into application performance from the end-user perspective.

AI Is Growing Your Data Faster Than Your Budget #telemetry #ai

Clint Sharp explains why data is growing at a 30% CAGR while budgets stay flat. Teams are already running infrastructure at 80 to 90% capacity, and AI agents multiply query volume by ten or fifty. What got you to 2025 will not get you to 2035. You need a new approach to handle AI scale without blowing up cost.

Monitoring Client-Side Routing Frameworks: SPA, CSR & Hybrid

Modern web applications have shifted their center of gravity. The page is no longer the system— the runtime is. Frameworks like React, Angular, Vue, Next.js, SvelteKit, Remix, and Nuxt treat HTML as a bootloader, and the real application emerges only after hydration, routing, data fetching, and continual re-rendering. What users experience depends entirely on JavaScript execution, not static markup. Teams usually discover this shift when the UI appears to load but nothing works.

Making Sense of Complex Data in Observability Tools

Metrics, analytics, measurements, and parameters – can we truly see these abstractions? Data visualization helps us do just that, bridging the gap between raw information and human comprehension. Visualizing data is like rafting down a river – dynamic, unpredictable, and full of discoveries along the way. In this guide, we’ll explore how to craft visualizations that inform, engage, and inspire. So, grab your paddle and hop aboard!

Visualising Sentry analytics with SquaredUp

Sentry is a mature observability product with SDKs supporting nearly every major programming language. It has expert knowledge of each coding stack and is therefore capable of offering rich insights with a minimum of initialisation required by the developer. You don’t need to set up OpenTelemetry collectors or wrestle with endpoint configurations; simply drop the SDK initialisation into your application start-up process and telemetry begins flowing into the Sentry backend.

New Vehicle Monitoring Capabilities In The Works For 2026

Technology for keeping track of vehicles is advancing and companies are gaining more control and oversight. But the project isn't yet complete. There's still room to improve. In 2026, we expect all sorts of new advancements to take center stage in the business world. These will offer managers new capabilities and allow them to really increase productivity to levels they never imagined.
Sponsored Post

IT Ops vs DevOps: Same Goal, Different Mindset

The debate around IT Ops vs DevOps often creates confusion about whether these are competing approaches or complementary ones. While both aim to deliver reliable, efficient technology services, they approach this goal from fundamentally different perspectives. Understanding these differences helps organizations build stronger technology teams and choose the right operational model.

Key Metrics Your Browser Monitoring Software Should Track

Modern web applications rely on seamless user experiences, fast load times, and reliable performance across every device and region. Browser monitoring tools make these features possible by tracking how real web browsers interact with your site, revealing issues long before users notice them. To ensure your monitoring setup captures everything that matters, here are the five essential metrics every browser monitoring solution must track.

Why Remote Work Just Works - Hear It From Our Grafanistas

Several Grafanistas talk about their remote work experience at Grafana Labs. Being remote-first enables our team to be based where they feel most productive and to ensure that work and life aren't in competition. And remote-first is *not* remote only. Grafanistas enjoy the opportunity to come together during team offsites or in shared co-working spaces. Connection is important.

7 Senior-Level AI Debugging Tools Compared

Every dollar spent on engineering is a bet on the future. But look at your engineering team's sprint backlog and you’ll see a non-trivial amount of that capital is spent on repairing the past. For the last ten years, if you asked a VP of Engineering what the solution was, the answer was always the same: better monitoring. Throw more telemetry at the wall. Build a bigger dashboard. Send more alerts at 3 AM. It was the only available tool, so it became the entire thesis.

Explaining Icinga Director for Practitioners Webinar Recording

Starting from a clean installation, we will guide you through the complete setup process and create a first monitoring configuration together. You will learn how to navigate the Icinga Director interface, discover its main features, and see how automation can simplify your daily work through data imports and synchronization rules.

OTel Updates: Unroll Processor Now in Collector Contrib

Some log sources bundle multiple events into a single record before shipping them. This is common with VPC flow logs, CloudWatch exports, and certain Windows endpoint collectors. While this batching approach is efficient for transport, it creates challenges when you need to filter, search, or correlate individual events. When a log record contains an array of 47 events, your analytics tool sees one entry instead of 47 distinct records.
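
To make the idea concrete, here is a minimal Python sketch of what "unrolling" means: one record carrying an array of events becomes one record per event, with the shared fields copied onto each. This is not the Collector processor's implementation or configuration; the `unroll` function and the `source` and `body` field names are hypothetical and chosen only for illustration.

```python
from copy import deepcopy


def unroll(record, field="body"):
    """Expand a record whose `field` holds a list into one record per element."""
    events = record.get(field)
    if not isinstance(events, list):
        return [record]  # nothing to unroll, pass the record through unchanged

    # Everything except the bundled field is shared by all resulting records.
    shared = {key: value for key, value in record.items() if key != field}
    unrolled = []
    for event in events:
        clone = deepcopy(shared)
        clone[field] = event
        unrolled.append(clone)
    return unrolled


# Hypothetical bundled record with two events in one entry.
bundled = {"source": "vpc-flow", "body": [{"action": "ACCEPT"}, {"action": "REJECT"}]}
records = unroll(bundled)
```

After unrolling, each element of `records` can be filtered, searched, or correlated on its own, which is the property the bundled form lacks.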

Understanding How a Log Correlation Engine Enables Real-Time Insights

Tax season is notoriously most people’s least favorite time of year. For people who complete their own tax returns, the process becomes an agonizing one of looking at small pieces of paper, matching numbers to the lines that ask for information, and comparing various inputs. In essence, doing your taxes makes you a correlation engine. Now, imagine taking this tedious process and applying it to the terabytes of data that your environment generates daily.

Send OpenTelemetry traces and logs from Cloudflare Workers to Grafana Cloud

Cloudflare Workers is a developer platform for deploying serverless functions, frontends, containers, and databases to a global network, spanning 330+ cities around the world. However, as your application scales, it becomes crucial to have the right observability tools to investigate issues, monitor performance, and get alerts when issues arise. Last month, Cloudflare Workers announced support for exporting OpenTelemetry logs and traces, letting you send this data directly to Grafana Cloud.

Use Database Monitoring in Splunk Observability Cloud to Identify and Resolve Slow Queries

In this video, I introduce Database Monitoring in Splunk Observability Cloud. I'll demonstrate how to spot and resolve slow queries by leveraging rich metrics and correlating database performance directly with traces in Splunk Observability Cloud APM.

A Week of Insight, Connection, and Innovation at Gartner IT Symposium/Xpo in Orlando

Gartner IT Symposium/Xpo is always a standout experience for ScienceLogic, and this year’s event in Orlando was no exception. The event brought together seasoned IT leaders, analysts, and solution providers, creating a dynamic hub for meaningful conversations, hands-on demos, and translating future-driven insights into action. More than being honored to attend, ScienceLogic thrives on engaging with IT leaders on the show floor, in sessions, and throughout the event.

Automate infrastructure operations with Datadog Infrastructure Management

Many organizations struggle to track how their cloud infrastructure changes over time. Modern environments span tens of thousands of resources across hundreds of accounts and multiple clouds. Application teams add new services and regions at a rapid pace, increasing the number and variety of resources that need to be managed. These shifts can cause infrastructure configurations to drift from a well-architected state, increasing the risk of service reliability issues and unexpected cloud spend.
Sponsored Post

Adding a CDN to a load balancer (for a much faster website)

Here at Raygun, we like to go fast. Really fast. That's what we do! When we see something that isn't zooming, we try to figure out how to make it go faster. So today, we're answering a simple (and relevant) question: how do we make our public site, raygun.com, much, much faster? The answer, at first glance, is simple: we build it into a Content Delivery Network (CDN). But what if you have a load balancer serving your website, and you don't want to rebuild everything to serve from a CDN? Well, that's more complicated. Let's start by describing the issue.

Shopify Cyber Monday outage - December 1, 2025

On December 1, 2025, Cyber Monday, the biggest online shopping day of the year, Shopify suffered a widespread outage that left many merchants unable to access their stores or process orders. At a time when every minute of uptime translates directly into revenue, the disruption caused immediate concern across the ecommerce community. StatusGator detected the issue within minutes, sending an Early Warning Signal 10 minutes before Shopify published its official acknowledgement.

How Browser Monitoring Tools Improve Application Reliability and End-User Experience

Browser monitoring tools, also known as Real User Monitoring (RUM) solutions, enhance application reliability and end-user experience by providing detailed, real-time visibility into how users interact with web applications. These tools track key performance metrics, identify front-end errors, and help development and DevOps teams detect and resolve issues that directly impact users before they escalate.

What the Octopus Can Teach Us About AI (w/ Steve Wunker)

Tim and Tom sit down with Steve Wunker — Managing Director of New Markets Advisors, author, and early pioneer of the smartphone — to explore the big ideas behind his latest book, AI and the Octopus Organization. Steve breaks down why AI shouldn’t just “bolt onto” old processes, how distributed intelligence reshapes the firm, and what leaders can learn from one of nature’s most adaptable creatures. From organizational plasticity to the changing role of middle managers, Steve offers a pragmatic roadmap for thriving amid rapid AI-driven transformation.

Shift Happens: How to Make Your ITSM Incidentally Awesome

A modern service desk goes far beyond basic ticketing, serving as the central engine for IT operations. This THWACKcamp session from SolarWinds Day reveals how to streamline and standardize ITSM workflows, transforming the service desk into a strategic asset that eliminates administrative headaches. SolarWinds Sr. PMM Lauren Okruch and THWACK MVP Jeremy Mayfield, Director of IT at National Sugar Marketing, explore how modern service desks go beyond ticketing to become the hub of IT operations.

Drowning in Alert Fatigue? How to Regain Control of Your Monitoring

If you’ve ever muted your phone during a maintenance window, only to miss a real outage an hour later, you’re not alone. Sysadmins on Reddit and beyond often describe feeling like they’re drowning in alerts: So many notifications that the important ones lose their meaning. This is alert fatigue, sometimes called notification fatigue or incident noise, and it’s one of the most common challenges in modern, growing IT operations.

Cribl and Cloudflare give you full network visibility with real time telemetry

Glenn Block explains how the new Cloudflare source and R2 destination in Cribl Stream lets you ingest WAF, DNS, and Zero Trust logs for full visibility and real time intelligence. Better security, better performance, and lower cost for modern IT and security teams.

The Performance Revolution in JavaScript Tooling

Over the last couple of years, we've witnessed a remarkable shift in the JavaScript ecosystem, as many popular developer tools have been rewritten in systems programming languages like Rust, Go, and Zig. This transition has delivered dramatic performance improvements and other innovations that are reshaping how developers build JavaScript-backed applications.

5 Network Issues That Affect Remote Offices (Not HQ)

Your headquarters runs flawlessly. Zero network complaints. But your remote offices? Constant connectivity problems, dropped video calls, and frustrated employees filing help desk tickets you can't solve. Remote offices experience 3x more network issues than headquarters, yet most IT teams have zero visibility into what's actually failing.

kubectl logs Command Reference and Documentation

The kubectl logs command retrieves container logs from Kubernetes pods. It supports real-time log streaming with -f, time-based filtering with --since, viewing previous container instances with --previous, and accessing logs from specific containers in multi-container pods using -c.
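
As a rough illustration of how those flags combine, the Python sketch below assembles a `kubectl logs` argument list from the four options named above. The helper name `kubectl_logs_cmd` and the pod and container names are hypothetical; only the flags themselves (`-f`, `--since`, `--previous`, `-c`) come from the command described here.

```python
def kubectl_logs_cmd(pod, follow=False, since=None, previous=False, container=None):
    """Build the argument list for a `kubectl logs` invocation."""
    cmd = ["kubectl", "logs", pod]
    if follow:
        cmd.append("-f")                 # stream new log lines as they arrive
    if since:
        cmd.append(f"--since={since}")   # relative duration such as "1h" or "15m"
    if previous:
        cmd.append("--previous")         # logs from the previous container instance
    if container:
        cmd += ["-c", container]         # pick one container in a multi-container pod
    return cmd


# Hypothetical pod "web-0" with a container named "app".
cmd = kubectl_logs_cmd("web-0", since="1h", container="app")
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where `kubectl` is installed and configured for the target cluster.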

What's new in the Grafana Image Renderer: higher-quality results, security enhancements, and more

Whether it’s for an email or that upcoming presentation, many Grafana users like to share their favorite dashboards or panels outside of Grafana itself. The Grafana Image Renderer is a backend service for Grafana that helps you do just that by rendering panels and dashboards as images, such as PNGs and PDFs, via a headless browser. It’s commonly used to support Grafana features like exporting dashboards, generating images for alert notifications, and creating PDF reports.

You've Found the Waste In Your Network Operations. Now What?

In a previous blog, we looked at your network operations through the lens of lean principles. We exposed the seven wastes that quietly drain your budget and burn out your teams. This constant cycle of reactive firefighting comes with a steep price. We outlined a concept in quality management known as the Cost of Poor Quality (COPQ), the total financial impact of wasted engineering hours, lost user productivity, and business risk.

9 Third-Party Risk Monitoring Tools That Actually Cut Vendor Assessment Time

Nearly one in three cyber breaches now start with a supplier, McKinsey found in 2024. A single vendor review cycle often spans 3 to 5 weeks due to manual evidence chasing, according to Forrester's 2024 State of Third-Party Risk Report. And a May 2025 Gartner brief warns that this "perfect storm" of attacks, supply-chain shocks and new regulations is forcing boards to modernize third-party risk, fast.

Using the Downsampling Plugin in InfluxDB 3

Modern systems generate huge volumes of time series data. Advances in hardware and edge instrumentation enable sensors and applications to capture new values every second—or faster—which makes high-frequency measurement easy and affordable. When applied effectively, this steady flow of data reveals early warning signs, highlights subtle performance shifts, and helps teams understand how systems behave in real-time.

Part 1: What If Data Wasn't Just the Fuel for AI but the Foundation of Everything It Knows?

Every breakthrough begins with a question. What if we looked beyond today’s tools, buzzwords, and hype and examined the design principles shaping tomorrow’s intelligent enterprises? The What If series explores those inflection points: moments where technology meets human judgment, where automation meets accountability, and where AI begins to resemble something more like understanding than output.

Better Together: Building the Self-Healing Enterprise

When technology slows, everything does. Guests wait to check in. Travelers queue at kiosks. Shoppers refresh the page, hoping the payment goes through. Every second of downtime costs companies millions and frustrates millions more. LogicMonitor and Catchpoint have been solving that problem from different sides: one focused on the systems and infrastructure that keep businesses running, the other on the experiences and performance that users actually feel.

Observability in the AI age: Datadog's approach

Ten years ago, Datadog was a single-product company focused on breaking down the silos between dev and ops. As the shift towards the cloud accelerated and organizations transitioned to the new DevOps model, we set out to develop an observability platform that would enable these teams to safely scale faster and answer the essential questions about their services: are they available, secure, compliant, performant, and cost-efficient?

A New Chapter: LogicMonitor + Catchpoint - A Personal Note from Mehdi

In 2008, I was sitting in my garage office with a simple but stubborn idea: the Internet deserved better. End users deserved better. Companies needed a way to truly understand what their customers were experiencing, not just what their servers were reporting. Digital Experience Monitoring wasn’t a category yet. But the need was unmistakable. That idea didn’t come from theory or ambition. It came from lived experiences.

Optimize Kubernetes cluster cost with Datadog Cluster Autoscaler

Running Kubernetes at scale almost always means paying for more compute than you need. To protect reliability, platform and application teams typically overprovision nodes early in development and keep scaling up as they add features and workloads. They are often reluctant to move to smaller or different instance types without a clear picture of how those changes will affect performance or availability. The result is a fleet of underutilized nodes that silently inflate your cloud bill.

Top Browser Monitoring Features Every DevOps Team Should Prioritize in 2026

In 2026, digital performance is more critical than ever. Users expect web applications to load instantly, respond flawlessly, and support complex interactions without delay. For DevOps teams, this means browser monitoring is no longer optional—it’s a foundational capability for ensuring availability, speed, and reliability across modern web experiences.

Patterns for Deploying OpenTelemetry Collector at Scale

So, you've embraced OpenTelemetry, and it's been great. Pat, Pat. That single, vendor-neutral pipeline for your traces, metrics, and logs felt like the future. But now, the future is getting bigger. That simple OTel Collector configuration that worked perfectly for a few services is starting to show its limits as you scale. The data volume is climbing, reliability is becoming a concern, and you're wondering if that single collector instance is now a bottleneck waiting to happen.

Datadog Bits AI SRE: Your new teammate for on-call shifts

Bits AI SRE is an always-on SRE agent built to handle complex troubleshooting and late-night alerts. Developed against thousands of real-world incidents and powered by Datadog’s platform, Bits AI SRE analyzes your entire stack, tests hypotheses, and identifies root causes in minutes. Resolve faster, get back to sleep sooner, and give your on-call team the confidence and capacity they need.

Optimize Your Oracle Cloud (OCI) Spend with Datadog Cloud Cost Management

Support for Oracle Cloud Infrastructure (OCI) is now live in Datadog Cloud Cost Management. In this short demo, you’ll learn how to: Get granular visibility into OCI cost and usage—by service, compartment, tag, and resource tier. Uncover savings opportunities by combining cost data with observability metrics like CPU, memory, and storage utilization. Set up anomaly monitors and budgets to avoid cost overruns—especially for high-risk workloads like AI and GPU training.

Contextual, in-product guidance for every Grafana user: A closer look at Interactive Learning

As developer advocates at Grafana Labs, we’re always looking for new ways to help our users better understand and learn observability. You might remember our previous project that brought learning to life through an adventure-style game, and now we’re really excited to share something else we’ve been working on: Interactive Learning, a new way to get the technical help you need directly in Grafana.

New Feature: Filter HTTP Pings by Keywords

Healthchecks.io can now classify HTTP pings from clients as start, success, or failure signals not only by URL suffixes (no suffix, /start, /fail, /{exit-status}) but also by looking for specific keywords or phrases in the HTTP request body. The content filtering feature was already available for email pings, and now it has been extended to HTTP pings as well.
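
The general technique of keyword-based classification can be sketched in a few lines of Python. This is an illustration only, not Healthchecks.io's code: the `classify_ping` function, the example keywords, the case-insensitive matching, and the order in which signals are checked (failure first) are all assumptions, and what the service does with a body that matches nothing is not covered here.

```python
def classify_ping(body, start_kw=(), success_kw=(), failure_kw=()):
    """Return "failure", "start", or "success" based on keywords found in the body.

    Returns None when no keyword matches. Checking failure keywords first is an
    assumption made for this sketch.
    """
    text = body.lower()
    checks = (("failure", failure_kw), ("start", start_kw), ("success", success_kw))
    for signal, keywords in checks:
        if any(keyword.lower() in text for keyword in keywords):
            return signal
    return None


# Hypothetical request bodies and keyword lists.
result = classify_ping(
    "Backup FAILED with error 12",
    success_kw=["completed"],
    failure_kw=["failed", "error"],
)
```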

European enterprises prioritise governance in AI deployments, as North America accelerates towards full autonomy

Digitate report reveals differing approaches to AI deployment between Europe and North America, but ROI remains consistent. Europe leading on governance while NA organisations show faster progress towards autonomous operations.

November 2025 - Early Warning Signals

November brought a steady flow of service disruptions across productivity, finance, developer tools, and major consumer platforms. Two incidents stood out as the month’s most significant: a major Google Workspace outage on November 12 affecting Docs and Sheets globally, and a widespread Cloudflare issue on November 18 that caused cascading failures across multiple services.

Introducing our new service monitor APIs

We’re pleased to announce new enhancements to the StatusGator API platform that make it easier to automate how you monitor third-party services. The new Service Search, Create Service Monitor, and Update Service Monitor endpoints give developers more control over how monitors are created, labeled, and maintained across projects and environments. These APIs are designed for teams that integrate StatusGator into their deployment processes, internal tooling, or infrastructure automation.

New roadmap & feature request hub

We’re excited to announce that StatusGator has officially moved to a new platform for collecting feature requests, organizing our roadmap, and keeping you updated on what we’re building. This new system makes it easier than ever to share ideas, vote on improvements, and follow the progress of the features that matter most to you.

Incident IQ: Outage announcement bar

Our Incident IQ integration just got better. Meet the Outage Announcement Bar, a simple way to surface live outage details inside Incident IQ. This new feature makes it even easier for users, support teams, and administrators to stay aware of service disruptions the moment they happen. This update builds on our existing Incident IQ integration, which already syncs outage reports from your StatusGator status page into Incident IQ.

Why AI Will Push #Telemetry Budgets to the Breaking Point in 2026

Telemetry growth is about to hit a new level in 2026. Nick Heudecker from Cribl walks through our new predictions report and explains why observability costs are set to surge again, with more than a third of enterprises spending at least 15% of their IT budgets on telemetry alone. He also shares how agentic AI adds new risk to the data pipeline, why most AI workloads will struggle to scale, and how platform shifts and market forces will reshape the data landscape.

#AI Powered Data Protection Inside Cribl Guard

Cribl Guard uses an always running AI agent to spot sensitive data as it moves through your environment and recommend the right protections in real time. In this demo, you will see how the agent samples live events, identifies patterns like credentials and credit cards, and turns them into one click fixes that keep your destinations safe. Faster detection, smarter rule recommendations, and instant mitigation. This is what modern data protection looks like.

New agents in the Dojo: Expanded Sumo Logic Dojo AI

Back in September, we unveiled Sumo Logic Dojo AI, our agentic AI platform built to power intelligent security operations and incident response. With that launch, we introduced Mobot, our conversational interface, as well as our first agents designed to help automate routine tasks, streamline investigations, and give security teams the freedom and ability to focus on analyzing the highest value security issues facing their organization. Today, we’re excited to share the latest additions to Dojo AI.

Ep 20: re:Invent FOMO? Dojo AI demo

Not heading to re:Invent this week? Don't worry—we've got you covered. In this episode, we welcome Architect Solutions Engineer, Jake Lee, to preview the exciting new Sumo Logic tools we are showcasing in Vegas. Our new SOC analyst agent acts as an AI partner that instantly assesses incident severity and recommends next steps—no more drowning in alerts. The MCP server breaks down barriers by letting you query Sumo Logic from Slack or integrate security insights directly into your IDE.

Design as Infrastructure

SaaS products that are built for engineers power critical workflows, yet their designs are often afterthoughts. SaaS products often assume that technical audiences will figure out their way through a complex experience, or just forgive them for the paper cuts on the way. A foundational design system can be perceived as a layer of polish rather than an infrastructure investment, especially in the early stages of a startup.

Monitoring Azure Metrics to Protect Uptime And Stop Threats Early

This is the fifth blog in our Azure Monitoring series, and we’re focusing on what’s most critical: keeping your environment secure and always available. Performance and cost mean nothing if your services go offline or your data is compromised. In this post, we’ll highlight the Azure metrics that help CloudOps teams detect threats early, build resilience into their stack, and stay ahead of outages before they impact users or compliance. Missed our earlier posts? Catch up.

Monitor Everything is an Anti-Pattern!

Bullshit and nonsense. But let’s take it from the beginning. The industry’s story goes something like this: Then, in the same breath: You see the contradiction already, right? The same industry that tells you “collect less, simplify, trust the experts” is also the industry where: This isn’t an observability strategy. It’s observability by hindsight. Right. Good. Now we’re having fun.

Here's What a Network Needs After a Cloud Migration

By now, most organizations have realized the benefits of moving some, most, or all of their business applications to the cloud. The cloud typically offers better security and performance, at a lower price, than housing resources on-premises. You may have helped them in that migration or you may have been hired after it was complete. Either way, a client with cloud hosting has different network requirements than one whose infrastructure is primarily on-premises.

Configuring an Internet Connection for a Cloud-Hosted Environment

Part 2 in our series on Here’s What a Network Needs After a Cloud Migration. Part 1 looked at how to redesign the LAN. When a company’s application infrastructure moves to the cloud, a reliable Internet connection becomes mandatory. Hiccups in Internet service that might have been an inconvenience when apps were in-house now grind the business to a halt. Unfortunately, the Internet link happens to be the single least reliable element in an IT infrastructure.

Managing User Access & Authentication in a Cloud-Hosted Environment

This is the third and final instalment in a series on Here’s What a Network Needs After a Cloud Migration. Part 1 looked at how to redesign the LAN. Part 2 outlined strategies for the Internet connection. One of the things that becomes more important in a cloud-based application environment is managing user access and authentication.

Stop the Insanity! Quit Doing These 7 Manual Network Management Tasks

Active network infrastructure management is a key element of any managed service offering. Traditionally, network management has involved a lot of tedious manual work, making it expensive and very hard to scale. And that’s why many MSPs have shied away from actively managing the network. But not managing network infrastructure at all is a risk to your business. Your clients likely expect you’re looking after the network whether you’ve promised it or not.

How Browser Monitoring Tools Enhance Application Reliability & User Experience

Modern web applications are increasingly complex, with dynamic content, single-page apps (SPAs), APIs, and third-party integrations. For businesses, ensuring application reliability and a seamless end-user experience is critical. Poor performance can lead to customer dissatisfaction, revenue loss, and reputational damage. This is where browser monitoring tools and browser performance monitoring come into play.

How to Fix Cyclic Inheritance Errors in Icinga Director during Object Configuration

Icinga Director is a powerful tool that greatly simplifies the configuration, management, and deployment of monitoring objects in Icinga. It provides a user-friendly interface and automation features that make complex setups easier to maintain. Occasionally, though, you may unintentionally introduce a cyclic inheritance while configuring templates. A typical case occurs when a template imports another template that eventually imports the original one again.
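
A cycle of this kind is easy to spot programmatically. The Python sketch below runs a depth-first search over a mapping of template name to imported templates and returns the first loop it finds. It does not use any Icinga Director API; the `find_cycle` function and the template names are hypothetical, and the mapping would have to be exported from your own configuration.

```python
def find_cycle(imports):
    """Return one import cycle as a list of template names, or None if there is none.

    `imports` maps each template name to the list of templates it imports.
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}   # template -> IN_PROGRESS or DONE (absent means unvisited)
    path = []    # templates on the current depth-first path

    def visit(template):
        state[template] = IN_PROGRESS
        path.append(template)
        for parent in imports.get(template, []):
            if state.get(parent) == IN_PROGRESS:
                # parent is already on the current path, so we have come full circle
                return path[path.index(parent):] + [parent]
            if parent not in state:
                found = visit(parent)
                if found:
                    return found
        path.pop()
        state[template] = DONE
        return None

    for template in list(imports):
        if template not in state:
            cycle = visit(template)
            if cycle:
                return cycle
    return None


# Hypothetical templates: generic-host eventually imports itself again.
templates = {
    "generic-host": ["linux-host"],
    "linux-host": ["base-host"],
    "base-host": ["generic-host"],
}
cycle = find_cycle(templates)
```

The returned list starts and ends with the same template, which shows exactly which import closes the loop.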

9 Monitoring Tools That Deliver AI-Native Anomaly Detection

The observability market has moved beyond manual threshold-setting. Modern platforms use statistical algorithms, machine learning, and causal AI to detect anomalies automatically. Some work immediately after deployment. Others train on your data for better accuracy. Each approach has technical trade-offs worth understanding. This guide compares how nine monitoring solutions handle automated anomaly detection and root cause analysis.

Grafana Service Center: Simplify Service Reliability in One Place

Grafana Service Center gives engineers and stakeholders a single place to ensure service reliability. In this video, Staff Product Manager Ryan Kehoe walks through how Service Center ties together alerts, SLOs, dashboards, incidents, and metadata for each service. Learn how to centralize reviews, speed up investigations, and improve visibility across your teams—all within Grafana Cloud.

All Is Calm, All Is Compliant: Staying Audit-Ready Through the Year-End Rush

As the year winds down, I find that most cybersecurity and compliance teams are focused on closing projects, hitting targets, and maybe even planning a well-earned break. But regulators? They don’t take holidays. FCA, PRA, GDPR – they remain vigilant, and so should you. For IT leaders, this season often feels like walking a tightrope: balancing operational demands with the relentless need for compliance.

Honeycomb Frontend Observability - See Everything

In this video we take a tour through Honeycomb's Frontend Observability offerings for Web and Mobile. We see how the launchpads can help spot performance errors, how errors that occur in the frontend can be traced all the way to their cause in other backend services easily with the error investigations feature, and how easy it is to find differences between traces across various devices.

How To Migrate Away From DogStatsD Using Telegraf

Datadog is a popular monitoring platform, and one of its key components is DogStatsD, a customized extension of the original open-source StatsD protocol. DogStatsD adds powerful features like tagging, histograms, and distributions, but it also introduces vendor lock-in, because DogStatsD metrics follow a specific wire format that many other monitoring platforms do not natively support.
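
As a rough illustration of that wire format, the sketch below parses a single DogStatsD datagram line of the form `name:value|type|@sample_rate|#tag:value,...`. It is a simplified reading of the format, not Datadog's or Telegraf's parser: the function name `parse_dogstatsd_line` is hypothetical, values are kept as raw strings, and events and service checks are ignored.

```python
def parse_dogstatsd_line(line):
    """Parse one DogStatsD metric line into a dictionary (simplified sketch)."""
    name, sep, rest = line.strip().partition(":")
    fields = rest.split("|")
    if not name or not sep or len(fields) < 2:
        raise ValueError(f"not a metric line: {line!r}")

    metric = {
        "name": name,
        "value": fields[0],   # kept as a string; set values need not be numeric
        "type": fields[1],    # e.g. c (counter), g (gauge), h (histogram), d (distribution)
        "sample_rate": 1.0,
        "tags": {},
    }
    # Optional sections follow in any order, each marked by its first character.
    for field in fields[2:]:
        if field.startswith("@"):
            metric["sample_rate"] = float(field[1:])
        elif field.startswith("#"):
            # The tag section is the DogStatsD extension that plain StatsD lacks.
            for tag in field[1:].split(","):
                key, _, value = tag.partition(":")
                metric["tags"][key] = value
    return metric


print(parse_dogstatsd_line("page.views:1|c|@0.5|#env:prod,region:eu"))
```

A plain StatsD line such as `page.views:1|c` parses the same way but yields an empty `tags` dictionary, which is the part a receiver without DogStatsD support would not understand.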

How to Write a Cover Letter That Actually Helps You Get the Job

Cover letters are supposed to help you shine, but most of them blur together into the same polite, forgettable paragraphs. The intention is good (“I want them to notice me!”), but the execution… not so much. So, here’s a simple, honest guide to writing a cover letter that actually works, especially if you’re applying to Checkly. Spoiler: shorter is better. And in this AI era, authenticity is better than polished perfection.

Improve service reliability and ops culture with Grafana Cloud Service Center

Today’s engineering organizations are built around service ownership. Service owners are accountable for keeping their services reliable, performant, and ready to scale. But no service operates in isolation; every team depends on others, and those dependencies form a complex web that can be hard to see, let alone understand. To truly deliver reliable systems, you need visibility not only into how your own service performs, but also how it affects others.

Monitor Claude Code adoption in your organization with Datadog's AI Agents Console

AI coding assistants are quickly becoming a core part of software engineering workflows, helping developers write, refactor, and review code faster. But without effective monitoring, it can be difficult to know whether these tools are performing reliably and proving useful to engineers. As organizations scale their use of tools like Claude Code, key questions emerge.

Our latest updates across the VictoriaMetrics Observability ecosystem

We’re excited to announce a set of updates across the entire VictoriaMetrics open source product suite, including VictoriaMetrics, VictoriaLogs, VictoriaTraces, and the VictoriaMetrics Kubernetes Operator. These improvements bring better performance, stronger security, enhanced metadata visibility, and a smoother experience when running observability at scale.

AI Agent for Business SLA Predictions: Safeguarding Business Continuity with Predictive Intelligence

Modern business functions are built on the promise of a smooth, seamless experience, with no downtime and no long waits for backend processes to finish. For such digital operations, timely execution of business processes (such as financial closings, order fulfilment, and report generation) is non-negotiable.

Accelerate investigations with AI-powered log parsing

When debugging production issues, investigating security incidents, or analyzing network traffic, engineers and analysts need not only to find the right logs but to make sense of all the dense, unstructured data generated by different systems. Logs rarely ship neatly laid out in a way that facilitates filtering, faceting, or graphing for every possible scenario. As a result, teams often find themselves writing regular expressions or custom parsers on the fly, which can be error-prone and time-consuming.