Migrate to SCOM 2025: A Seamless Transition for Enhanced Monitoring

Are you ready for the next evolution of System Center Operations Manager (SCOM)? Microsoft launched SCOM 2025 in November last year, bringing new enhancements and improved capabilities. To help you navigate the transition smoothly, we’re hosting an exclusive webinar where our experts will walk you through the migration process, best practices, and new feature highlights.
Sponsored Post

How to Quickly Analyze CloudFront Cloud Logs in Amazon S3

Content delivery networks (CDNs) such as Amazon CloudFront generate a flood of log files. In today's world where your customers are all around the globe, it's important to make sure that your websites' application assets are as close to the users as possible.
Sponsored Post

Monitoring Cloud Foundry in SAP Business Technology Platform (BTP)

Cloud Foundry is possibly the most popular environment on SAP Business Technology Platform. When customers build applications with the SAP Cloud Application Programming (CAP) framework to extend SAP S/4HANA solutions and achieve a clean core, they typically deploy using Cloud Foundry. After the applications on Cloud Foundry go into productive use, they become business critical and that creates a need for observability in those applications and the platform. Monitoring of Cloud Foundry is now an essential requirement of SAP operations teams.

The 8 Hidden Pitfalls of Using AWS CloudWatch

AWS CloudWatch is a widely used observability tool that comes built into AWS. It provides easy access to logs, metrics, and alarms, making it a convenient choice for teams monitoring AWS workloads. But while CloudWatch offers a lot of power, many teams unknowingly misconfigure or misuse it, leading to unexpected costs, limited visibility, and operational challenges. Here are some common pitfalls we see—and how to avoid them.

Moving to VDI? Don't Forget Your Web Apps

I recently spoke to one of our customers in financial services, which offers its services through a network of agents located across the United States. The agents are customer-facing and revenue-generating. They rely on a variety of browser-based applications to deliver services to their clients – making these applications mission-critical.

StatusGator now monitors 5,000 services - and growing!

We’re thrilled to announce a major milestone: StatusGator now monitors more than 5,000 services! Whether you rely on cloud platforms, SaaS tools, developer APIs, or infrastructure providers, we’ve got you covered. Our extensive service coverage means you can track the status of all your critical dependencies in one place, reducing downtime surprises and keeping your team informed.

10 Reasons Why Tech Companies Need StatusGator

Reliance on cloud services for infrastructure, collaboration, and seamless operations continues to grow. With organizations investing heavily in cloud infrastructure, Statista projects that the global public cloud computing market will reach $127 billion. As a result, monitoring solutions have become indispensable. A well-monitored cloud environment helps reduce downtime, prevent revenue loss, and ensure smooth business operations.

Comparing Go vs Ruby

Ruby and Rails are great tools that allow you to create complex web applications quickly. Well, some kinds of complex web applications. While they excel at traditional, monolithic, server-rendered applications, they fail to excel at delivering real-time or distributed services. This is why it's so handy for Rubyists to learn a programming language like Go. Go is designed to write lightweight services that handle lots of inbound connections.

Top B2C eCommerce Strategies in 2025: What's Actually Working

ECommerce is a mess right now. Luxury platforms are crashing. Social commerce is booming (but probably not for long). CAC is through the roof. And somehow, despite all this, brands still need to find a way to stand out, sell, and make money. If you’re running an eCommerce brand in 2025, here’s what’s actually working—and what’s just hype.

Prometheus Functions: How to Make the Most of Your Metrics

Keeping track of your infrastructure is non-negotiable. Prometheus makes that easier by collecting metrics and alerting you when something’s off. It’s a powerful tool that helps you understand what’s happening under the hood, whether you’re running a small cluster or managing large-scale applications. In this guide, we’ll break down Prometheus functions—what they do, how they work, and why they matter for better observability. Let’s get into it.

CloudFront on AWS: Basics & Setup Guide

Some websites load in a snap, while others make you wonder if the internet is broken. The difference? Often, it comes down to how (and where) their content is served. A Content Delivery Network (CDN) helps by storing copies of your content in multiple locations worldwide, so users don’t have to wait for a distant server to respond. If you're on AWS, CloudFront is the built-in way to do this—helping speed things up while also handling security and traffic optimization.

How to implement multi-window, multi-burn-rate alerts with Grafana Cloud

Andrew Dedesko is a backend software engineer with 13 years of experience. He became very interested in metrics and alerting after being woken up countless nights while on call. Outside of work, Andrew likes cycling, camping, making s’mores, and pancakes. Adriano Mariani is a software engineer with three years of experience specializing in backend software development. Currently, Adriano is working at Kijiji on SEO-related initiatives.

OpenTelemetry vs. Datadog: Key Differences Explained

Choosing between OpenTelemetry and Datadog isn't just another tool decision. It's about how you'll monitor your systems, troubleshoot issues, and ultimately keep your services running smoothly. If you've been tasked with figuring out which route to take, you're in the right place. Let's get started!

Getting started with GitHub Actions dashboards

If you are part of an engineering team, monitoring the performance of your CI/CD pipelines is a high priority. With the SquaredUp GitHub plugin you can view key metrics for your GitHub repos and workflows all within a single pane of glass. We also have plugins for Jira, Circle CI, Azure DevOps and more. So even if you are using many different tools you can still get an end to end view of your processes.

Mastering NIS2 Compliance: Advanced Threat Detection Simplified

In this webinar, “Mastering NIS2 Compliance: Advanced Threat Detection Simplified” we’ll demystify NIS2 and demonstrate how the Progress Flowmon Network Detection and Response (NDR) solution can streamline compliance efforts and enhance an organization's security posture.

Future Proof Your IT Monitoring with Microsoft SCOM 2025

Are you ready to transition to System Center Operations Manager (SCOM) 2025? The new release has been available since November 2024, so organizations must prepare to migrate seamlessly while maximizing new features and capabilities. This webinar will guide IT administrators, system engineers, and decision-makers through the migration process, best practices, and key enhancements in SCOM 2025.

AI Agents: Hype or Reality?

A few years ago, it was all about Blockchain; before that, IoT, then Big Data, and even earlier, the Cloud. Each era brought a paradigm shift of sorts, drawing huge investments and promises. Some delivered, some didn’t, but they each brought advancement in tech. Today, we find ourselves fully embracing the AI hype cycle that started circa 2022 with OpenAI.

How To Monitor Server Uptime

Keeping your servers online is always important for the health of your business and keeping users happy. Essentially, if you are keeping an eye on your servers, you can proactively fix problems before they blow up rather than fighting them as they arise. Setting all this up can be a breeze or a bit of a headache, depending on your servers, what metrics you're tracking, and your expertise. Either way, MetricFire’s got your back!

Best incident management tools in 2025 [45 analyzed, top 3 picks]

PagerDuty, Splunk, ServiceNow — with dozens of incident management tools on the market, how do you know which one to choose? Here's the reality — downtime costs organizations an average of $9,000 per minute. That's why companies are increasingly investing in incident management tools to reduce disruption and improve their incident response. But with the market evolving rapidly and new players emerging constantly, selecting the right tool has become more challenging than ever.

9 Essential Network Monitoring Protocols: An Overview

Network monitoring protocols are essential for keeping your network running smoothly. They are data-collection and analysis techniques that provide insights into the health of your network and can help you identify and fix network problems before they cause major disruptions. Think of your network like a city's road system: data packets are cars, routers are traffic lights, and switches are intersections.

Introducing CartShark

Ecommerce websites are more vulnerable than ever to cyberattacks. Among these threats, web-skimming attacks – also known as data exfiltration or Magecart attacks – stand as the number one threat, targeting sensitive customer data and payment information. RapidSpike is proud to introduce CartShark, a revolutionary cybersecurity platform that empowers ecommerce businesses to combat these threats swiftly and effectively.

InfluxDB 3 Core and Enterprise Architecture Highlights

Time series data innovators and open source community members following us will know that we recently released two new products: InfluxDB 3 Core and InfluxDB Enterprise. InfluxDB 3 Core is a high-performance recent data engine optimized for real-time monitoring, data collection, and streaming analytics use cases. InfluxDB 3 Enterprise builds on Core’s foundation by integrating historical analysis and data compaction, enabling efficient querying over extended time ranges.

HTTP Caching Headers: The Complete Guide to Faster Websites

The fastest website is the website that is already loaded, and that’s exactly what HTTP caching delivers. HTTP caching is a powerful technique that lets web browsers reuse previously loaded resources like pages, images, JavaScript, and CSS without downloading them again. Understanding HTTP caching headers is essential for web performance optimization, but misconfiguration can cause big performance problems.

It's time for a new approach: Edwin AI solves ITOps biggest challenges with agentic AI

For years, the term “AIOps” has been tossed around, but for IT teams, it hasn’t really brought the change it promised. Gartner coined the term, promising that machine learning and AI would forever change how we manage IT operations. Yet, the reality has been underwhelming. For most teams, traditional AIOps has amounted to little more than event management with a shiny new label.

Everything You Need to Know About OpenTelemetry Agents

If you’re reading this, chances are you’re already familiar with OpenTelemetry (OTel)—the open-source standard for collecting observability data. But what about OpenTelemetry agents? How do they work, and why do they matter? This guide unpacks everything you need to know about OTel agents—where they fit in your stack, how to set them up, and common pitfalls to watch out for. Let’s get into it.

Handling persistent storage problems in Kubernetes clusters

Persistent storage is the backbone of stateful applications running in Kubernetes. Whether you are managing databases, logs, or application states, ensuring transactional data remains intact despite pod restarts or node failures is a challenge. In this blog, we will discuss the most common persistent storage issues in Kubernetes and how to handle them with practical, real-world solutions.

How to Effectively Monitor Nginx and Prevent Downtime

Nginx is widely known for its high performance and reliability. However, just like any software running in production, it requires continuous monitoring to ensure smooth operation. Issues such as high latency, unexpected crashes, or overwhelming traffic spikes can lead to performance degradation or even complete outages. Therefore, implementing a robust monitoring strategy is crucial to maintaining the health and stability of your Nginx deployment.

Troubleshooting Kubernetes deployment failures

Do you feel like you're solving a puzzle when deploying applications in Kubernetes? You are not alone in this! When something goes wrong during application deployment, it becomes all the more crucial to diagnose the issue methodically and get things back on track. This guide walks you through practical steps for troubleshooting deployment failures efficiently.

Monitoring for Kubernetes API server performance lags

The Kubernetes API server is a key component in the control plane. Every interaction, whether deploying applications, scaling workloads, or monitoring system health, depends on the API server. Consider the human body: We have the brain as the critical organ, and the nerves function as the control system. The Kubernetes API server is like the nerve center of cluster management.

How to perform a ping check with Grafana Cloud Synthetic Monitoring

Synthetic monitoring is a critical practice to proactively track the health and performance of web applications. By simulating user interactions, this approach helps developers identify issues before they impact real users. One of the simplest forms of synthetic monitoring is known as a ping check, which verifies whether an endpoint is reachable. In this blog post, we’ll take a closer look at what a ping check is, and then walk through how to perform one using Grafana Cloud Synthetic Monitoring.

Getting started with Azure cost dashboards

As an Azure admin, you need to keep a close eye on the costs you incur running your workloads in the cloud. You also want visibility into any deployed resources that contribute nothing to the business yet accumulate cost over time. Using a dedicated Azure plugin, SquaredUp dashboards will help you understand your Azure costs across services, resources, locations and apps – so you can keep tabs on how much you're spending and identify opportunities to save costs.

How to Monitor Azure Cloud Services with Grafana Cloud | Demo | Observability | Grafana Labs

Microsoft Azure Cloud monitoring has never been more streamlined! In this video, Vasil Kaftandzhiev, Product Manager for Cloud Provider Observability in Grafana Cloud, walks you through how easy it is to monitor Azure Cloud Services with Grafana. With out-of-the-box dashboards, you can instantly visualize key metrics for essential Azure services like API Gateway, Queue Storage, Virtual Machines, Log Storage, Events Hub, Network Load Balancers, and SQL.

Our New CLI: How and Why We Made It

We are happy to announce our latest project at MetricFire: a brand-new CLI tool! Get ready to start monitoring your systems in one step - no need to modify any configuration files manually. Just run a terminal command, follow the prompts, and forward your system metrics to Hosted Graphite in minutes. In this article, we’ll share an overview of the Hosted Graphite CLI, why we’re making it, and how we’re making it.

Improve gaming app performance with Unity support in Datadog RUM

As mobile gaming evolves, players have higher expectations for seamless experiences, real-time interactions, and cross-platform accessibility. Whether you’re developing games for iOS, Android, or another mobile operating system, maintaining and optimizing the performance of your game is critical for player retention. For instance, if a mobile game becomes laggy or begins to drop frames during gameplay, players will grow frustrated and abandon the game altogether.

TCP Checks Now Available in Checkly

Checkly has always helped you monitor your APIs and web services, ensuring they stay fast, reliable, and available. But application reliability doesn’t stop there—databases, message queues, and mail servers all play a crucial role in your infrastructure. To provide full application reliability, we’re expanding into network monitoring with TCP checks. Now, you can monitor critical non-HTTP services directly in Checkly—without adding extra tools to your stack.
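
To make the idea of a TCP check concrete, here is a minimal sketch in Python. It is not Checkly's implementation or API; the function name `tcp_check` and its parameters are hypothetical, and the sketch assumes a check passes when a TCP connection to the target opens within a timeout.

```python
import socket


def tcp_check(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration only: it tests reachability of the
    port, not whether the service behind it is answering correctly.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

A call such as `tcp_check("db.internal.example", 5432)` (a made-up hostname) would report whether a database port accepts connections. A real monitor would usually also record the connection time and run the check from several locations.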

Why and How You Should Use Your Learning & Visiting Budget

When I joined Checkly as Junior People Operations Manager, one of the benefits that immediately stood out to me was the Learning & Visiting budget. I found myself wondering—how is this budget actually being used across the company? At the start of the year, many of our team members plan how they’ll use their learning budget—whether to enhance professional skills or pursue self-driven projects. With flexible guidelines, we encourage them to invest in what matters most.

Integrating Google SecOps with Bindplane February 2025

Google SecOps (formerly Chronicle) is Google Cloud’s security operations platform (SIEM) that helps you detect, investigate, and respond to cybersecurity threats. Integrating Bindplane enables an easy way of standardizing how you efficiently collect, process, and forward security-relevant data to Google SecOps. In this live workshop you’ll get a hands-on demo of how to configure log collection with the Bindplane Distro for OpenTelemetry Collector, and best practices for data standardization using open standards and OpenTelemetry.

Optimized IBM Power Systems Monitoring

Monitoring IBM Power Systems requires a robust, efficient, and proactive approach. With the release of NiCE HMC VIOS Management Pack v1.1, IT teams gain access to advanced monitoring capabilities designed to improve visibility, optimize performance, and ensure seamless operations within Microsoft SCOM.

Agentless monitoring for cloud VMs: Simplify scaling and observability

Managing cloud infrastructure is challenging enough without adding the burden of deploying and maintaining monitoring agents. What if there was a simpler, more efficient way to monitor your virtual machines (VMs)? In the first part of this series, we looked at the challenges of agent-based monitoring for cloud virtual machines and presented a better solution: agentless monitoring. Agentless monitoring is an efficient approach to observability that eliminates the need to install and manage software agents on each monitored device.

NIS2 Directive and Cybersecurity: Requirements, Risk Management, and Monitoring

The days when an antivirus and common sense were enough to guarantee an organization’s cybersecurity are long gone. Especially if you work in a critical sector. That’s why the NIS2 Directive (2022/2555) of the European Union establishes cybersecurity obligations for these key activities… and the consequences of non-compliance. These consequences are significant, so let’s analyze the regulation, when it applies, and how to implement it.

Why Super Bowl 2025 was a triumph for Internet Resilience

When you’re spending close to $8 million for a 30-second Super Bowl ad, the one thing you don’t want to leave to chance is your website—especially when millions of viewers, whether they came for the game, Kendrick Lamar, or to catch a glimpse of Taylor Swift in the stands, might head there right after the spot airs. Make no mistake: web performance is just as critical as the ad itself.

SCOM 2025 upgrade: In-place upgrade or side-by-side installation

SCOM 2025 was released last year, and now is the time to start planning your upgrade. But where do you begin? Upgrading can be a complicated process, and it is important to consider the different options to make the process as smooth as possible. When upgrading, you can choose between an in-place upgrade or a side-by-side installation, and each approach leads to different outcomes. The right path for you depends on several factors.

OpenTelemetry Metrics Explained: A Guide for Engineers

OpenTelemetry (often abbreviated as OTel) is the gold-standard observability framework, allowing users to collect, process, and export telemetry data from their systems. OpenTelemetry’s framework is organized into distinct signals, each offering an aspect of observability. Among these signals, OpenTelemetry metrics are crucial in helping engineers understand their systems.

Understanding OpenTelemetry: A Practical Guide

Observability is essential for understanding how modern applications perform and behave in production. OpenTelemetry has emerged as the industry standard for collecting, processing, and exporting telemetry data—traces, metrics, and logs—without vendor lock-in. This guide will walk you through OpenTelemetry’s core components, how it works, and why it’s a game-changer for observability.

The challenges of agent-based monitoring for cloud virtual machines and how to overcome them

Imagine discovering that 40% of your cloud infrastructure went unmonitored for a week because monitoring agents failed to deploy during an auto-scaling event. This scenario isn’t just hypothetical—it’s a growing reality for organizations relying on traditional agent-based monitoring in dynamic cloud environments.

Getting Started with OpenTelemetry for Browser Monitoring

OpenTelemetry is the go-to open-source standard for observability, but when it comes to tracking frontend performance and user interactions, things get a little tricky. Unlike backend services, browsers introduce challenges like CORS restrictions, asynchronous execution, and limited access to certain telemetry data. This guide covers everything you need to know about using OpenTelemetry in the browser, from setup to best practices, advanced configurations, and real-world debugging techniques.

How to avoid blowing the budget on Azure AI

So you had a great day playing with really awesome new tech, solving big business challenges, and feeling like you really nailed it. Then you wake up the next day to an alert from Azure telling you you've blown your monthly budget and it's only the first week of the month. We've all been there... right? Using any cloud service comes with a cost, but for most services the budget risk is low. Cost calculated daily isn't a problem when usage is predictable, but not everything works like that.

Search and analyze unsampled logs in real time with Live Tail

With thousands of logs generated every minute from your infrastructure, applications, services, and devices, retaining all of this data for active search and analysis can be cost-prohibitive. Because log volumes continue to grow rapidly as operations scale, it’s common for organizations to implement log management strategies and limit the amount that they store in order to minimize costs.

Integration roundup: Monitoring your modern data platforms

Modern applications increasingly rely on specialized databases and platforms to power real-time analytics and support advanced AI/ML capabilities. These tools help teams accelerate development by consolidating workflows and processes, enabling faster and more efficient data operations. That’s why Datadog has launched three new data platform integrations with Supabase, DuckDB, and Milvus.

Networks are everyone's business - TCP Checks for app developers

Checkly is the industry’s best tool to monitor your production applications. With the power of Playwright, developers can test the systems they’ve developed and roll out those tests as production monitors running from multiple geographies on the Checkly system. Checkly also monitors thousands of API endpoints with complex validation, setup and cleanup scripts, and reliable alerting. So why are we expanding into TCP-based checks?

From basics to benefits: A beginner's guide to cloud computing

Cloud computing powers everything from startups to global enterprises. With it, a new business can scale quickly without investing in expensive servers, while large organizations can store vast amounts of data and run applications seamlessly across the world. Simply put, cloud computing delivers computing resources over the internet that are scalable, cost-effective, and accessible—anytime, anywhere. Let’s break down the fundamentals of cloud computing and why it matters.

Mastering Docker for seamless application deployment

Imagine you're developing an application on your laptop. It runs perfectly, but when you deploy it on a server, things break—dependency mismatches, configuration issues, and endless debugging. Docker eliminates these problems by packaging applications and their dependencies into portable, lightweight containers. This ensures that applications run consistently across different environments, whether it's a developer’s laptop, a testing server, or a cloud platform.

How Obkio's NPO Plan Supports Organizations Making a Global Impact With Affordable Network Monitoring

At Obkio, we believe in using our resources to give back to organizations that make the world a better place. That’s why we launched our NPO Plan—a program designed to help non-profits access advanced Network Performance Monitoring at a significantly reduced cost. By offering our services to non-profits at a fraction of the price, we help them to focus on what matters most—supporting their missions, rather than worrying about IT costs.

How to Monitor Aerospike With OpenTelemetry and MetricFire

Aerospike is a high-performance, real-time NoSQL database built for speed, scale, and low-latency transactions—think millions of reads/writes per second without breaking a sweat. When you're dealing with high-throughput applications, keeping an eye on Aerospike’s performance isn't just a good idea—it's mission-critical to avoid bottlenecks, connection issues, or unexpected slowdowns.

VictoriaLogs Status Update: Heading Towards the Cluster Version

Today, we’re thrilled to share the latest updates on VictoriaLogs, your trusted open-source solution for efficient and user-friendly log management. Whether you’re just discovering VictoriaLogs or have been using it for a while, this post will walk you through the recent enhancements and give you a sneak peek at the much anticipated cluster version that’s on the horizon.

Monitor Microsoft Azure in Grafana Cloud: simplify and centralize your cloud provider observability

Organizations around the world use Microsoft Azure to power their businesses. The cloud computing platform includes hundreds of products and services organizations can use to build and manage applications, but monitoring those environments can often feel like navigating a maze of fragmented data, tools, and processes.

Enhancing Jenkins performance: Resource optimization for high-traffic workloads

Jenkins is the backbone of many CI/CD pipelines, automating builds, tests, and deployments at scale. However, when handling high-traffic workloads, such as during peak development hours, large-scale deployments, or parallel builds and pipelines, Jenkins can quickly become a resource hog, leading to slow builds, queue backlogs, and even system crashes. Optimizing resource usage is essential to ensure smooth, efficient, and scalable performance.

AI Governance in 2025: A Full Perspective on Governance in Artificial Intelligence

In a world where artificial intelligence (AI) is leaping forward — growing at a CAGR of almost 36% from 2024 to 2030 — questions about governance and ethics with the use of AI are surfacing. As humans continue to develop AI systems, it is crucial to establish proper guidelines to ensure powerful technologies like generative AI and adaptive AI are used in a responsible manner.

Lakehouse Demo

Cribl Lakehouse is the first lakehouse built for the unpredictable nature of telemetry data. Unlike traditional solutions for structured data, it eliminates schema complexity and manual transformation while delivering elastic scalability, automated, cost-optimized tiered storage, and federated queries across diverse datasets. IT and security teams can effortlessly store and analyze massive volumes of evolving telemetry data in real time, without data engineering expertise, unlocking the full value of their data with a unified management experience.

What is Hosted OpenSearch? A Complete Guide for Businesses

As data continues to grow exponentially, businesses need powerful tools to search, analyze, and visualize their data efficiently. OpenSearch has emerged as a top choice for organizations seeking an open-source, scalable search and analytics engine. However, managing OpenSearch in-house can be complex, costly, and resource-intensive. That’s where hosted OpenSearch comes in.

Shorten your MTTR with Checkly Traces

We all know that Checkly is a ‘secret weapon’ for engineering teams who want to shorten their mean time to detection (MTTD). With Checkly, you can know within minutes if your service is unavailable for users, or acting unexpectedly. In this article we’ll talk about how Checkly traces can help you expand on the benefits of Checkly, adding insights that will help you diagnose root causes, and further reduce your mean time to resolution (MTTR) for outages and other incidents.

Key metrics to monitor for optimal SQL Server performance

Microsoft SQL Server is a critical database component of many business applications, ensuring data integrity, fast query performance, and seamless transactions. However, maintaining peak performance requires proactive monitoring of essential metrics. In this blog, we’ll explore the key SQL Server performance metrics you should track and how they help prevent performance issues, optimize resource usage, and enhance database efficiency.

Optimizing AWS NAT Gateway Usage

AWS NAT Gateways are essential for private subnet access but can quickly become a costly burden, even when idle. With Kentik, cloud and network engineers gain deep visibility into NAT Gateway traffic, allowing them to identify underutilized gateways, analyze high-cost usage, and explore cost-saving alternatives like VPC Endpoints, Internet Gateways, or direct peering.

Why Context Matters: Mastering Serverless App Monitoring

Hi there, and welcome to the second video in this series on observing AWS serverless applications with Datadog. In this video, you’ll learn how important it is to add custom business context to the telemetry you send to Datadog and how you can use that inside APM to quickly diagnose and debug issues. You’ll walk away with an understanding of the importance of distributed tracing, as well as how you can add specific business context to the telemetry you send.

Managing Multiple Service Instances with a Systemd Generator

When working with systemd services in Linux, you might encounter situations where multiple instances of a service need to be managed dynamically. When I had to develop a solution to monitor multiple Kubernetes clusters with Icinga for Kubernetes, I ran into exactly this challenge.

Elasticsearch Reindex API: A Guide to Data Management

If you've been working with Elasticsearch for a while, you’ll eventually run into a situation where you need to reindex your data. Maybe you’re changing mappings, upgrading versions, or restructuring your documents. That’s where the Elasticsearch Reindex API comes in. In this guide, we'll walk through everything you need to know about the Reindex API—what it is, how it works, common use cases, performance optimizations, and potential pitfalls. Let’s dive in.

Netdata vs. Prometheus: Which Monitoring Tool is Right for You? #monitoring #realtime

Netdata's founder Costa Tsaousis built Netdata with performance and efficiency in mind. The result? 8x less RAM usage, 30x less disk I/O, 40x more data retention, 40x more data stored, and up to 22x faster queries—all thanks to our innovative tiered storage system, enabling ultra-efficient long-term queries.

GTMetrix Alternatives: The Best Tools for Website Performance Testing

GTMetrix used to be the go-to tool for checking website speed, but let’s be honest—paying for one-off synthetic tests isn’t worth it. If you’re still relying on synthetic testing alone, you’re missing a big part of the web performance picture. If you care about Core Web Vitals, SEO performance, and user experience, you need more than just lab data. The good news? There are better (and free) alternatives like PageSpeed Insights and WebPageTest for synthetic testing.

State of DevOps: 2024 DORA Report Insights with Google

Enjoy this exclusive webinar with Ben Good from Google as we explore the findings in the 2024 State of DevOps report. For over a decade, the DORA report has provided critical insights into the capabilities and practices that fuel high-performing technology organizations. This report highlights the significant impact of AI on software development, explores platform engineering’s promises and challenges, and emphasizes user-centricity and stable priorities for organizational success.

Pino Logger: The Fastest and Efficient Node.js Logging Library

Logging is an integral part of any production-ready Node.js application. Whether you're debugging issues, monitoring application performance, or setting up a centralized logging system, an efficient logger is crucial. Pino is one of the best choices available due to its speed, low overhead, and powerful features. This guide goes beyond the basics, providing an in-depth exploration of how to optimize Pino for your applications, use advanced features, and integrate it seamlessly with other tools.

Fine-tune notifications with Alert sensitivity

We’re excited to introduce a new feature that gives you greater control over how and when you receive alerts from your website and ping monitors. With Alert sensitivity, you can now specify the number of retries before an alert is triggered, reducing false alarms and ensuring more reliable notifications.
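
The retry idea described above can be sketched in a few lines. This is a minimal illustration, not the product's implementation: the function name `should_alert`, the `retries` parameter, and the list-of-booleans input are all assumptions made for the example.

```python
def should_alert(check_results, retries):
    """Return True once the latest (retries + 1) checks have all failed.

    check_results: list of booleans, oldest first (True means the check passed).
    retries: hypothetical setting for how many failed retries to tolerate
             before an alert is raised.
    """
    needed = retries + 1  # the initial failure plus every retry must fail
    if len(check_results) < needed:
        return False  # not enough attempts yet to decide
    # Alert only if none of the most recent attempts passed.
    return not any(check_results[-needed:])


# A single blip followed by a recovery does not alert with retries=1.
print(should_alert([False, True], retries=1))
# Three consecutive failures do alert with retries=2.
print(should_alert([False, False, False], retries=2))
```

A higher `retries` value trades slower detection for fewer false alarms, which is the balance the feature is meant to expose.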

Boosting IT Efficiency: How to Do More With Less

IT teams are constantly asked to do more with limited resources and budgets. Is your IT team’s monitoring strategy keeping up? Thankfully, these challenges aren’t impossible to overcome. Check out this exclusive webinar where Greg Collins, Product Marketing Manager at Progress, and Jason Alberino, Principal Product Manager at Progress, will share tips on accomplishing your IT goals with less.

Data sources, visualizations, and apps: A guide to extending and customizing Grafana

Grafana’s extensibility has always been one of the keys to its success. It comes with a wide range of data sources that allow you to query your data no matter where it lives, visualizations to help you quickly make sense of that data, and apps that can provide complete observability solutions, all in a single package.

How to Implement OpenTelemetry in NestJS

Modern applications are becoming increasingly complex, and debugging distributed systems can feel like searching for a needle in a haystack. This is where OpenTelemetry (OTel) comes in. If you're using NestJS, integrating OpenTelemetry can provide deep insights into your application's behavior, helping you track performance, troubleshoot issues, and understand service interactions.

Empowering DevOps Teams: Overcoming IT Complexity with Advanced AI + Automation

As IT environments become more complex, larger, and inundated with data, DevOps teams encounter significant obstacles that make efficient operations more challenging. The heightened complexity can create difficulties in maintaining visibility and control across hybrid IT ecosystems. Additionally, the substantial volume of data generated can overwhelm resource-constrained DevOps teams, making it difficult to extract valuable insights and make informed decisions.

Microsoft Entra ID Outage: How Vantage DX Detected the Issue Before Microsoft Acknowledges the Issue

On February 25, 2025, at 11:32 AM EST, Martello’s Vantage DX monitoring began alerting on an issue affecting Microsoft Entra ID (Azure AD SSO). While Microsoft had not yet acknowledged the incident, Reddit forums had noted the issue, and our Vantage DX proactive monitoring detected disruptions impacting authentication across multiple workloads. Vantage DX raised critical warnings for Exchange, OneDrive, and SharePoint.

Easy, comprehensive Logstash monitoring with Elastic Agent

Logstash is a powerful tool for ingesting, transforming, and shipping data from various sources. Visibility into Logstash is critical for optimizing performance and troubleshooting issues related to data ingestion. We’ve greatly improved the Logstash integration to display the status of your Logstash nodes and pipelines at a glance. The integration is now powered by Elastic Agent, which queries Logstash monitoring APIs for data that populates managed dashboards.

Graylog Parsing Rules and AI Oh My!

In the log aggregation game, the biggest difficulty you face can be setting up parsing rules for your logs. To qualify this statement: simply getting log files into Graylog is easy. Graylog also has out-of-the-box parsing of a wide variety of common log sources, so if your logs fall into one of the many categories of log for which there is either a dedicated Input; a dedicated Illuminate component; or that uses a defined Syslog format; then yes, parsing logs is also easy.

Using Amazon RDS for high availability: How monitoring ensures reliable failover

Database downtime can lead to significant disruptions, revenue loss, and frustrated users. Amazon Relational Database Service (RDS) provides a managed database solution with high availability and automated failover to minimize such risks. However, continuous monitoring is crucial to ensuring reliable failover and minimizing downtime by detecting potential issues before they impact operations.

What are Kubernetes audit logs and how to monitor them?

Security and compliance: Many industries, especially those governed by regulations like HIPAA, the PCI DSS, or the GDPR, require detailed logs for compliance and to trace security incidents. Troubleshooting and forensic analysis: If something goes wrong—whether due to accidental configuration changes or malicious activity—having detailed logs helps diagnose the root cause and quickly remediate it.

What is Entra ID? .... and how Entra ID has evolved since the Azure AD rebranding

Entra ID is the new name for Azure Active Directory (Azure AD), Microsoft’s cloud-based identity and access management service. This rebranding, announced in July 2023, is part of Microsoft’s broader Entra product family, which focuses on securing access to digital resources and managing identities in a comprehensive way.

Challenges in Monitoring Applications That Use OAuth

OAuth (Open Authorization) has become a critical component in enabling secure third-party access to APIs, which makes it one of the most widely adopted authentication protocols for modern applications. From allowing users to sign into apps using their Google or Facebook accounts to enabling third-party service integrations, OAuth simplifies the process of granting access to resources without compromising security.

The Benefits of Investing in a High-Quality Battery Box

Having a dependable power source is crucial for various applications, from outdoor adventures and marine travel to off-grid living and emergency preparedness. While batteries provide the necessary energy, ensuring their protection and longevity requires a secure and efficient storage solution. A high-quality battery box not only safeguards the battery from environmental damage but also enhances safety, portability, and ease of use.

Deploying Prometheus with Docker Compose: A Step-by-Step Guide

Prometheus is one of the most popular open-source monitoring and alerting tools. Setting up Prometheus with Docker Compose can make your monitoring stack easier to deploy and manage if you're running containerized applications. This guide will walk you through everything you need to get Prometheus up and running with Docker Compose, from installation to configuration and setting up basic alerts.

Fix slow mobile apps before your users uninstall with Mobile Vitals

Mobile devs know the struggle. Small regressions can cause big issues in production, and fixing them isn't as easy as pushing a quick patch. Unlike a web app, shipping fixes for apps means navigating app store approvals, and often hopping on meetings with customers to debug because mobile issues can be so challenging to recreate. Catching these issues before the 1-star reviews roll in is crucial. Luckily, Sentry just made it easier than ever.

Optimizing Observability Data Volume and Cost with AI

Struggling with high observability costs? In this video, Jade Lassery breaks down the challenges of managing excessive data and skyrocketing expenses. She introduces the Logz.io AI agent, a powerful solution designed to optimize data usage, reduce unnecessary costs, and improve efficiency. Learn how to take control of your observability spending while maintaining high performance. Watch now to discover smarter data management strategies!

Why businesses lose trust after acquisition & how to choose wisely

Acquisitions are a double-edged sword. While they might seem like a sign of growth, they often leave customers dealing with slower updates, higher prices, and even privacy risks. If you’ve ever felt let down after your favorite tool was acquired, this video is for you. We’ll explore why businesses lose trust after acquisitions and share practical tips on how to choose tools that won’t leave you stranded. Plus, discover why ManageEngine is a standout choice for businesses looking for stability, innovation, and a customer-first approach.

Understanding Reverse DNS Lookup

On the information superhighway, an IP address is a series of numbers telling the location of a digital resource, similar to having a street address for a building. However, when all you know is the street address, you have no idea what the building itself looks like. If you’re a visual person, you might insert that address into Google Maps to pull up a picture of the building so you have a marker to help find a drive.
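
To make the address-to-name direction concrete, here is a small sketch using Python's standard `ipaddress` module. It only builds the special domain name that a reverse (PTR) lookup queries; it does not contact a DNS server. The helper name `ptr_name` and the sample addresses (taken from documentation-reserved ranges) are assumptions for illustration.

```python
import ipaddress


def ptr_name(ip):
    """Return the DNS name that a reverse (PTR) lookup for `ip` would query."""
    # reverse_pointer reverses the address and appends the reverse-lookup
    # zone: in-addr.arpa for IPv4, ip6.arpa for IPv6.
    return ipaddress.ip_address(ip).reverse_pointer


print(ptr_name("192.0.2.10"))  # 10.2.0.192.in-addr.arpa
```

An actual lookup would then ask DNS for the PTR record at that name, for example through `socket.gethostbyaddr`, which needs network access and a configured resolver.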

Fearless innovation is the true force behind IT project transformation

Previously, I discussed the challenges of adopting AI in enterprises, focusing on middle managers’ concerns about its impact on their roles. In case you missed it, you can read it here: AI resistance isn’t where you expect it. In this post, I’ll highlight the crucial steps for ensuring successful AI adoption. All business transformations are complex by nature because they change the organizational balance, that is, the equilibrium of power held among different leaders.

How to Build Observability into Chaos Engineering

If you've ever deployed a distributed system at scale, you know things break—often in ways you never expected. That’s where Chaos Engineering comes in. But running chaos experiments without robust observability is like debugging blindfolded. This guide will walk you through how observability empowers Chaos Engineering, ensuring that your experiments yield meaningful insights instead of just causing chaos for chaos’ sake.

How to Implement OpenTelemetry in Next.js

OpenTelemetry is an open-source observability framework designed to instrument, generate, collect, and export telemetry data, including traces, metrics, and logs. It is vendor-agnostic, allowing developers to send data to multiple backend services like Last9, Prometheus, Datadog, or Jaeger without vendor lock-in. For Next.js applications, OpenTelemetry is particularly useful due to the framework’s hybrid rendering approach.

OpenTelemetry Is Not "Three Pillars"

OpenTelemetry is a big, big project. It’s so big, in fact, that it can be hard to know what part you’re talking about when you’re talking about it! One particular critique I’ve seen going around recently, though, is about how OpenTelemetry is just ‘three pillars’ all over again. Reader, this could not be further from the truth, and I want to spend some time on why.

Spoiler Alert: How "Zero Day" Might Have Played Out Differently with Teneo and Palo Alto Cortex XDR

This weekend, I binge-watched Netflix’s new series Zero Day, starring Robert De Niro. The series has sparked excitement and curiosity among cybersecurity enthusiasts and political thriller fans alike. As the title suggests, the show revolves around a cyberattack that exploits unknown vulnerabilities—so-called “zero days”—to wreak havoc on critical systems. But what if the organizations targeted in Zero Day had the right cybersecurity strategy in place?

Maximizing Azure Network Insights with VNet Flow Logs

Join Kentik’s Phil Gervasi and Chris O’Brien in this LinkedIn Live replay as they discuss how VNet flow logs in Microsoft Azure boost network observability far beyond what’s possible with NSG flow logs. Learn how easier deployment, comprehensive visibility, and advanced analytics—integrated with AI-driven query capabilities—can help optimize your Azure (and multi-cloud) environment.

How to Monitor Snowflake with OpenTelemetry

Snowflake is a powerful, cloud-based data platform designed for high-performance analytics. Whether you're running massive analytical queries, managing structured and semi-structured data, or optimizing data pipelines, visibility into your Snowflake instance is essential. Performance bottlenecks, query execution delays, and unexpected cost spikes can quickly become issues without proper monitoring.

Instrument Google Cloud Run applications with the new Datadog Agent sidecar

Google Cloud Run is a fully managed service that allows you to deploy, manage, and scale workloads on serverless containers. Because Cloud Run abstracts away infrastructure management and runs on complex, distributed backends, it can be difficult to troubleshoot. Datadog’s integrations with Google Cloud and Google Cloud Run address that challenge by collecting and visualizing key metrics and logs.

Grafana Loki 101: How to ingest logs with Alloy or the OpenTelemetry Collector

Logs play a critical role in observability, but they do come with their own challenges. Grafana Loki, our horizontally scalable, highly available, multi-tenant log aggregation system, addresses these challenges head on, giving you an open source tool that’s both cost effective and easy to operate.

February 2025 Box Outage: Timeline and Post-Mortem

Box.com is a cloud-based content management and file-sharing platform designed for the enterprise and used by nearly 100,000 companies around the world. When a Box outage strikes, businesses can experience costly disruptions. On February 19, 2025, a disruption in core Box services including uploads, downloads, and the All Files page, affected thousands who depend on the cloud storage and collaboration platform.

Migrating to cloud: Top five reasons

Since the inception of public clouds, a lot of CXOs have considered moving their IT infrastructure to the cloud and many have already done that. If your organization is considering migration to the cloud, learn what drove this mass movement from on-premises servers to the cloud. In this article, we'll explain the major reasons why organizations prefer the cloud, the issues you should watch out for, and how you should protect your cloud infrastructure.

Conquering Data Overload at Ingestion - Tech Talks #2

Join us for our second Tech Talk, where we’ll tackle log ingestion challenges and explore how VictoriaLogs makes log management effortless with the following: Modern infrastructure produces an overwhelming volume of log data, but traditional log management solutions struggle with scalability, performance, and cost.

Troubleshoot Kubernetes Performance Issues with AI

Struggling with Kubernetes performance issues? This video introduces an AI-powered agent designed to help users quickly identify and resolve bottlenecks. By analyzing logs, the AI detects performance issues, streamlining troubleshooting and improving system efficiency. Watch now to see how AI can simplify Kubernetes performance management and keep your infrastructure running smoothly!

The One Where We Meet Cribl Copilot

We’re kicking off our new live weekly product demo series, streaming on YouTube, X, and LinkedIn! Each week, we’ll dive into the latest features and hidden gems from the Cribl Suite of tools to help you unlock the full potential of your telemetry data. For our first session, we’re thrilled to welcome Nikhil Mungel, the visionary behind Cribl Copilot. This AI-powered assistant is designed to instantly surface answers from the documentation and build pipelines with just a simple request.

Free network monitoring: Full network visibility without the cost

Investing in a network monitoring tool should mean complete visibility and faster troubleshooting. But what happens when an unexpected outage occurs and your expensive tool misses the warning signs? The result: hours of downtime, frustrated employees, and lost business productivity. Many organizations face this challenge, realizing that even premium monitoring solutions can leave critical gaps. The good news? You don’t have to break the bank to monitor your network effectively.

Optimize MTTD with the right check frequency

Checkly enables engineers to automate the monitoring of their production services. Using the automation framework Playwright, you can run an end-to-end test on a regular cadence to make sure every feature is working for your users. But once you’ve got your check set up, either with Playwright scripting, a Terraform template, or an OpenAPI spec, we come to the question of what frequency you should run these checks. Should you be checking every few minutes, or every hour?
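
As a rough way to reason about that question, the sketch below relates check interval to detection delay. It rests on simplifying assumptions that are not from the original post: a failure starts at a uniformly random moment between two checks, the check itself takes no time, and the first failing check raises the alert. The function name `detection_delay_minutes` is hypothetical.

```python
def detection_delay_minutes(interval_minutes):
    """Return (average, worst_case) minutes between a failure and its detection."""
    # Worst case: the failure begins right after a check has passed,
    # so a full interval elapses before the next check sees it.
    worst_case = interval_minutes
    # On average, a failure lands halfway between two checks.
    average = interval_minutes / 2
    return average, worst_case


for interval in (1, 10, 60):
    average, worst_case = detection_delay_minutes(interval)
    print(f"every {interval} min: average {average} min, worst case {worst_case} min")
```

Under these assumptions, moving from hourly checks to checks every 10 minutes cuts the average detection delay from 30 minutes to 5, at the cost of six times as many check runs.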

How Forbes delivers a premium digital experience with Datadog

Learn how Forbes, a global media powerhouse, successfully migrated to the cloud with Datadog. Discover how they enabled their teams across their entire tech stack to access IT data and make critical improvements. The team maintained a 99.5 percent uptime through proactive alerting and improved root cause analysis by 10 percent.

Breaking Free from Legacy Observability: Why Service Providers Choose Kentik Over Deepfield

Modern network operators need modern observability tools. In this post, we explore why Deepfield — a traditional network flow analytics platform — falls short in providing comprehensive insights required for today’s network operations, and how Kentik’s modern data platform is purpose-built for today’s infrastructure teams.

Increase control and reduce noise in your AWS logs using Datadog Observability Pipelines

Today’s SRE and security operations center (SOC) teams often find themselves overwhelmed by the sheer volume and variety of logs generated by critical AWS services such as VPC Flow Logs, AWS WAF, and Amazon CloudFront. While these logs can be valuable for detecting and investigating security threats, as well as troubleshooting issues in your environment, managing them at scale can be challenging and costly.

A deep dive into Database Monitoring index recommendations

Datadog Database Monitoring (DBM) Recommendations help you proactively optimize performance throughout your database fleet. DBM draws on a wide range of data sources in order to detect and provide actionable guidance on issues such as blocking queries, low disk space, and missing indexes. In this post, we’ll show you how DBM formulates targeted indexing recommendations to help you optimize database performance.

How to use locators to design more resilient synthetic tests

Most modern web applications are frequently updated to implement new features, execute marketing campaigns, or enhance their UX with new libraries or APIs. While this helps you better engage your users, constant UI updates make designing flexible, long-lasting tests challenging.

SolarWinds Observability 2025.1: Big Cloud Updates for GCP, AWS & Azure!

New cloud support has landed in SolarWinds Observability 2025.1! Now with expanded monitoring for Google Cloud, AWS, and Azure, you can track even more cloud entities with ease. What’s new? Google Cloud: now supports Google Compute Engine. Azure: new support for Azure App Service and Blob Storage. AWS: expanded RDS support (MySQL, Aurora, PostgreSQL, Oracle) plus Load Balancer monitoring. See it in action! We explore the latest dashboards and drill into cloud resources like virtual machines, databases, and storage.

Perses - A new language for dashboards?

One of the most interesting stories in the dashboarding space over the past year or so has been the emergence of the Perses project. This is an open source project which not only provides a platform for dashboard creation, but also sets itself the very ambitious target of defining a common standard for dashboards as code. As a SquaredUp user, you may be wondering why we might want to talk about a potentially competing technology. Well, obviously, being SquaredUp, dashboards are in our DNA.

How well-designed automations lead to efficient orchestration in AWS

Managing resources efficiently in cloud-based environments like AWS is crucial for scalability, security, and cost-effectiveness. Automation is key to eliminating manual intervention in routine tasks, while orchestration ensures that these automated tasks are executed in a structured, coordinated manner. In AWS, leveraging well-designed automation enhances orchestration, enabling organizations to optimize performance, resource utilization, and security while maintaining operational agility.

Elastic achieves AWS Government ISV Partner Competency, strengthening public sector solutions portfolio

Advancing digital transformation in government through Search AI and cloud innovation. We’re thrilled to share that Elastic has achieved the AWS Government ISV Partner Competency. This prestigious designation recognizes Elastic as an Amazon Web Services (AWS) partner that has proven expertise in delivering high-quality solutions that help government agencies meet mandates, reduce costs, drive efficiencies, and boost innovation.

Guide To Confluent Kafka vs Apache Kafka

Kafka is an open-source distributed streaming platform for high-throughput and fault-tolerant real-time data streaming in large-scale systems. It can integrate with a wide range of data sources and sinks, which include databases, message queues, big data processing frameworks like Apache Spark and Apache Flink, and many more.

Getting Ready with Regex 101

If you’ve dropped your house key in tall grass, you know how difficult it is to locate a small item hiding in an overgrown field. Perhaps, you borrowed a metal detector from a friend, then returned to the field hoping to get the loud beep that indicates finding metal in an otherwise organic area. Trying to find patterns in strings of data is the same process.

OpenTelemetry Visualization Setup: A Developer's Guide

If you've ever tried to set up OpenTelemetry visualization, you know it can be a bit overwhelming. But don't worry—in this guide, we'll break it all down step by step. Whether you're just getting started or looking to fine-tune your existing setup, this walkthrough will help you get the most out of your telemetry data.

How to Use OpenSearch with Python for Search and Analytics

If you're working with search and analytics, you’ve probably heard about OpenSearch—the open-source alternative to Elasticsearch. OpenSearch is a powerful tool, whether you're building a search engine, running log analytics, or implementing full-text search in your applications. And the best part? You can integrate it easily with Python.

Making sure you get a Checkly alert for every detected failure

It’s every ops team’s biggest anxiety: a monitoring system detects a failure, but the notification either isn’t delivered or isn’t noticed by the team. Now we have to wait for users to complain before our team knows about the problem. Checkly sends an alert every time the system detects a failure, but how can you be sure you’re getting those alerts, and that those alerts are going to the right people?

Prometheus Monitoring: Instant Queries and Range Queries Explained

Over the years, we’ve received many questions about MetricsQL/PromQL, even from experienced users, especially regarding range queries and instant queries. This article covers the basics, but it turns out to be really important for explaining why your query behaves the way it does. This discussion is part of the basic monitoring series, an effort to eliminate confusion in monitoring for both beginners and experienced users.

React.js Performance Guide

Which JS framework is the most performant? React, Vue, Svelte, Angular,…? When trying to answer this question, we often get lost in comparing benchmarks for reactivity, bundle size, memory usage and other factors. Of course we want to choose the best framework to create performant apps! But your app will only benefit from framework performance if you also follow best practices for performance optimization of web apps in general, and React apps in particular. So, where to start?

What is Network Availability: Your Guide to 99.9% Uptime

In the fast-paced world of network admins, where data flows like the heartbeat of an organization, network availability is the top priority. For admins, it's not just another term, it's a make-or-break factor for the success and smooth operation of everything they manage. In a world where we're all plugged in and counting on the constant exchange of info, getting network availability right is absolutely critical.
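
To put a number on the uptime target in the title, the sketch below converts an availability percentage into a downtime budget. It assumes the 99.9 figure means 99.9 percent availability measured over a fixed period; the function name `allowed_downtime_minutes` is made up for this example.

```python
def allowed_downtime_minutes(availability_percent, period_days):
    """Return how many minutes of downtime fit within an availability target."""
    total_minutes = period_days * 24 * 60
    # The share of time that may be spent down is whatever the target leaves over.
    return total_minutes * (1 - availability_percent / 100)


# 99.9 percent over a 30-day month leaves roughly 43.2 minutes of downtime.
print(round(allowed_downtime_minutes(99.9, 30), 1))
# Over a 365-day year the same target leaves roughly 8.76 hours.
print(round(allowed_downtime_minutes(99.9, 365) / 60, 2))
```

Each extra nine shrinks the budget tenfold, so 99.99 percent over the same 30 days leaves only about 4.3 minutes.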

It was DNS Again: Why Your Status Page Needs Its Own Domain

On February 20, 2025, at 16:22 UTC, StatusGator detected an outage affecting Vultr. The issue appeared to stem from a DNS failure, causing vultr.com and any other services hosted on its domain to become inaccessible. But what does that include? The official Vultr status page. Because Vultr hosts its status page on status.vultr.com, the same domain hosting its primary website and dashboard, users were left without an official source of updates during the outage.

FinOps IT Financial Management

Cloud computing has revolutionized IT infrastructure by offering unparalleled scalability and adaptability. However, organizations face significant challenges when it comes to effectively managing their cloud costs. Traditional IT Financial Management (ITFM) methodologies, designed for on-premises operations, often struggle to address the advanced financial complexities of cloud-based investments. This is where FinOps IT Financial Management takes center stage.
Sponsored Post

Why AIX Monitoring Matters | Reasons, Obstacles, Solutions

AIX monitoring is essential for ensuring enterprise IT reliability, performance, and security. Traditional solutions often lack the depth needed for complex AIX environments, making specialized tools crucial for tracking performance and preventing downtime. As the need for real-time, automated monitoring grows, advanced solutions like NiCE AIX Management Pack integrate with Microsoft SCOM to enhance visibility and system optimization. By leveraging dedicated AIX monitoring, businesses can improve uptime, security, and efficiency, ensuring long-term infrastructure success.

DORA Compliance - An Opportunity for MSPs

For Managed Service Providers (MSPs) in the EU, who serve financial organizations, DORA regulatory compliance is a hot topic. The DORA (Digital Operational Resilience Act) is a new regulation that came into force on Jan 17th, 2025, aimed at ensuring the operational resilience of financial entities in the EU, focusing on technology risk management and minimizing disruptions in critical services.

Drilldown apps: An improved queryless experience for faster insights into your observability data

See how we're improving the apps to help you quickly get insights into your logs, metrics, traces, and profiles, and find out why we changed the name from Explore apps to Drilldown. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more. We also have plans for every use case.

Enhance Network Performance Management With Next-Gen AIOps: Configuring Integration of DX Spectrum With DX Operational Observability

To unlock the power of observability and advanced analytics of AIOps, teams need to collect exceptional monitoring data, establish connections and correlations between the data, and understand context with the help of robust and current topological maps. Because modern networks often span on-premises, cloud, and hybrid infrastructures, monitoring their performance and troubleshooting issues can be difficult. These complex infrastructures often lead to observability gaps for network teams.

Intelligent Alerting with RapidSpike and ilert Integration

When it comes to website performance and uptime, every second counts. Businesses rely on tools like RapidSpike to monitor their digital presence, ensuring websites and applications run smoothly. However, effective alerting and incident management are just as critical as monitoring itself. That’s where ilert comes in.

Debugging a .NET Application with Loggly

As modern applications grow more complex, debugging becomes increasingly challenging. Applications consist of multiple parts which can generate enormous amounts of log data, making debugging difficult. SolarWinds Loggly can help store, manage, and sift through this data. To demonstrate, we’ll set up an application built on .NET Core 9.0 and MongoDB; then, we’ll walk through how to export its logs to Loggly.

Nexthink Recognized in G2's 2025 Best Software Awards

We are honoured to announce that G2 has recognized Nexthink as a leading software company in EMEA in their 2025 Best Software Awards. This achievement builds on our ongoing recognition from G2, where we’ve been named a Category Leader consecutively since 2021. Our success is only possible because of the support we receive from our incredible DEX community.

CLM Chowder: Digging Into the Cloud Latency of Azure, Google Cloud, and OCI

CLM Chowder is a new series which highlights notable observations of cloud connectivity surfaced by Kentik’s Cloud Latency Map. In this edition, we look at measurements from Alibaba (China), latency swings from South Africa, and a temporary latency jump from Marseilles to Asia.

The next generation of Grafana Mimir: Inside Mimir's redesigned architecture for increased reliability

This year Grafana Mimir — the open source, horizontally scalable, multi-tenant time series database (TSDB) — will celebrate its third anniversary. Over the years, Mimir has become the go-to, Prometheus-compatible metrics backend within the open source community, with 29 maintainers and more than 4.6k GitHub stars. Since introducing Mimir, we’ve worked hard to deliver on our promise of making it the most scalable and performant open source TSDB in the world.

Grafana Drilldown apps: the improved queryless experience formerly known as the Explore apps

When we introduced the Explore apps suite for metrics, logs, traces, and profiles last year at ObservabilityCON 2024, our goal was simple: offer a queryless, point-and-click experience so you can quickly find insights in your observability data—no queries or complicated syntax required. Our commitment to that goal remains unchanged, but we’re excited to announce that the Explore apps have a new name: Grafana Drilldown.

Why Internet Performance Monitoring is the new health check for IT organizations

Monitoring has been part of our lives for centuries. We watch ourselves, our environment, and our habits to gain insights and make better decisions. Even the much-dreaded annual health check we line up for each year is just another facet of this age-old process. The goal is simple: spot small red flags now, before they balloon into bigger health complications later. It’s the same principle that has guided us for generations—keeping tabs, so we can correct course before trouble takes hold.

KubeCon 2024 | Interviews with Observability Experts | Observability Insights with Aunsh Chaudhari

In this interview from KubeCon 2024, I sit down with Aunsh Chaudhari, a Product Manager at Splunk, to discuss the biggest trends shaping observability today. With a background in software development and hands-on experience with observability tools, Aunsh shares insights on OpenTelemetry adoption, cost optimization strategies, and the shift toward unified observability. We also touch on emerging topics like AI in observability and the challenges of scaling observability in modern environments.

Integrating OpenTelemetry with Grafana for Better Observability

Modern application observability is essential for ensuring system performance, diagnosing issues, and optimizing user experiences. OpenTelemetry (OTel) and Grafana serve as two key components in achieving end-to-end visibility. While OpenTelemetry focuses on instrumenting applications to collect telemetry data, Grafana specializes in visualizing this data, making it actionable and insightful.

An In-Depth Guide to Java Performance Monitoring for SREs

If you've ever had a Java application slow down in production and struggled to pinpoint the cause, you know the pain of performance issues. Java is a powerful, high-level language, but it doesn’t come without challenges—especially when it comes to resource management, garbage collection, and thread handling. This guide will take you through everything you need to know about Java performance monitoring, from key metrics to tools and best practices.

OpenTelemetry UI: The Ultimate Guide for Developers

If you’ve ever struggled with understanding distributed traces, managing metrics, or debugging complex applications, OpenTelemetry is your best friend. But what about the OpenTelemetry UI? How do you visualize and interact with all that telemetry data? In this guide, we’ll explore the best ways to use OpenTelemetry’s UI options, from setting up a proper observability stack to choosing the right front-end visualization tools.

How to Optimize Websites for Ad Publishers

Optimizing a website for ad publishers is a must for anyone who wants to maximize ad revenue and improve the user experience. A fast, well-optimized website ensures better engagement, higher ad visibility, and increased revenue potential. Additionally, search engines favor well-performing sites, which leads to more organic traffic and greater ad impressions. In this updated guide, we’ll explore fresh strategies to enhance website optimization for ad publishers.

OpenTelemetry: The Future of Observability with Advanced Tracing and Metrics

Hey there! Oscar here. After spending countless hours wrestling with various monitoring tools and proprietary solutions, I wanted to share my thoughts on what I believe is revolutionizing the observability landscape, especially in distributed systems: OpenTelemetry (OTel).

Transform Data with the New Python Processing Engine in InfluxDB 3

In early January, we announced the launch of InfluxDB 3 Core and InfluxDB 3 Enterprise in public alpha. One of the newest included features is the InfluxDB 3 Processing Engine–a Python-based VM built to enable data transformation, enrichment, downsampling, alerting, and more, all from within the database itself. One month later, we’re excited to deliver a big update enabling new ways to interact with and transform your data.

Finding Root Cause Quickly with Logz.io AI Agent

In the video, Jade Lassery discusses how to effectively manage complex environments, especially when faced with unexpected spikes in errors. She introduces a Logz.io AI agent prompt that assists users in quickly identifying the root cause of these issues. By simply asking the right questions, users can streamline their troubleshooting process and enhance their operational efficiency.

Kubernetes made simple: A beginner's guide to managing containers

As applications become more complex, managing containers efficiently is key to scaling and maintaining performance. Kubernetes (also known as K8s) automates this process, making it easier to handle scaling, failures, and uptime. If you're new to Kubernetes, understanding the platform and how it's used is essential for managing your applications seamlessly. Let’s dive in and explore how Kubernetes makes it all possible.

Understanding Root Cause: Domain Name Systems (DNS) and Traceroute

You can think about a website the same way you think about your car. Every time something breaks, a professional—an engineer or a mechanic—usually charges a high amount for the fix (isn’t it annoying when you can’t tell if it’s a big or small fix?). Alternatively, you can learn some basics, get a few inexpensive tools, and troubleshoot many of the immediate issues yourself.

Logging vs. Metrics

When discussing observability, the “big 3” (logs, metrics, and traces) always get mentioned. But for some, more data doesn’t always mean better. Our lead engineer, JJ, had some advice to share about how logs may not be necessary for everyone. Simplifying your observability stack isn’t difficult; you just need to be intentional with implementation.

How APM and synthetic monitoring work together for better performance

Imagine this: A customer tries to log in to your app, but the page takes too long to load. Frustrated, they leave. Meanwhile, your IT team has no clue there was an issue—until complaints start pouring in. Sound familiar? Performance lags are the new downtime. Lags are not just an inconvenience—they lead to lost revenue and frustrated users. To prevent this, organizations turn to application performance monitoring (APM) and synthetic monitoring to maintain peak application performance.

Getting started with Snyk dashboards

If you are involved in software development you will probably be aware of the ever-growing menace of supply chain attacks. These are attempts by attackers to insert malicious code into code libraries which might be downloaded or referenced by developers. Many modern frameworks can install hundreds or even thousands of dependencies, so the potential attack surface can be huge. As well as code libraries, attackers can also attempt to conceal malware in sources such as Docker images or CDNs.

Diagnosing and resolving the 500 internal server error with Apache and Tomcat logs

The dreaded 500 internal server error is a common challenge for web administrators, often signaling a disruption in server operations. Diagnosing the root cause requires in-depth visibility into both web server and application behavior. In this blog, we’ll explore how log management tools simplify the diagnosis and resolution of 500 errors by leveraging insights from both Apache and Tomcat logs.

How to leverage AI to enhance network monitoring in retail: A CXO's guide

The retail industry has evolved into a mix of physical stores, e-commerce, digital payments, and omnichannel interactions. Now, GenAI has been added to this mix, which changes how people shop, how retailers operate, and how employees work. While this shift creates opportunities for retailers of all sizes, it also presents serious challenges in maintaining network performance and staying compliant with industry regulations.

Diagnosing ActiveMQ broker performance issues with log analysis

Apache ActiveMQ is a widely used message broker that enables seamless communication between distributed applications. However, as the volume of messages increases, performance bottlenecks can arise, leading to slow message processing, high latency, broker crashes, and out of memory (OOM) errors. One of the most critical issues affecting ActiveMQ is OOM errors, which occur when the broker exceeds its allocated heap memory. This can result in service failures, message loss, and prolonged downtime.

HTTP/3 is Fast!

HTTP/3 is here, and it’s a big deal for web performance. See just how much faster it makes websites! Wait, wait, wait, what happened to HTTP/2? Wasn’t that all the rage only a few short years ago? It sure was, but there were some problems. To address them, there’s a new version of the venerable protocol working its way through the standards track. Ok, but does HTTP/3 actually make things faster? It sure does, and we’ve got the benchmarks to prove it.

Getting started with Postgres dashboards

In the last few years, Postgres has experienced a meteoric rise in popularity. A relational database that not long ago was relatively unknown outside of academic circles has now eclipsed MySQL as the most popular database for developers in the most recent Stack Overflow user survey. Why has it achieved such impressive popularity with developers?

Manage your network with ManageEngine Site24x7!

As a network administrator, you know how critical it is to ensure seamless network performance, optimize bandwidth, and secure your infrastructure. But with the growing complexity of modern networks, staying on top of everything can be overwhelming. That’s where ManageEngine Site24x7 comes in! In this video, we dive into how Site24x7, a comprehensive network observability solution, empowers you to.

Scale Time Series Workloads on AWS: Introducing Amazon Timestream for InfluxDB Read Replicas

The world runs in real-time. From industrial automation and IoT monitoring to AI-powered analytics, developers rely on time series data to power critical systems and make split-second decisions. But as workloads grow, so do the challenges: keeping queries fast, ensuring high availability, and scaling efficiently without adding operational complexity. Not having to worry about operational overhead enables companies to focus on deriving value from their data.

Manage All Your App Notifications in One Place with AppSignal

Alerts and notifications are the backbone of any Application Performance Monitoring (APM) tool, ensuring your team is immediately aware of critical issues. At AppSignal, we’re always improving our toolkit to help you stay ahead of problems before they impact performance or reliability. We've made huge improvements to how you can manage your app notifications and alerts with AppSignal.

How to do Agentless Monitoring with check_by_ssh

Check plugins are the foundation of Icinga 2. Icinga executes them and maps their return values to the state of either Host or Service objects. Everything else builds on top of that. These check plugins can come from the Monitoring Plugins project or be custom-written. Their origin does not matter: they are the building blocks of an Icinga monitoring stack. If a plugin goes CRITICAL, Icinga 2 alerts the sysadmin.
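
To illustrate how a plugin's return value becomes a state, here is a minimal Python sketch of the threshold logic a custom check plugin might use. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the Monitoring Plugins convention; the function names, the "higher is worse" thresholds, and the output wording are assumptions made for illustration, not part of Icinga itself.

```python
# Exit codes from the Monitoring Plugins convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a plugin state (hypothetical: higher is worse)."""
    if value is None:
        # No measurement could be taken, so the state cannot be determined.
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(label, value, state):
    """Build the single status line a plugin prints before exiting."""
    return f"{STATE_NAMES[state]} - {label} is {value}"


# Example: a disk usage of 95% against thresholds of 80% / 90%.
state = evaluate(95, warn=80, crit=90)
print(format_output("disk usage", 95, state))
# A real plugin would finish with sys.exit(state) so the caller sees the code.
```

Keeping the evaluation separate from the process exit makes the threshold logic easy to unit test before the plugin is wired into a monitoring stack.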

Grafana Cloud updates: Exemptions in Adaptive Logs, GPU monitoring in AI Observability, and more

We consistently roll out helpful updates and fun features in Grafana Cloud, our fully managed observability platform powered by the open source Grafana LGTM Stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics). In case you missed them, here’s our monthly round-up (the first of 2025!) of the latest and greatest Grafana Cloud updates. You can also read about all the features we add to Grafana Cloud in our What’s New in Grafana Cloud documentation.

Optimizing Database Performance, Episode 1: The Solid Foundations of Database Design

Join our resident database expert, Kevin Kline, for our upcoming webcast, “Optimizing Database Performance, Episode 1: The Solid Foundations of Database Design.” We’re going back to basics and focusing on how poor database design can impact even the most powerful and expensive hardware.

Breakpoint recap: Uptime Monitoring, robots, and feature flags galore

Bugs don’t announce themselves politely. They crash your checkout flow, break authentication, or slow your API to a crawl—usually right before your CEO asks how things are going. And when the error inbox is flooded with a hundred variations of TypeError: cannot read property of undefined, figuring out what actually matters can feel impossible.

What Are Network Packets & How to Monitor Them: The Secret Life Of Network Packets

Ever wonder how the Internet actually works? It’s not just magic (though it sometimes feels that way). Behind every webpage you load, every video call you make, and every meme you send, tiny digital messengers called network packets are zipping through cyberspace, carrying data from one point to another. Think of them as the text messages of the Internet; small, efficient, and sometimes frustrating when they don’t arrive on time. But what exactly are network packets? How do they work?

Bindplane Expands Partnership with Google Cloud

We're only one month into 2025, but the momentum keeps building at Bindplane. In January, we rebranded our company as Bindplane, aligning our company name with our core mission: delivering the best OpenTelemetry-native telemetry pipeline on the market. Building on that excitement, we have another announcement: we've expanded and extended our partnership with Google Cloud.

Investigating Kubernetes Issues with Papertrail

While Kubernetes aims to streamline containerized application management, its multi-layered architecture creates potential points of failure. Problems in any of these layers can manifest as application crashes, resource overutilization, or failed deployments, making cluster maintenance a persistent challenge. Kubernetes meticulously logs all aspects of cluster activity and application output, from individual Pods to ReplicaSets.

Slicing Up-and Iterating on-SLOs

One of the main pieces of advice about Service Level Objectives (SLOs) is that they should focus on the user experience. Invariably, this leads to people further down the stack asking, “But how do I make my work fit the users?”—to which the answer is to redefine what we mean by “user.” In the end, a user is anyone who uses whatever it is you’re measuring.

The integration directory is here!

Keeping your team informed about service disruptions has never been easier. The StatusGator Integration directory, available in the Menu on our website, allows you to explore all the integrations we support in one place. From collaboration tools to incident management platforms, we help you integrate status updates seamlessly into your workflow.

Beyond Their Intended Scope: BGP Goof-UPX

In this second installment of Beyond Their Intended Scope, we analyze a recent BGP leak out of Brazil that briefly affected networks around the world. Because this routing mishap was a path leak (i.e., it did not involve any mis-originations and was therefore immune from RPKI ROV protection), it demonstrates why we need a thing called ASPA … ASAP.

Easiest Way to Monitor NGINX Performance with OpenTelemetry

If you're looking for a straightforward way to collect NGINX metrics via OpenTelemetry and send them to your Graphite-based monitoring setup, this article is for you! With minimal configuration you’ll be collecting key metrics from your NGINX connections within minutes. In this article, we'll explain how to install the OpenTelemetry Collector, and easily configure it to receive and export NGINX metrics to a Hosted Carbon endpoint.

AI & Gartner's Strategic Roadmap Timeline for Cybersecurity - A Perspective from Teneo

The integration of artificial intelligence (AI) presents both unprecedented opportunities and emerging threats. Gartner’s Strategic Roadmap for Cybersecurity Leadership emphasizes the need for adaptive strategies that align with business objectives and technological advancements. Concurrently, the UK’s National Cyber Security Centre (NCSC) has highlighted the dual-edged nature of AI in its report on the impact of AI on cyber threats.

Observability for your NodeJS AWS Serverless Applications

Hi there, and welcome to the first video in this series on observing AWS serverless applications with Datadog. In this video, you’ll learn how easy it is to get started observing your serverless NodeJS applications using Datadog and the AWS CDK. You’ll also look at how you can use the Datadog console to diagnose latency issues and errors inside your application. You’ll walk away with an understanding of how to instrument your Lambda functions with the AWS CDK, as well as practical steps you can take to debug your applications.

How to observe AWS Lambda functions using the OpenTelemetry Collector and Grafana Cloud

Getting telemetry data out of modern applications is very straightforward—or at least it should be. You set up a collector that either receives data from your application or asks it to provide an up-to-date state of various counters. This happens every minute or so, and if it’s a second late or early, no one really bats an eye. But what if the application isn’t around for long? What if every second waiting for the data to be collected is billed?

Self-Healing Infrastructure: Start Your Journey Now

Every CIO’s ultimate goal is to create a self-healing enterprise. Self-healing IT systems have the ability to proactively prevent issues within the IT environment, ensuring seamless and uninterrupted services that support business continuity. While automating every possible task seems like an obvious solution, implementing changes in a production environment can be challenging.

Analysts Share Their 2025 Cybersecurity Predictions

It's the start of a new year. Like last year, I want to examine what analysts are predicting for the cybersecurity landscape in 2025 and the risks they feel will be front and center. There is no shortage of predictions for this year’s cybersecurity landscape outlook—so many, it's impossible to compile them all. While not a thorough summary of the threats and risks in 2025, this article highlights the most common topics covered by cybersecurity specialists.

A Quick Guide for OpenTelemetry Python Instrumentation

OpenTelemetry is an open-source tool that helps you keep an eye on your application’s performance. Whether you’re building microservices, using serverless setups, or working with a traditional monolithic app, it’s crucial to monitor and trace your app’s behavior for debugging and optimization. OpenTelemetry's Python instrumentation is an excellent way to track traces, metrics, and logs across your entire app.

AIOps Across the Board: 3 Industry Use Cases That Leverage the Power of the ScienceLogic Platform

The ScienceLogic AI Platform and Skylar AI enable organizations to maintain the performance, health, and security of their IT environments. By providing comprehensive observability enhanced by unsupervised artificial intelligence, they turn data into actionable insights. With its unparalleled visibility and intelligence across complex multi-cloud, hybrid, and on-premises infrastructures, IT teams across the globe use ScienceLogic to proactively monitor, automate, and optimize operations.

Tomcat Logs: Locations, Types, Configuration, and Best Practices

Apache Tomcat logs are essential for monitoring, debugging, and maintaining Java applications running on Tomcat. These logs capture critical information such as server startup details, request handling, and application errors. They help developers and system administrators troubleshoot issues, analyze traffic, and ensure application stability. Tomcat generates multiple logs, each serving a distinct purpose.

Helm vs Terraform: A Detailed Comparison for Developers

When managing infrastructure and deploying applications in a cloud-native environment, two popular tools that developers often compare are Helm and Terraform. While both are used to automate deployments, they serve different purposes and operate in distinct ways. Understanding the differences can help you make the right choice for your use case.

Eliminate log sprawl and cut costs with Sumo Logic

How much money is your company wasting on using multiple tools for log ingestion? Security analysts, developers, and operations teams all rely on logs. But, when each team uses different and multiple tools to store and analyze logs, it leads to tool sprawl, wasted resources, and lost critical data. With Sumo Logic’s Log Analytics Platform, you get a single source of truth for all your log data. Gain context-driven insights into your performance, availability, security status, and threats, all while eliminating wasteful spending.

Stronger together: (Agentic) AIOps and observability are the keys to IT resilience

Every new layer of infrastructure piles onto an already fragile web of interconnected challenges, making it painfully clear: traditional monitoring can’t keep up. You’re drowning in alerts, buried in data, and yet somehow still flying blind when real issues arise. More notifications don’t mean more insight, and more data doesn’t guarantee better decisions.

How Dotcom-Monitor Enhances Your API Monitoring

APIs (Application Programming Interfaces) play a crucial role in connecting applications, facilitating data exchange, and ensuring seamless user experiences. However, APIs are only as effective as their reliability and performance. Without proper monitoring, even the most well-designed API can encounter issues such as slow response times, unexpected downtime, or security vulnerabilities.

Enhancing Network Reliability: How to Measure, Test & Improve It

Whether you're a business owner or an IT pro, you know that a solid network is the foundation of your organization’s success. And that’s where Network Reliability comes in. Think about it: a key video call with a client, a crucial file transfer right before a deadline, or an online transaction on your e-commerce site. What do they all have in common? They all depend on a network that just works—no glitches, no interruptions.

Effortless, Cost-Effective VMware Monitoring with NiCE

Managing a VMware environment can be complex, time-consuming, and expensive — unless you have the right monitoring solution. At NiCE, we pride ourselves on delivering intuitive, cost-effective monitoring solutions that simplify IT operations. One of our recent customers shared their experience with the NiCE VMware Management Pack, and their words speak for themselves.

Why use Playwright in Catchpoint for synthetic monitoring

Modern websites demand constant oversight to ensure every click, login, and checkout runs smoothly. That’s where synthetic monitoring shines: it acts like a tireless, virtual visitor that spots performance hiccups before they can bother real users. Our Internet Performance Monitoring (IPM) platform features Playwright support. You can run new or existing Playwright scripts with little to no changes.

Integrating FinOps and ITSM for Optimal Cloud Cost Management

The adoption of cloud computing has revolutionized how businesses manage IT infrastructure accountability and budget control. As cloud offerings become increasingly complex and scalable, modern business environments demand improved financial management practices. Through its data-driven and collaborative approach, FinOps IT Service Management bridges the gap between engineering teams, business units, and finance departments, ensuring maximum cloud benefit consumption while optimizing expenses.

An Easy Guide to OpenFeature Flagging

In software development, feature flags have become an essential tool for teams looking to deploy code with more control and agility. OpenFeature flagging, in particular, stands out as an open-source standard that’s revolutionizing how teams manage feature rollouts, experiments, and toggling. In this guide, we’ll understand what OpenFeature flagging is, its key benefits, how to implement it, and best practices to help you get the most out of it.

What is DynamoDB Throttling and How to Fix It

When you're working with DynamoDB, one of the most critical things you need to keep an eye on is throttling. If you're not careful, throttling can severely impact your database's performance. It’s not just about slower response times—throttling can lead to system failures or unexpected downtime if not addressed properly.

Wiring Up a Next.js Self-Hosted Application to Honeycomb

Are you attempting to connect Honeycomb to a standalone (not hosted with Vercel) Next.js application? Most of the Next.js OpenTelemetry samples in the wild show how to connect Next.js to Vercel’s observability solution when hosting on their platform. But what if you’re hosting your own standalone Next.js server on Node.js? This blog post will get you started ingesting your Next.js application’s telemetry into Honeycomb.

Why a mobile app is the key to better incident communication

While downtime is inevitable, communication should remain swift and transparent. Businesses need a way to relay updates as incidents unfold, ensuring customers, internal teams, and stakeholders stay informed in real time. Relying on emails and web-based updates alone is no longer enough. A mobile-first approach is the solution.

Introducing the Middleware Adoption Journey

Middleware plays a crucial role in modern IT infrastructure by enabling seamless communication between applications, systems, and services. It facilitates data exchange, enhances interoperability, and supports various business functions by providing capabilities like messaging, transaction management, and integration services. Over time, middleware has evolved from simple message brokers to sophisticated platforms supporting APIs, cloud computing, microservices, and event-driven architectures.

Top reasons why businesses lose trust after acquisition and how you can be smart

Did you wake up to the news that your favorite tool was acquired? You probably got used to the tool's intuitive interface, cost-effectiveness, and feature set, which aligned perfectly with your day-to-day requirements. Your disappointment doesn't end here. It's just the beginning of a series of potential negative consequences of acquisitions.

Passing the Phone: Auvik Way Edition

The Auvik Way, in Action! At Auvik, our culture isn’t just words on a page—it’s what drives us every day. The Auvik Way is a set of principles that shape how we work, collaborate, and grow together. At RKO, we decided to bring it to life in a fun way! We asked Auvikians to "pass the phone" to a coworker who truly embodies one of our principles—and the results were nothing short of inspiring. Check out the video to see how our team uplifts, supports, and celebrates each other.

FOSDEM 2025 recap

In case you haven’t heard about it yet, FOSDEM (Free and Open Source Software Developers’ European Meeting) is a huge, free gathering for open-source software enthusiasts that happens every February in Brussels, Belgium. It’s a non-profit event put together by the community, and it’s one of the biggest of its kind: we’re talking about around 10,000 people from all over the world coming to hang out and talk about all things open source.

SRE Challenges & APM Solutions

Site Reliability Engineers (SREs) face constant challenges as cloud environments and microservices grow more complex. Performance issues often go unnoticed until they escalate, leading to downtime and disruptions. With Site24x7 APM, you can stay ahead of issues before they impact your business. Our Application Performance Monitoring (APM) solution provides real-time insights, predictive analytics, and deep visibility across your entire IT ecosystem—helping you.

Native AWS Integrations with AutoDiscovery

For developers, the main quest is building and scaling their applications—not struggling with complex monitoring setups. Yet, observability in cloud-native environments is essential, and configuring monitoring for AWS services has traditionally been a complex and manual process. Developers had to set up Firehose streams, CloudWatch metric streams, and log subscriptions, all while ensuring continuous maintenance for new instances, turning observability into an unwelcome side quest.

High Cardinality Explained: The Basics Without the Jargon

Cardinality refers to the number of unique values in a dataset column. A column with many distinct values—like a user ID or timestamp—has high cardinality, while a column with limited distinct values—like a boolean flag (true/false) or a category with a few possible options—has low cardinality. For example, consider a database of an e-commerce platform.
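
The definition above can be sketched in a few lines of Python. The sample rows, column names, and the `column_cardinality` helper are hypothetical and exist only to show the counting; a real database would compute this with its own query engine.

```python
def column_cardinality(rows):
    """Return {column: number of distinct values} for a list of dict rows."""
    distinct = {}
    for row in rows:
        for column, value in row.items():
            # Collect each column's values in a set so duplicates collapse.
            distinct.setdefault(column, set()).add(value)
    return {column: len(values) for column, values in distinct.items()}


# Hypothetical order records from an e-commerce platform.
orders = [
    {"user_id": "u-1001", "is_gift": False, "category": "books"},
    {"user_id": "u-1002", "is_gift": True, "category": "books"},
    {"user_id": "u-1003", "is_gift": False, "category": "toys"},
    {"user_id": "u-1004", "is_gift": False, "category": "books"},
]

for column, count in column_cardinality(orders).items():
    print(f"{column}: {count} distinct value(s)")
```

In this sample, `user_id` has as many distinct values as there are rows (high cardinality), while `is_gift` and `category` stay at two each no matter how many rows are added (low cardinality).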

Log Retention: Policies, Best Practices & Tools (With Examples)

Logs are the backbone of debugging, security, compliance, and performance monitoring. But if you don’t manage retention properly, you’ll either drown in unnecessary data or lose critical insights too soon. Log retention is all about striking a balance between keeping what’s necessary and discarding what’s not.

Understanding Syslog Formats: A Quick and Easy Guide

Syslog is the backbone of logging in many Linux and Unix-based systems, playing a crucial role in monitoring, debugging, and auditing. But not all syslog messages are created equal. Depending on your system, software, and logging configuration, syslog messages may follow different formats. This guide walks you through the different syslog formats, why they matter, and how to work with them effectively.
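
As a rough illustration of working with more than one format, the sketch below decodes the `<PRI>` prefix that both the older BSD style (RFC 3164) and the newer RFC 5424 style share, where the priority value equals facility × 8 + severity. The format detection is a simplifying assumption (a version number directly after the prefix suggests RFC 5424), and the `parse_pri` helper name is hypothetical.

```python
import re

PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]


def parse_pri(message):
    """Decode the <PRI> prefix and guess which syslog format follows it."""
    match = PRI_PATTERN.match(message)
    if not match:
        raise ValueError("message has no <PRI> prefix")
    pri = int(match.group(1))
    rest = match.group(2)
    # PRI = facility * 8 + severity, so divmod recovers both parts.
    facility, severity = divmod(pri, 8)
    # Heuristic: RFC 5424 puts a version number and a space right after the PRI.
    fmt = "rfc5424" if re.match(r"^\d{1,2} ", rest) else "rfc3164"
    return {"facility": facility, "severity": SEVERITIES[severity], "format": fmt}


print(parse_pri("<34>Oct 11 22:14:15 mymachine su: 'su root' failed"))
print(parse_pri("<165>1 2003-08-24T05:14:15.000003-07:00 host app - - - started"))
```

A production parser would go on to split the timestamp, hostname, and message fields according to whichever format was detected.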

What is agentic AIOps, and why is it crucial for modern IT?

Every minute of system downtime costs enterprises a minimum of $5,000. With IT infrastructure growing more complex by the day, companies are put at risk of even greater losses. Adding insult to injury, traditional operations tools are woefully out of date. They can’t predict failures fast enough. They can’t scale with growing infrastructure.

Managing resource contention in Google App Engine: Best practices for optimal performance

Use case 1: When unexpected traffic surges lead to slower responses. A sudden surge in user traffic during a high-demand event causes strain on resources in a cloud-based application running on App Engine. The platform automatically scales instances to handle the increased load, but since compute resources are shared, some instances experience CPU throttling. This leads to slower response times, delayed processing of critical operations, and potential errors that impact user experience. How to resolve it.

What is Time Series Data?

Time series data is particularly prevalent, seen across numerous different industries and use cases. It offers significant value to various organizations, highlighting the importance of effectively monitoring and analyzing the data. By analyzing and monitoring time series data you can understand trends, patterns, and anomalies in sequential data collected at many points in time.

Introducing Learning journeys: New step-by-step guides to get started with Grafana

Our Big Tent philosophy provides the foundation for our broad, modular, and flexible observability platform. With Grafana’s powerful ability to integrate with a wide range of data sources, tools, and plugins, you can create customized solutions tailored to your unique needs.

The Role of ServiceOps in Enhancing IT Service Delivery and Efficiency

Providing quick and effective IT services is paramount for organizational success in dynamic business operations. As technology develops, it creates new obstacles for IT teams that must sustain standards of service excellence and operational effectiveness. The recently developed ServiceOps approach transforms IT service management (ITSM) in a way that surpasses all organizational needs.

Improve developer experience and collaboration with Software Catalog

As software ecosystems grow more complex and fragmented, organizations are finding it harder to manage the thousands of interdependencies that make up their environments. For starters, engineers are collectively struggling to uphold security and reliability standards throughout their organizations because they lack a shared view of these complex software landscapes.

SolarWinds 2025.1: New Network Device Support You Need to See!

Discover what’s new in SolarWinds Platform 2025.1! This update brings expanded network device support for Aruba, Fortinet, Ruckus Smart Zone Wireless, and Extreme Networks. Get hardware health insights, Layer 2 & 3 metrics, VLAN details, routing table utilization, and more!

How to use APM data to improve your CI/CD pipeline performance

Agile production has become the norm for software development cycles. The backbone for such a fast-paced landscape is the continuous integration and continuous delivery (CI/CD) pipeline. But merely depending on the CI/CD pipeline isn’t enough, even though the automated workflows give you a competitive edge. The pipeline needs to be optimized to function at its best. This is where monitoring your applications within the pipeline can be a game-changer.

The Advanced Data Compression Techniques That Quietly Power Logz.io's AI Observability Agents

As an observability leader, at Logz.io, we pride ourselves on continuous innovation. That’s why, last year, we released our AI agents to revolutionize observability by helping businesses, and their engineering and DevOps teams, automate data analysis and root cause analysis. The primary way in which engineering and DevOps teams interact with the agents is by asking performance, troubleshooting, and optimization-related questions.

Types of Pods in Kubernetes: An In-depth Guide

When working with Kubernetes, pods are the fundamental building blocks of deployment. But not all pods are created equal. Understanding the different types of pods and their use cases is crucial for optimizing workloads, ensuring reliability, and maintaining efficiency in your cluster. Let's break it all down.

Telemetry Data Platform: Everything You Need to Know

As systems grow more distributed and complex, having a reliable way to monitor and understand what's happening across your infrastructure becomes essential. Telemetry data provides the visibility needed to keep everything running smoothly, whether you're managing microservices, cloud environments, or sophisticated AI systems. In this guide, we’ll break down what a telemetry data platform is, why it’s so important, and how you can choose the right one to meet your needs.

Uptime Monitoring: A Complete Beginner's Guide

Uptime monitoring checks whether a website, server, or online service is available. It runs automated tests at set intervals, verifying responses and sending alerts if a failure occurs. Businesses rely on uptime monitoring to detect issues early, prevent revenue loss, and maintain customer trust. A website outage can harm reputation, impact SEO rankings, and disrupt operations.
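The check-and-alert loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular monitoring product works: `http_probe`, `run_checks`, the two-failure alert threshold, and the 60-second interval are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and HTTP 4xx/5xx all count as "down".
        return False


def run_checks(
    probe: Callable[[], bool],
    alert: Callable[[str], None],
    failures_before_alert: int = 2,  # assumed threshold, tune to taste
    rounds: int = 3,
    interval_s: float = 60.0,        # assumed check interval
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Run the probe `rounds` times and alert once per outage."""
    consecutive_failures = 0
    results = []
    for i in range(rounds):
        ok = probe()
        results.append(ok)
        consecutive_failures = 0 if ok else consecutive_failures + 1
        # Alert exactly when the threshold is reached, not on every later failure.
        if consecutive_failures == failures_before_alert:
            alert(f"target down after {consecutive_failures} consecutive failed checks")
        if i < rounds - 1:
            sleep(interval_s)
    return results
```

A real check might be wired up as `run_checks(lambda: http_probe("https://example.com"), print)`. Requiring more than one consecutive failure before alerting is a common way to avoid paging on a single dropped request.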

Deeper Trace Analytics - Analyze Root & Entry Spans with Ease

Debugging distributed systems can often feel like searching for a needle in a haystack. When issues arise, engineers need faster ways to pinpoint critical spans within their traces. With our latest Deeper Trace Analytics update, SigNoz now enables powerful filtering for root and entry spans—making it significantly easier to analyze and debug distributed traces.

Discovering the Magic Behind OpenTelemetry Instrumentation - Jose Gomez-Selles | Fosdem 2025

Instrumentation is the secret ingredient that brings observability to life, revealing the intricate workings of applications in ways logs and metrics alone can’t match. In this talk, we’ll dive deep into the magic of OpenTelemetry instrumentation, exploring how to uncover hidden insights within your applications and services.

Evaluating Cloud Gateways for Cost and Performance

Cloud networking costs can escalate due to inefficient routing and limited visibility. Kentik’s cloud visibility and analytics solution helps engineers optimize transit, reduce costs, and improve performance by analyzing AWS Transit Gateways and exploring alternatives like direct peering, storage endpoints, and AWS CloudWAN.

From Detection to Prevention: Leveraging InfluxDB for Cybersecurity and IoT Threat Mitigation

Cybersecurity in the Industrial Internet of Things (IIoT) is often overlooked despite powering critical infrastructure such as energy grids, telecom networks, factories, robotics, and aerospace, all of which are prime targets for cyberattacks and data breaches. A single breach can disrupt essential services or expose sensitive data. So, how do we stay ahead of bad actors and proactively defend these systems?

Preempting Problems in a Sociotechnical System

Here at Honeycomb, we emphasize that organizations are sociotechnical systems. At a high level, that means that “wet-brained” people and the stuff they do is irreducible to “dry-brained” computations. That cashes out as the inability to ultimately remove or replace people in organizations with computers, in spite of what artificial general intelligence (AGI) ideologues would have you believe.

Crafting effective cloud architecture diagrams: A comprehensive guide

Cloud architecture diagrams play a crucial role in communication, planning, and execution within the realm of cloud computing. They provide a visual depiction of the infrastructure, highlighting the interconnections between different components and their collaborative functionality. In this guide, we will delve into the five fundamental factors that every cloud architect should consider when crafting a cloud infrastructure.

Grafana Loki 3.4: Standardized storage config, sizing guidance, and Promtail merging into Alloy

The Grafana Loki 3.4 release is here, and it brings a fresh wave of enhancements aimed at standardizing Loki’s object storage, helping you right size your instance, and improving the ability to ingest out-of-order logs. Loki 3.4 also represents the official merging of Promtail into Grafana Alloy as part of our efforts to give our users a single telemetry collector. There’s a lot to go over, so let’s dive in.

The ROI of Developer-First Observability: Why It's a Game Changer

In today’s fast-paced software landscape, downtime is costly, debugging is time-consuming, and developers are constantly under pressure to resolve issues quickly. Observability tools have traditionally been built for operations and SRE teams, focusing on post-mortem analysis rather than proactive debugging. When developers gain real-time insights into live applications and fix issues without disrupting the software lifecycle, it has proven to be a game changer for a myriad of reasons.

Scraping NGINX Metrics with OpenTelemetry & Exporting to Carbon

Looking for a straightforward way to collect NGINX metrics with OpenTelemetry and send them to your Graphite-based monitoring setup? Unlike Prometheus, which requires configuring scrape jobs and query language nuances, Carbon/Graphite offers a simpler setup with minimal overhead—just send metrics as plain text and query them easily with familiar tools like Grafana. Whether you're setting up dashboards, alerts, or just keeping an eye on traffic, this guide will get you actionable insights in no time!
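To make "send metrics as plain text" concrete, here is a small sketch of Carbon's plaintext line format, `<metric.path> <value> <unix_timestamp>`, one metric per line. It does not reproduce the OpenTelemetry pipeline the guide covers. The metric name `nginx.connections.active`, the host `localhost`, and the helper names are assumptions for illustration. Port 2003 is the conventional plaintext listener port, but your setup may differ.

```python
import socket
import time
from typing import Iterable, Optional


def carbon_line(path: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one metric in Carbon's plaintext form: 'path value timestamp\\n'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_carbon(
    lines: Iterable[str],
    host: str = "localhost",  # assumed Carbon host
    port: int = 2003,         # conventional plaintext port, check your config
    timeout: float = 5.0,
) -> None:
    """Send already-formatted lines to a Carbon plaintext listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


if __name__ == "__main__":
    # Hypothetical metric name; prints the line instead of sending it.
    print(carbon_line("nginx.connections.active", 42, 1700000000), end="")
```

Because the format is plain text, the same line can be produced by any tool that can open a TCP socket, which is the simplicity the paragraph above refers to.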

Challenges in designing AWS architecture

Designing AWS architecture is a complex task. It requires careful planning; a deep understanding of cloud services; and the ability to balance performance, cost, security, and scalability. As organizations migrate to the cloud or expand their existing cloud infrastructure, they often face several challenges that can impact the success of their architecture. Once the architecture is deployed, effective cloud monitoring becomes critical to ensure optimal performance and reliability.

Simplifying Kubernetes architecture for DevOps

Kubernetes has become the go-to platform for managing containerized applications, but its architecture can seem complex to DevOps teams. Let’s break it down into simple terms and explore how tools like Site24x7 can simplify the process of designing and monitoring Kubernetes architecture.

Learn about cloud waste and 6 effective ways to reduce it

Cloud waste occurs when cloud resources are unutilized or underutilized. Resource under-utilization occurs when more resources are procured than are actually needed by virtual machines (VMs) at runtime. Cloud providers continue to charge for these provisioned resources regardless of whether they are used or not, resulting in unchecked expenditure.

Out-of-box OpenTelemetry-powered Kafka & Celery monitoring

Messaging queues power modern distributed systems, handling background tasks, event-driven architectures, and real-time data streaming. However, debugging issues in Kafka and Celery queues has traditionally been a black box, with limited correlation between message producers, consumers, and broker metrics. With OpenTelemetry-powered Kafka & Celery monitoring, SigNoz introduces the industry's first fully integrated observability solution for messaging queues powered by OpenTelemetry.

The Best API Monitoring Tools in 2025: A Complete Guide

Imagine it's Black Friday and your e-commerce platform suddenly stops processing payments. The culprit? A critical API connection to your payment processor has failed, and you had no idea until angry customers started flooding your support channels. By the time your team identifies and fixes the issue, you’ve already lost thousands in potential sales and damaged your brand reputation.

What is Network Response Time & How to Monitor It

In a world where every second counts, one crucial metric that often flies under the radar is: Network Response Time. You might be wondering, "What exactly is network response time, and why should I care about it?" In this blog post, we're going to break down the concept of network response time into digestible bits (pun intended), and we'll explore why it's a game-changer for businesses of all sizes.

Caution: High Value Information #webinar #sre

Join us for an exclusive webinar with Ben Good from Google as we explore the findings in the 2024 State of DevOps report. For over a decade, the DORA report has provided critical insights into the capabilities and practices that fuel high-performing technology organizations. This report highlights the significant impact of AI on software development, explores platform engineering’s promises and challenges, and emphasizes user-centricity and stable priorities for organizational success.

Server Monitoring with Graphite

Learning server monitoring is crucial if you want to use your servers efficiently. It helps you optimize a server's performance and diagnose issues productively. One useful tool is Graphite, which monitors a server’s performance and provides graphing so you can gain valuable insights into your server. You can explore MetricFire’s Hosted Graphite service today by signing up for a free trial or booking a demo session.

How to Troubleshoot An Internet Local Loop Issue | Obkio Use Case Series

Is your Internet connection acting up? In this video, we’ll walk you through how to identify and troubleshoot an Internet Local Loop issue using Obkio’s Network Performance Monitoring tool. Learn how to pinpoint the root cause of connectivity problems and ensure a reliable network for your business. What you’ll learn: what an Internet Local Loop is; how to detect Local Loop issues; and how Obkio helps you troubleshoot network problems.

How to cut costs for metrics and logs: a guide to lowering expenses in Grafana Cloud

Observability is essential to maintaining system reliability, but as your infrastructure scales, so do your costs. Between metrics and logs, managing telemetry data can become overwhelming and expensive. Grafana Cloud is already designed to be cost-efficient, but scaling can still present cost challenges. The good news? Grafana provides robust tools and best practices to help optimize observability data and rein in spending.

Integrate AppSignal with AWS Fargate in Python Flask

In this tutorial, we’ll show you how to integrate AppSignal with a Flask application running on AWS Fargate. Fargate is a serverless container service that allows you to run Docker containers in the cloud. By integrating AppSignal with AWS Fargate, you can monitor the performance of your Flask application and get insights.

ELK vs New Relic: Which Monitoring Tool Should You Choose in 2025?

Effective observability is crucial for maintaining system performance and reliability. ELK Stack and New Relic are two widely used solutions that offer distinct approaches to monitoring, tracing, and logging. This comparison will help you understand their core features, use cases, and strengths, enabling you to make a more informed decision on which tool best aligns with your organizational goals. Let's get started!

How to Filter Docker Logs with Grep

Managing logs in Docker can quickly become overwhelming, especially when dealing with multiple containers. If you’ve ever tried to sift through a sea of log entries looking for a specific error or debugging message, you know the struggle. Fortunately, you can pipe docker logs output through grep to filter logs efficiently. This guide breaks down how to use docker logs with grep effectively, including practical examples to help you debug and monitor your containerized applications like a pro.
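On the command line the idea looks like `docker logs my-container 2>&1 | grep -i error`. The `2>&1` matters because `docker logs` writes the container's stderr stream to its own stderr, which a plain pipe would skip. The same filtering can be sketched in Python, swapping `grep` for the `re` module. The function names and the container name `my-container` are hypothetical, and the sketch assumes the `docker` CLI is installed and on the PATH.

```python
import re
import subprocess
from typing import Iterable, List, Optional


def grep_lines(lines: Iterable[str], pattern: str, ignore_case: bool = False) -> List[str]:
    """Keep only lines matching the regex, like `grep` (or `grep -i`)."""
    flags = re.IGNORECASE if ignore_case else 0
    rx = re.compile(pattern, flags)
    return [line for line in lines if rx.search(line)]


def docker_logs(container: str, tail: Optional[int] = None) -> List[str]:
    """Return a container's log lines, merging stderr into stdout (like 2>&1)."""
    cmd = ["docker", "logs"]
    if tail is not None:
        cmd += ["--tail", str(tail)]  # limit output to the last N lines
    cmd.append(container)
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()


if __name__ == "__main__":
    # Hypothetical container name; requires a running Docker daemon.
    for line in grep_lines(docker_logs("my-container", tail=500), "error", ignore_case=True):
        print(line)
```

Keeping the filter separate from the `docker` call makes it easy to reuse on log lines from other sources, such as files or test fixtures.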

Ubuntu System Logs: How to Find and Use Them

System logs play a crucial role in debugging and monitoring in Ubuntu. When a service misbehaves or an unexpected crash happens, logs hold the answers. They’re also great for keeping an eye on system performance. Knowing how to access, read, and manage these logs can save you hours of troubleshooting. This guide covers everything you need to know about Ubuntu system logs—from where they’re stored to how to analyze them efficiently.

eG Innovations' AIOps-Powered APM

I recently wrote about how eG Innovations AIOps-powered monitoring benefits those working with Digital Workspaces – today I’ll cover how those same AIOps (Artificial Intelligence for IT Operations) capabilities also make the eG Enterprise platform a leader in the APM (Application Performance Monitoring) space. The eG Enterprise platform is equipped with capabilities for automated corrective actions, event-based triggers, and remote-control functionalities.

Deeper Trace Analytics - Quickly search through all spans, entry spans and root spans

Debugging distributed systems can often feel like searching for a needle in a haystack. When issues arise, devs need faster ways to pinpoint critical spans within their traces. With our latest Deeper Trace Analytics update, we now enable powerful filtering for root and entry spans — making it significantly easier to analyze and debug distributed traces.

Reducing the Costs and Operational Overhead of Kafka Infrastructures

Kafka is powerful. No doubt about it. But it’s also a beast when it comes to operational complexity and cost. What starts as a simple deployment quickly turns into a resource-hungry system that eats up engineering hours, compute power, and budget. Let’s consider a company that eagerly rolls out Kafka to streamline event streaming. Year one? Smooth sailing. Everything runs fine, and the team feels great. Year two? The cracks start to show.

Deeper Trace Analytics - Analyze Root & Entry Spans with Ease | SigNoz Launch Week 3.0 Day 4

Debugging distributed systems can often feel like searching for a needle in a haystack. When issues arise, devs need faster ways to pinpoint critical spans within their traces. With our latest Deeper Trace Analytics update, we now enable powerful filtering for root and entry spans — making it significantly easier to analyze and debug distributed traces.

The top 5 network security threats every CIO should know in 2025

During a routine network check, your network bandwidth monitoring tool flags an unusual spike in bandwidth usage from a critical server. Further investigation reveals an unauthorized data transfer attempt originating from a misconfigured device. What would have happened if the IT team did not have a monitoring tool to identify the spike? Without the right tools, this simple red flag could escalate into a costly disaster: ransomware, compliance fines, or even operational paralysis.

Getting started with SCOM dashboards

In this blog, we will use the SquaredUp Cloud SCOM plugin to connect to our SCOM Management Group and take a look at what we get out of the box. SquaredUp Cloud is a data visualization tool that can connect to 70+ data sources – perfect for bringing varied data together in a single pane of glass. Display your SCOM data alongside other important metrics.

The Modern Data Center: How AI is Reshaping Infrastructure

The traditional data center is undergoing a dramatic transformation. As artificial intelligence reshapes industries from healthcare to financial services, it’s not just the applications that are changing—the very infrastructure powering these innovations requires a fundamental rethinking. Today’s data center bears little resemblance to the server rooms of the past.

Reducing the Costs and Operational Overhead of Apache Kafka Infrastructures

The Hidden Costs of Apache Kafka: Apache Kafka is powerful. No doubt about it. But it’s also a beast when it comes to operational complexity and cost. What starts as a simple deployment quickly turns into a resource-hungry system that eats up engineering hours, compute power, and budget. Let’s consider a company that eagerly rolls…

OpenTelemetry-Powered Infrastructure Monitoring - SigNoz Launch Week 3.0 Day 1

Today, we’re excited to announce a much-awaited feature in SigNoz: Infrastructure Monitoring. With our latest OpenTelemetry-powered Infra Monitoring, we bring you a native OpenTelemetry experience that seamlessly integrates infrastructure metrics with application performance data.

Early Warning in AIOps from HEAL Software: The Key to Preventing Downtime

The answer is yes. But, as with any AI solution, the reality is more nuanced. At HEAL Software, we have spent years perfecting our Early Warning feature by analyzing anonymized data from thousands of global customers and collaborating with IT leaders across industries. AIOps isn’t just a buzzword—it’s a necessity for modern enterprises looking to minimize downtime and enhance operational efficiency.

What is Synthetic Monitoring: The Secret Sauce to Network Monitoring

Picture this: You're the IT manager at a large company, and you're responsible for ensuring that your network is running smoothly. But how do you know if everything is working as it should be? You could wait for someone to report a problem, but that's reactive and not ideal. You could monitor your network constantly, but that's impractical and time-consuming. So what's the solution? Enter synthetic monitoring, the secret sauce to network monitoring.

Distributed Tracing 101: Definition, Working and Implementation

Modern applications rely on microservices, making it tough to track issues across services. Distributed tracing helps by mapping a request’s journey and pinpointing latency, failures, and dependencies. Unlike traditional monitoring, tracing connects the dots between services, offering deeper visibility. But implementing it isn’t easy—it brings high data volumes, performance overhead, and complexity.

AWS CSPM Explained: How to Secure Your Cloud the Right Way

As organizations expand their AWS footprint, maintaining visibility and control over configurations can be challenging. Misconfigurations, unnoticed vulnerabilities, and compliance gaps can create serious security risks. AWS Cloud Security Posture Management (CSPM) helps teams navigate these challenges by automating security checks, ensuring compliance, and providing continuous monitoring. Here’s what you need to know about AWS CSPM and why it’s essential for securing your cloud environment.

Monitoring Kubernetes Resource Usage with kubectl top

Efficient resource utilization is key to running Kubernetes workloads smoothly. Whether you're troubleshooting performance issues, optimizing resource requests and limits, or keeping an eye on cluster health, the kubectl top command is an essential tool. It provides real-time CPU and memory usage metrics for nodes and pods, helping you make informed decisions about scaling and resource allocation.

Never Stand Watch Alone: Apica is the Always-On Partner for SREs

As we navigate through 2025, Site Reliability Engineers face unprecedented challenges in maintaining system reliability and performance at scale. With the rapid evolution of distributed systems, containerization, and AI-driven operations, SREs need more sophisticated tools than ever to do their job as grid guardians.

Stop Losing Sales! The Biggest UX Friction Traps in eCommerce

Friction in eCommerce is a silent sales killer. When customers hit roadblocks—slow pages, confusing layouts, unnecessary steps—they ditch their carts and move on. The problem? Many online stores create friction without even realizing it. But here’s the deal: Not all friction is the same. Some comes from clunky tech, while other issues stem from poor design choices or pushy sales tactics.

9 Reasons Your Business Needs Continuous Network Monitoring in 2025

Numerous technological advancements have made it easier to conduct financial transactions and business. However, cyber-attacks and network inefficiency remain a threat. That’s why your business needs continuous network monitoring. Keeping constant watch over the IT infrastructure of your business is crucial for its survival. It would be very disappointing for your thriving enterprise to come crashing down due to easily thwarted threats that went unnoticed.

Traces Without Limits - Load a Million Spans with SigNoz

Observability at scale is challenging—especially when dealing with high-volume distributed traces. Traditional tracing tools struggle with large traces containing thousands of spans, often leading to sluggish UIs and an unmanageable debugging experience. Most tracing tools we checked have a limit on the maximum spans they can load for a single trace. But with SigNoz, we’ve redefined what’s possible.

Why LogicMonitor is best for network monitoring

As modern networks evolve into intricate ecosystems spanning on-premises, cloud, and hybrid environments, the need for a robust, scalable monitoring solution has never been greater. Organizations face the challenge of maintaining performance, minimizing downtime, and managing ever-increasing complexity.

5 Ways to Avoid Alert Fatigue in Network Monitoring

Alert fatigue is the silent productivity killer in IT operations, and its impact is more significant than you might think. A 2023 survey by CloudHealth Technologies found that 63% of organizations deal with over 1,000 cloud infrastructure alerts every single day. 22% report receiving more than 10,000 alerts each day. This highlights the critical need to minimize alert fatigue.

Logz.io Open 360 Platform Overview

Welcome to Logz.io, where we make monitoring, troubleshooting, and optimizing your systems easier than ever. Our AI-driven observability platform helps you: ingest and manage your logs effortlessly; analyze and visualize data with powerful filtering & alerting; pinpoint root causes instantly with AI-powered RCA; optimize observability costs with DataHub; and ensure peak system performance with Kubernetes 360 & App 360.

Out-of-the-box OpenTelemetry-powered Kafka & Celery monitoring | SigNoz Launch Week 3.0 Day 3

Today, we are excited to announce OpenTelemetry-powered messaging queue monitoring in SigNoz. Debugging issues in Kafka and Celery queues has traditionally been a black box, with limited correlation between message producers, consumers, and broker metrics. With our messaging queue monitoring, teams can correlate Kafka broker metrics with OpenTelemetry spans, enabling deep insights into consumer lag, throughput, drop rates, and performance bottlenecks.

Challenges and Best Practices for Monitoring SaaS-based Businesses

SaaS offers convenience, scalability, and ease of access which makes it a powerful choice for businesses of all sizes. However, monitoring SaaS applications presents unique challenges that can impact performance, security, and user experience if not handled properly. To maintain client trust and meet Service Level Agreements (SLAs), SaaS providers must implement a proactive monitoring strategy.

How to Choose the Right Network Monitoring Tool: 7 Essential Factors

Half of all server failures lead to staff working overtime, driving up costs and highlighting the critical need for effective monitoring. This underscores the importance of choosing the right network monitoring tool. It is a critical decision that impacts not only how well your infrastructure performs today but also how easily it can scale and adapt in the future. A comprehensive monitoring solution needs to balance deep technical capabilities with ease of use and scalability.

Think proactive monitoring for Teams Phone is too good to be true? Think again.

Collaboration platforms like Microsoft Teams are absolutely central to how enterprises get business done these days. But sometimes the fastest, most direct way to answer a question, solve a problem or make a connection is still to pick up the phone and call. The value of solutions like Microsoft Teams Phone is that they offer the best of both worlds: the simplicity and efficiency of voice communication integrated with digital collaboration tools and capabilities.

Resolving Redis connection issues with comprehensive log review

Redis is a highly efficient, versatile in-memory data store that is commonly utilized in modern applications. However, like any technology, it is not without its challenges, particularly when it comes to managing connections. By systematically reviewing Redis logs, you can diagnose and resolve these problems effectively. This blog provides an overview of Redis logs, explores their importance, and highlights how log management tools can simplify troubleshooting.

Resolving Kafka consumer lag with detailed consumer logs for faster processing

Apache Kafka is a distributed event streaming platform designed to handle large volumes of real-time data. It is widely used for messaging, logging, event processing, and real-time analytics. Kafka is known for its ability to handle high throughput, fault tolerance, and scalability, making it an essential tool for modern data-driven applications. Kafka operates with three main components: Latency refers to the time delay between when a message is produced and when it is consumed.

Understanding the Observability Data Lifecycle: From Data Ingestion to Automated Actions

Modern IT estates are increasingly complex, generating vast amounts of data – some critical and actionable, but much of it mere noise. Extracting meaningful insights to ensure optimal system health and IT performance is beyond the scope of humans. This is where observability, enhanced by AI and automation, becomes essential.

The Three Pillars of Network Monitoring: A Holistic Strategy

To truly safeguard your infrastructure, it’s crucial to adopt a holistic strategy that covers every aspect of your network’s health and performance. This means integrating fault monitoring, performance monitoring, and availability monitoring into a comprehensive strategy. Let's discuss how a well-rounded approach to network monitoring can help you maintain resilience, optimize performance, and prevent downtime.

Monitor Google Cloud: simplify and centralize your cloud provider observability with Grafana Cloud

Organizations increasingly rely on Google Cloud to power critical parts of their businesses, but managing those environments often involves navigating a labyrinth of disparate data, tools, and processes. We built Google Cloud Observability in Grafana Cloud to reduce the complexity and confusion by providing a unified, scalable solution designed to simplify monitoring, enhance visibility, and optimize costs.

Your App Might Be Down; Let's Fix It - Introducing Sentry Uptime Monitoring

Even at Sentry, we're not immune to downtime. In a moment of "oh-the-irony," we once took down our own application with a bad migration. We were adding a field to a critical database table, and the migration locked it completely. Since this table was essential to Sentry’s operation, the entire app went down. The website wouldn’t load, ingestion paused—everything ground to a halt.

Right Data, Right Now: Why Timely, Actionable Network Observability is Essential

For teams in many organizations, the work of IT and network management keeps getting more difficult. A recent EMA survey offers some findings that clearly illustrate this point. When respondents were asked which networking skills are the most difficult to find, several roles received a response of 30% or more, including network security, network monitoring and troubleshooting, and data center networking.

NiCE VMware Management Pack 5.8

Great news for all VMware users! NiCE just rolled out VMware Management Pack 5.8 for Microsoft SCOM, bringing full support for VMware vSphere 8.0.1, 8.0.2, and 8.0.3. This update keeps your monitoring sharp and up-to-date with the latest VMware environments. Plus, we’ve polished up the docs to make life easier. If you’re an existing customer, this update is ready for you! Stay ahead of the game and keep your virtual environments running smoothly!

Log Levels: Answers to the Most Common Questions

Logging is essential for understanding what’s happening inside your software. It helps developers and operators catch issues, monitor system health, and track application behavior. A big part of logging is log levels—these indicate how serious a message is, from routine updates to critical errors. In this post, we’ll break down everything you need to know about log levels, how they compare to Syslog log levels, and best practices for making the most of your logs.
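As a quick illustration of how a level threshold filters messages, here is a sketch using Python's standard `logging` module. The logger name, the sample messages, and the `ListHandler` helper are made up for the example. The Syslog mapping shown is one common convention rather than the only possible one.

```python
import logging
from typing import List, Tuple

# One common mapping from Python levels to Syslog severities
# (0 = emergency ... 7 = debug). Treat it as a convention, not a standard.
SYSLOG_SEVERITY = {
    logging.DEBUG: 7,     # debug
    logging.INFO: 6,      # informational
    logging.WARNING: 4,   # warning
    logging.ERROR: 3,     # error
    logging.CRITICAL: 2,  # critical
}


class ListHandler(logging.Handler):
    """Collect emitted records in memory so the effect of the level is visible."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[Tuple[str, str]] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append((record.levelname, record.getMessage()))


def demo(threshold: int = logging.WARNING) -> List[Tuple[str, str]]:
    """Log one message per level and return those that pass the threshold."""
    logger = logging.getLogger("levels_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(threshold)
    handler = ListHandler()
    logger.addHandler(handler)

    logger.debug("cache miss for key user:42")
    logger.info("request served")
    logger.warning("disk usage is high")
    logger.error("payment API call failed")

    logger.removeHandler(handler)
    return handler.records


if __name__ == "__main__":
    for level, message in demo():
        print(level, message)
```

With the threshold at `WARNING`, the debug and info messages are dropped before they reach any handler, which is why production systems often run at `INFO` or `WARNING` and switch to `DEBUG` only while troubleshooting.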

The Ultimate Guide to OpenTelemetry Visualization

Modern software systems are complex, with multiple services interacting across different environments. Understanding how they behave—tracking performance, identifying bottlenecks, and diagnosing failures—requires more than just collecting data. OpenTelemetry provides a standardized way to gather logs, metrics, and traces, but the real value comes from making that data easy to interpret through visualization.

OpenTelemetry-Powered Infrastructure Monitoring

Today, we’re excited to announce a much-awaited feature in SigNoz: Infrastructure Monitoring, built natively on OpenTelemetry. Infrastructure monitoring is a critical aspect of modern observability. Without proper visibility into your infrastructure resources, troubleshooting issues, optimizing costs, and maintaining performance become challenging.

Query the Latest Values in Under 10ms with the InfluxDB 3 Last Value Cache

As part of the InfluxDB 3 Core and InfluxDB 3 Enterprise public alpha, the Last Value Cache (LVC) is available for testing. The LVC lets you cache the most recent values for specific fields in a table, improving the performance of queries that return the most recent value of a field for specific time series or the last N values of a field, typical of many monitoring workloads. With the LVC, these types of queries return in under 10ms.

Cloud Monitoring's Blind Spot: The User Perspective

The evolution of internet-centric application delivery has worsened IT's visibility gaps into what impacts an end user's experience. This problem is exacerbated when these gaps lead to negative business consequences, such as loss of revenue or lower Net Promoter Scores (NPS). The need to address this worsening visibility gap problem is reinforced by Gartner’s recent publication of its first Magic Quadrant for Digital Experience Monitoring (DEM).

What is Platform Engineering and Why is it Important?

Without the right frameworks in place, software development often feels like managing a project with too many moving parts and no cohesive plan. A good solution to this problem would be having a unified platform that streamlines processes, integrates tools, and provides consistency across the development lifecycle. That’s what platform engineering offers—it simplifies the complexities of software development by making it easier to build, deploy, and maintain digital infrastructure.

Stop Logging the Request Body!

With more and more people adopting OpenTelemetry and specifically using the tracing signal, I’ve seen an uptick in people wanting to add the entire request and response body as an attribute. This isn’t ideal, just as it wasn’t when people were logging the body as text logs. In this blog post, I’ll explain why this is a bad idea, what the pitfalls are, and more importantly, what you should do instead.

New Relic vs Kibana: A Guide to Choosing the Right Tool in 2025

New Relic and Kibana are popular monitoring and observability tools that provide a wide range of features for analysing and visualizing data. In this post, I have compared New Relic and Kibana based on key aspects such as data ingestion, dashboards and visualizations, log management, alerting, pricing and more. Let's take a look at each tool's capabilities, strengths, and weaknesses to help you understand how they differ and which one is best suited to your needs.

Grafana Beyla 2.0: distributed traces, scalable Kubernetes deployments, and more

In November 2023, we released Grafana Beyla 1.0, the first major milestone in our pursuit of zero-code (and zero-effort) eBPF instrumentation. We delivered a way — through a single command-line — to automatically instrument any application supporting HTTP/gRPC protocols, as well as provide basic network packet flow information.

Observe Your Google Cloud Infrastructure | Demo: New Grafana Cloud Application | Grafana Labs

Want to monitor your Google Cloud infrastructure more effectively? Join Vasil Kaftandzhiev as he introduces Grafana Cloud’s new application designed specifically for Google Cloud observability. In this video, you'll discover how to optimize and troubleshoot your Google Cloud services; leverage out-of-the-box dashboards with key metrics and thresholds; set up comprehensive alerting for real-time incident response; streamline log management with an all-in-one logs view for faster root cause analysis; and configure logs and metrics effortlessly using Grafana Alloy.

How To Monitor Kubernetes with Splunk Infrastructure Monitoring

Kubernetes is the standard for orchestrating containerized microservices — but it can present some monitoring challenges. Luckily, we’ve already covered why monitoring Kubernetes is a must-do, the basics of how to do it, and the options you have for collecting monitoring data from a K8s environment.

Solve Problems Faster with New, Smarter AI and Integrations in Splunk Observability

As businesses scale across hybrid and multi-cloud environments and integrate AI-powered technologies, complexity grows — and with it, the risk of performance degradation and cost of downtime. To avoid facing customer-impacting IT issues, organizations need better ways to correlate data across environments, detect anomalies before they escalate, and resolve incidents more efficiently. That’s where Splunk and Cisco come in.

From Datadog to Grafana Cloud: Why companies migrate and how it changes business for the better

“Impossibly expensive.” “Generic database metrics.” “Exceeding limits.” “No transparency.” These are the words our customers use to explain why they looked for a Datadog alternative and migrated onto Grafana Labs’ observability solutions. Grafana Cloud provided the scalability that LexisNexis Risk Solutions needed to migrate acquired companies into a unified observability platform. “We’ve had migrations from Datadog.

How to visualize user journeys with Site24x7 to spot opportunities to improve the UX

Before judging anyone, walk a mile in their shoes. This is a great idiom that emphasizes the importance of experiencing what your customers experience when you offer a service. With empathy, IT product owners can ensure that their operations take into account user journeys to be responsive and responsible.

4 Best Backlink Indexing Tools in 2025 (SEO Indexers Guide)

Website owners and SEO professionals need effective tools to speed up search engine recognition of their backlinks. This review analyzes the leading backlink indexers available in 2025, testing their performance metrics, costs, and technical capabilities. We've tested Giga Indexer, Rapid URL Indexer, Backlink Indexing Tool, and Indexceptional to determine how well each one handles the indexing process.

Coralogix Releases eBPF Observability for K8s Workloads

There are several big barriers to an effective tracing strategy. Modern applications require complex code instrumentation, and legacy applications might not be so easy to alter. That’s assuming every engineering team can be engaged to make the necessary changes. eBPF & OpenTelemetry flip this entire problem on its head, and Coralogix is one of the first major observability platforms to leverage this exciting functionality to provide an unobtrusive, low-risk overview of your system.

Unveiling Azure's Hidden Costs: What You Need to Know

So, you’re new to the cloud or just starting off with Azure. You’re probably starting your first project and using the Azure Calculator to help estimate your monthly run rate. The problem is that Azure, like all clouds, has hidden costs. So why does the cloud have hidden costs? Well, while we call them hidden costs, it’s really more a matter of unexpected costs or unknown costs.

Cloud storage: Walkthrough, challenges and solutions

Cloud storage has become an integral part of enterprise IT infrastructure. Cloud engineers, SREs, SysAdmins, and CTOs are always on the lookout for more avenues to keep their organization's data secure, accessible, and managed. In this blog post, let us explain cloud storage in detail, the associated challenges, and how to overcome them.

Strategic IP address management (IPAM): A must-have solution for high volume networks

Managing enterprise IT infrastructure isn’t just about staying afloat—it’s about being one step ahead with strategic IP address management in modern enterprise IT. Each day, IT teams grapple with network sprawl, security challenges, and the constant demand for scalability. But here’s a question: how does your enterprise manage its IP address space? If your answer is “manually” or “through spreadsheets,” it’s time to rethink your approach.

Navigating AWS policy changes in 2025: The role of CloudSpend in mitigating the impacts

AWS SP/RI policy changes: In a significant move, AWS announced policy changes to the use of Reserved Instances (RIs) and Savings Plans (SPs), which are set to take effect on June 1, 2025. These changes are particularly crucial for MSPs, resellers, and other organizations that rely on shared RIs and SPs to manage cloud costs.

Managed OpenSearch: Pricing and How Logit.io is the Best Value

If you’re considering OpenSearch for your search and analytics infrastructure, the first question that likely comes to mind is: what will it cost? OpenSearch, the powerful, open-source search engine and analytics platform, provides a highly scalable solution for businesses. However, while the software itself is free to use, there are still costs associated with hosting, maintaining, and scaling OpenSearch clusters.

Why Cybersecurity Asset Management is Crucial for Cyber Hygiene

The concept of managing IT assets for security purposes has been around since the earliest days of computer networks in business. The term “Cybersecurity Asset Management (CAM)” itself, however, is relatively new. Teneo has been opening minds to CAM for some time now, so here is a summary of what it is and why it’s so important to maintaining good Cyber Hygiene.

Casio UK Hit With Payment Skimming Attack

In early February 2025, reports emerged of a sophisticated web skimming attack that compromised the UK website of electronics manufacturer Casio, and at least 16 other ecommerce sites. This Magecart-style breach led to the theft of customers’ personal and payment information, highlighting the persistent threat of digital skimming to online retailers.

How Azure Observability Optimizes Performance and Monitoring

Observability in Azure isn’t just about tracking metrics—it’s about truly understanding how your cloud infrastructure, applications, and services are performing. It helps you spot issues before they become problems, optimize performance, and ensure security. In this guide, we’ll break down Azure Observability in a way that’s easy to follow, covering key concepts, best practices, and some useful tricks to give you an edge.

Top 5 outages detected by StatusGator in January

StatusGator continues to deliver crucial early warnings for major service disruptions, detecting outages before official acknowledgment. Below, we highlight major incidents from January 2025, where StatusGator’s real-time monitoring kept users informed and helped minimize workflow disruptions.

How to Troubleshoot Networks with Employees Working from Home

With employees working from home, often relying on personal Internet connections and consumer-grade equipment, IT teams face a new set of challenges in ensuring seamless connectivity. Unlike traditional office environments, where networks are controlled and optimized, home networks are unpredictable and prone to a variety of issues – from slow Internet speeds to intermittent connectivity.

Everything You Need to Know About Microsoft Sentinel Pricing

Keeping your organization secure is more important than ever. Microsoft Sentinel, a cloud-native Security Information and Event Management (SIEM) solution, helps detect and respond to threats effectively. But to get the most out of it, it’s important to understand how the pricing works.

NiCE AIX Management Pack | 5 Minutes Explainer Video

This short video will give a quick overview of the main features of the NiCE AIX Management Pack, such as Discovery, Monitors, Advanced Product Knowledge, Tasks, Performance Views, Reporting, and, of course, Security aspects of advanced AIX monitoring on Microsoft System Center Operations Manager.

Top 15 PostgreSQL Monitoring Tools in 2025

Imagine your PostgreSQL database slowing down unexpectedly, causing delayed queries, application timeouts, and frustrated users. In a data-driven world, database performance issues can lead to downtime, lost revenue, and poor user experience. PostgreSQL monitoring tools help detect and resolve performance bottlenecks, ensuring optimal database health, efficient query execution, and high availability.

Frontend Monitoring: Deliver Seamless and Performant User Experiences

88% of online consumers are less likely to return to a site after a bad user experience. This means that addressing frontend issues such as slow load times, broken features, and unresponsive elements is crucial. Frontend monitoring helps development and IT teams proactively catch and resolve these issues to improve their user experience.

Elastic Cloud Serverless now available in technical preview on Microsoft Azure

Elastic Cloud Serverless provides the fastest way to start and scale security, observability, and search solutions without managing infrastructure. Today, we are excited to announce the technical preview of Elastic Cloud Serverless on Microsoft Azure, now available in the EastUS region.

Why observability needs FinOps, and vice versa: the Vantage integration with Grafana Cloud

Ben Schaechter is co-founder & CEO of Vantage, a cloud cost management platform that provides actionable insights for every engineer. Observability tools have changed the way we monitor infrastructure and applications, as teams get complete visibility into performance across complex, multi-cloud environments. But as all that infrastructure scales, costs rise with it, and organizations are left to ask: Where are my costs going—and why?

The role of Redis monitoring in scaling applications for high-traffic environments

High-traffic applications demand speed, reliability, and scalability, making Redis a top choice for tasks like caching and real-time analytics. However, as traffic grows, ensuring Redis operates at peak performance requires effective monitoring. By tracking key metrics, addressing bottlenecks, and optimizing resource use, Redis monitoring plays a vital role in maintaining stability and scalability.

Top 10 challenges for SREs and how to overcome them with APM tools

According to Google, “SRE is what you get when you treat operations as a software problem.” The role of site reliability engineers (SREs) is evolving rapidly to ensure optimal application performance in today's changing IT environments. SREs are expected to provide proactive and predictive solutions for the issues arising from managing such environments. A Gartner report even suggests that by 2025, 70% of organizations will depend on SRE practices to ensure operational resilience.

Migrating to Amazon DaaS - Part 1 - How to leverage AIOps monitoring during a migration to Amazon WorkSpaces or AppStream 2.0

If you are considering or planning a migration to Amazon WorkSpaces or AppStream 2.0, you’ll also want to consider how you integrate effective monitoring into your planning and execution – this will not only save you time and money long term but will also help you measure and achieve success.

Virtana in Gartner Research 2024: A Mark of Excellence in Infrastructure Observability

Research and analysis by Gartner¹ carries significant weight in the technology industry, serving as a trusted source of insights for IT decision-makers worldwide. Their rigorous evaluation processes and comprehensive market analysis help organizations make informed technology investments. When a company is featured across multiple Gartner research publications, it demonstrates market relevance and solution maturity.

Access your data with Federated Analytics for Amazon Security Lake. Insights from Splunk, AWS, and A

Federated Analytics gives organizations the full power of Splunk extended to data stored in Amazon Security Lake. Trusted partners like Accenture are helping bring these new capabilities to life at organizations around the world.

What Does Low Network Bandwidth Mean & How to Fix It

Network performance is critical for everything from streaming videos to running cloud applications. But what happens when your network feels sluggish, and tasks that should take seconds suddenly take minutes? The culprit could be low network bandwidth. In this article, we’ll break down what low bandwidth means, how it affects your network, and actionable steps to fix it.

Getting started with PowerShell dashboards

SquaredUp is a flexible dashboard and analytics platform that makes it really easy to turn your PowerShell scripts into dashboards that you can use for monitoring or sharing. In this article we’ll take a look at getting started with the PowerShell plugin for SquaredUp and build our first dashboard. Sign up for a free account if you’d like to follow along.

The Role of Log Monitoring in Securing Hybrid Cloud Infrastructures

Hybrid cloud services have become a cornerstone for many businesses. These technologies, which combine the strengths of private and public clouds, assist enterprises in achieving their dreams of scalability, flexibility, and cost-efficiency. However, this added optimization comes at a cost, particularly with increased operational complexity and security concerns. To minimize cyber threats and secure their data, businesses must invest in more security solutions, such as log monitoring.

Latest Product Updates and Features in Logz.io | February 2025

We’re excited to announce a series of upgrades to our AI Agent, Log Management Explore UI and core integrations designed to empower you with even deeper observability and streamlined operations. These updates enhance account visibility, multi-telemetry trace insights, and logging capabilities while ensuring seamless compatibility with OpenTelemetry. Read on to discover how these enhancements can help you gain more clarity and control over your environment.

Generation AI (Episode 2): How Generative AI is Shaping the Future of Security Operations

The next golden age of artificial intelligence has arrived, but the path forward is far from certain. Technology leaders are presented with a tremendous opportunity to revolutionize their business, if they can find a way to tap into the full potential of their organization's data. In Episode 2 of Elastic's new limited series, Generation AI, Elastic's CISO, Mandy Andress, shares how she believes generative AI will shape the future of security operations in the modern enterprise.

Generation AI (Episode 1): How Generative AI is Shaping the Future of Enterprises

The next golden age of artificial intelligence has arrived, but the path forward is far from certain. Technology leaders are presented with a tremendous opportunity to revolutionize their business — that is, if they can find a way to tap into the full potential of their organization's data. In Episode 1 of Elastic's new limited series, Generation AI, Elastic's CIO, Matt Minetola, shares how he believes generative AI will shape the future of the modern enterprise.

AWS Monitoring Trends 2025

Discover the top trends shaping AWS monitoring in 2025! From AI-powered predictive analytics to sustainability-focused tools, this video dives into the innovations driving the future of cloud infrastructure. Stay ahead in the evolving cloud landscape with these key trends. Watch now to learn how to achieve smarter, faster, and more sustainable AWS monitoring in 2025 and beyond! Subscribe for more cloud insights!

Monitor Amazon Kinesis Firehose in Hosted Graphite

We’ve supported syncing your metrics from Kinesis Streams, Amazon’s streaming data platform, for several years. Kinesis Streams helps you gather and process streaming data which can then be monitored in your Hosted Graphite account. Recently, we’ve added support for Firehose, a fully managed and scalable service that allows users to stream data to destinations like Amazon Simple Storage Service (Amazon S3), Amazon Redshift, or Amazon Elasticsearch Service (Amazon ES).

Real User Monitoring for B2B vs. B2C Businesses

Imagine you’re a product manager at a B2B SaaS company. Monday morning, a frustrated client floods your inbox—their workflows were disrupted by a slowdown you could’ve caught sooner with better user insights. Now, imagine running an e-commerce store on Cyber Monday. Traffic surges, but abandoned carts spike. Your RUM dashboard reveals slow mobile checkouts. A quick fix saves thousands in sales.

Generation AI (Episode 3): How Generative AI is Shaping the Future of Customer Support

The next golden age of artificial intelligence has arrived, but the path forward is far from certain. Technology leaders are presented with a tremendous opportunity to revolutionize their business — that is, if they can find a way to tap into the full potential of their organization's data. In Episode 3 of Elastic's new limited series, Generation AI, Elastic's VP of Global Customer Support, Julie Rudd, shares how she believes generative AI will shape the future of customer support.

NGINX Log Monitoring: What It Is, How to Get Started, and Fix Issues

Ensuring that your web applications run smoothly and securely is essential. NGINX, known for its high performance and scalability, plays a key role in delivering web content. But to keep everything running efficiently, you need to monitor and analyze its logs properly. This guide will walk you through how to configure, analyze, and make the most of NGINX logs to stay on top of your server’s health.

Learn to Forecast Time Series Data Using ML & InfluxDB

Forecasting is all about predicting the future. In data science, it is one of the key skills in dealing with time series data, such as stock price prediction, sales forecasting, logistics planning, etc. In this tutorial, we’ll learn how to forecast the notorious weather pattern of London, UK, using free and open source technologies.

Beyond monitoring: The power of observability

The demand for seamless user experiences and robust system reliability is at an all-time high, and businesses are racing to meet these expectations. But as system complexity increases, traditional monitoring tools are falling short. Observability offers a paradigm shift. It goes beyond tracking metrics and provides deep insights to understand the “why” behind system behavior by parsing and contextualizing unstructured data.

How to Monitor Error Logs in Real-Time: An In-Depth Guide

For system admins and developers, being able to track error logs in real time is crucial. It’s not just about fixing problems; it’s about keeping everything running smoothly, ensuring systems perform at their best, and catching issues before they snowball into bigger ones. This guide breaks down the tools and commands that make real-time log monitoring easier and more effective, offering more than just the basics.

AWS CloudWatch Custom Metrics: Types & Setup Guide [With Examples]

Amazon CloudWatch is a monitoring and observability service that provides real-time insights into AWS resources and applications. While CloudWatch provides many default metrics, sometimes you need custom metrics to monitor specific aspects of your infrastructure or applications. This guide covers everything you need to know about CloudWatch custom metrics, from basics to advanced use cases.

Using a transformer-based text embeddings model to reduce Sentry alerts by 40% and cut through noise

Sentry uses Issue Grouping to aggregate identical errors and prevent duplicate issues from being created, and duplicate alerts being sent. One of the chief complaints we’ve heard from our users is that in some cases the existing algorithm did not sufficiently group similar errors together, and Sentry would create separate issues and alerts, causing unnecessary disruption–or at least annoyance–to developers.

5 Types of Checks Every Shopify Store Should Have

Running an online store based on Shopify can be a stressful experience. Meeting sales quotas or metrics, ensuring the store’s accessibility, and accessing data on user statistics are all concerns that any Shopify store owner will encounter. Though Shopify provides an excellent solution for sellers, additional monitoring services to ensure that the store is always available can be very helpful.

Introducing WebPageTest Expert Plan: Real-Time Insights, Synthetic + RUM together in One Platform

Imagine this: You push a major update to your website, confident that everything looks great. Hours later, traffic plummets. Your users complain about slow load times, but when you check WebPageTest, everything seems fine. What’s missing? Real-time insights and proactive monitoring.

Revising Icinga Exchange

Icinga is an open-source project, but it’s only become the product we like to use thanks to co-development, brainstorming and suggestions from the community. That’s why we created a platform in the past to facilitate the exchange of custom implementations like check plug-ins, styles, extensions and bridges to third-party systems. We’re talking about our Exchange Portal, of course.

Getting Started with OpenTelemetry Java SDK

Understanding how your applications perform is crucial. OpenTelemetry has emerged as a powerful observability framework, offering a standardized approach to collecting telemetry data such as metrics, logs, and traces. For Java developers, the OpenTelemetry Java SDK provides the tools necessary to instrument applications effectively. This guide is all about the OpenTelemetry Java SDK, exploring its components, configuration, and advanced features to help you harness its full potential.

Announcing Checkly Traces: Unified Synthetic Monitoring and Distributed Tracing

Until recently, Checkly was telling you what broke in your app. Now, it can also tell you why it broke. We're excited to announce the general availability of Checkly Traces, a new addition to our synthetic monitoring platform that bridges the gap between frontend monitoring and backend observability. By combining synthetic monitoring with distributed tracing, Checkly Traces empowers development teams to detect, diagnose, and resolve issues faster than ever before.

Why Observability 2.0 Is Such a Gamechanger

One of the hardest parts of my job is to get people to appreciate just how much of a difference Honeycomb/observability 2.0 is compared to their current way of working. It’s not just a small step up or a linear improvement. Rather, it’s an entire step change in the way that you write, deploy, and operate software for your customers.

Full Guide to Linux Disk IO Monitoring, Alerting and Tuning

Disk IO (Input/Output) is a core aspect of system performance. Whether you’re managing a database, a web application, or a cloud server, how efficiently your system reads and writes data affects everything from response times to stability. Unlike high CPU usage or memory bottlenecks that often manifest immediately, disk IO issues tend to creep up silently—until they slow down critical processes.

How to Stop Memory Leaks Before they Crash Your Linux System

Imagine you’ve got a leaky faucet in your kitchen. At first, it’s just a drip here and there—annoying, sure, but not enough to ruin your day. But leave it unchecked, and soon that drip turns into a steady trickle. Your water bill skyrockets, the sink overflows, and before you know it, you’re ankle-deep in chaos. Now, replace that faucet with a Linux system, and you’ve got a memory leak.

5 Ways to Prevent CPU Overload on Linux Servers

Every server administrator’s nightmare starts with a message: “CPU usage at 100%.” It’s that critical moment when your Linux server transforms from a reliable workhorse into a sluggish mess, taking your applications and user experience down. We’ve all been there… staring at a terminal, watching load averages climb, while frantically trying to figure out which process decided to throw a CPU-hungry party on our server.

AppSignal Now Offers Support for Long-Running Streaming Rack Responses in Ruby

We're excited to announce that AppSignal now offers improved monitoring for long-running streaming Rack responses. Our improved Rack response monitoring means you can gain deeper visibility into the health of your Ruby application's long-running responses, allowing you to catch errors that may arise minutes or even hours after a request's body is served. This new layer of observability results from a valuable contribution from Julik Tarkhanov, Director of Engineering at Cheddar Payments.

What Is Network Device Monitoring? Find Out 5 Top Monitoring Tools

Businesses, organizations, and individuals rely on networks to communicate and exchange data. The rapid growth of technology and increasing reliance on networked systems have made robust network performance and security critical. However, maintaining optimal network performance and security is a difficult task. Network failures, security breaches, and performance bottlenecks can result in substantial financial losses and reputational damage.

Monitoring coffee: Tales from Hosted Graphite's secret lab

It has been said that software engineers are organisms that convert caffeine into code. Not all software engineers need coffee to get by, but it's popular enough that it'd be silly for us not to have an office coffee machine... it'd also be sort of silly for a monitoring company not to monitor that coffee machine, which is so crucial that we could make a reasonable argument for it being part of the production infrastructure.

Locking Down PostgreSQL with SSL: Secure Remote Connections Like a Pro

PostgreSQL is a beast when it comes to handling data, but if you're running an instance that needs to be accessed remotely, securing it with SSL is non-negotiable. Without SSL, your database connection is essentially an open book for anyone snooping on the network. Let’s lock it down with properly signed certificates!

Kubernetes Monitoring and Alerting Made Easy with Splunk Observability Cloud and OpenTelemetry

In this video, I'll show you how to quickly set up monitoring and alerting for your Kubernetes clusters using Splunk Observability Cloud. We’ll start by deploying the Splunk OpenTelemetry Collector using Helm, and then use the Kubernetes Navigator inside Splunk Observability Cloud to view the health of our cluster and the applications it’s hosting. I’ll demonstrate AutoDetect detectors and alerts by intentionally triggering an issue in the cluster and walk through the alerting process. We’ll review the alerts in Splunk Observability Cloud and then resolve the issue in the cluster.

Getting Started with M365 dashboards

SquaredUp is a flexible dashboard and analytics platform that makes it really easy to dashboard your M365 and Intune usage and analytics. You can then use it for monitoring or sharing! In this article we’ll take a look at getting started with the M365 plugin for SquaredUp and building our first dashboard. Sign up for a free account if you’d like to follow along.

How AI-powered anomaly detection is transforming APM for SREs

Site reliability engineers (SREs) often face challenges in keeping an organization’s sites running smoothly as the complexity of distributed systems steadily increases. With the rise of microservices, cloud-native architectures, and massive data volumes, manual monitoring and troubleshooting are no longer sustainable. SREs must navigate hurdles like alert fatigue, incident response delays, and the constant pressure to maintain system reliability.

Top 5 EdTech outages detected by StatusGator in January 2025

Educational platforms are essential for students, educators, and institutions, making service disruptions especially impactful. StatusGator’s early detection ensures that users receive timely alerts before official acknowledgments, helping them navigate unexpected downtime. Below, we recap significant education-related outages from January 2025, where StatusGator kept users ahead of disruptions.

Petabyte Scale, Gigabyte Costs: Mezmo's Evolution from ElasticSearch to Quickwit

At Mezmo, we handle an enormous volume of telemetry data for our customers and ourselves, requiring a robust and efficient search and analytics backend. For years, ElasticSearch served us well, but as our infrastructure grew to a multi-cluster, multi-petabyte scale, we started to see the cracks—rising costs, performance bottlenecks, and scalability concerns. We needed a change, one that would make our system more cost-effective while maintaining speed and reliability.

How to Optimize Costs and Strengthen IT with Teneo's Deep Observability

Teneo understands that it can be hard to balance cost and depth of observability in today's fast-paced digital landscape, where organizations face the challenge of managing increasingly complex IT infrastructures while keeping costs under control. Achieving this balance requires a new approach, which is why we have developed our Open Observability platform, a critical component of Teneo’s StreamlineX framework.

Telemetry Pipeline 101

Are you looking to enhance your observability and gain deeper insights into your systems? Curious about how a Telemetry Pipeline can revolutionize your monitoring and troubleshooting capabilities while keeping the cost low? Join Mezmo’s Bill Balnave (Vice President of Technical Services) for an insightful webinar unraveling Telemetry Pipeline’s key concepts, highlighting its significance in modern software development and operations. Discover how a Telemetry Pipeline enables you to collect, profile, transform, and analyze crucial telemetry data from your applications and infrastructure.

Take Control of Incidents: Smarter Filtering, Collaboration & Insights

Managing incidents just got a whole lot easier. With our latest update, you get better visibility, smarter filtering, and a seamless way to collaborate with your team – right inside UptimeRobot. You can now access all incidents in one place under the new Incidents tab in the left sidebar. FYI: Some of our best improvements come directly from you—like the ability to add comments and make incidents searchable. We listen and we don’t judge—do you have an idea?

How to visualize CSV data with Grafana

While CSV data is often associated with popular spreadsheet apps like Google Sheets or Microsoft Excel, Grafana offers a number of capabilities to quickly visualize and analyze data stored in a CSV format. In this post, we’ll walk through an example of how to use Grafana to visualize any CSV file from anywhere on the web. Moving forward, you can also apply these steps to build any kind of dashboard within Grafana.
Sponsored Post

Top 10 .NET exceptions (part one)

Exception handling is essential to .NET development, but not all exceptions are equal. Some, like NullReferenceException, surprise developers with unclear stack traces and production crashes. Others, such as MySQLException or HttpRequestException, often point to issues like resource mismanagement or network failures. At Raygun, we've worked with teams around the world to monitor and fix software issues, giving us deep insight into how exceptions occur and how to handle them effectively.

6 key steps to drive successful network automation in your enterprise

The complexity of modern networks has surged due to digital transformation, hybrid work models, and evolving security threats, making manual management increasingly unsustainable. Network automation addresses this challenge by streamlining operations and enabling networks to adapt and remain resilient in an ever-changing environment. A recent Gartner study predicts that by 2026, 30% of enterprises will automate more than half of their network activities.
Sponsored Post

Introducing Agentic AI Platform by Fabrix.ai

Over the past couple of years, many of us have been utilizing Generative AI interfaces and co-pilots to enhance our communication, conduct research, and summarize complex information. AI-based agents are digital entities created to autonomously derive insights from data and execute actions. Agents are focused on accomplishing a specific outcome without the need for constant human intervention.

Streamlining Telemetry with Apica's Fleet Management Solution: A Deep Dive

In the rapidly evolving IT environment, observability at scale has become a critical challenge for organizations aiming to maintain operational excellence. The proliferation of telemetry collection agents across diverse infrastructures often increases complexity, resource strain, and configuration inconsistencies.

SLOs: a guide to setting and benefiting from service level objectives

If you’re running a technology-driven business, reliability isn’t optional—it’s essential. But how do you balance speed and innovation with a level of reliability that satisfies your customers? That’s where service level objectives (SLOs) come in. SLOs offer a framework for defining and achieving reliability goals, aligning technical efforts with user needs, and driving meaningful outcomes for your business.

Keeping Spending in Check: Observability's Positive Impact on Cost Management

Tool sprawl within organizations doesn’t just create a fragmented user experience; it poses a real threat to enterprises’ bottom lines. This fragmentation significantly limits worker productivity. IT leaders spend hundreds of hours trying to manage multiple tools, map their environments, and maintain aging systems that are either outdated or simply no longer necessary.

Quickly get rich, actionable context for alerts with Datadog's new Monitor Status page

Providing rich context for monitor alerts is an essential part of any robust, scalable monitoring strategy. Alerts that send teams scrambling for basic background information prolong troubleshooting, hindering effective incident response and heightening the potential for service disruption. Given the increasing complexity of modern, distributed applications, however, breaking down knowledge silos in order to ensure consistent access to critical context for alerts can be a challenge.

Wireless Network Management with Site24x7

Struggling with Wi-Fi connectivity issues? Wireless LAN controllers (WLCs) are the backbone of enterprise networks, but they’re not without challenges. From access point disconnections to overloaded controllers, even small issues can disrupt your operations. With Site24x7, you can proactively monitor and optimize your wireless network. Get real-time insights, detailed analytics, and instant alerts to troubleshoot problems before they impact users.

How to Optimize Website Images: The Complete 2025 Guide

Images are big. Really big. The bytes required for an image dwarf most sites’ CSS and JavaScript assets. Slow images will damage your Core Web Vitals, impacting your SEO and costing you traffic. Images are usually the element driving Largest Contentful Paint, and load delays can increase your Cumulative Layout Shift. If you’re not familiar with these metrics, check them out in the Definitive Guide to Measuring Web Performance.

Taking a step towards network resilience: The importance of real-time alerts

Is your network prepared to handle unexpected disruptions, or are you constantly in fire-fighting mode? As organizations become increasingly reliant on uninterrupted connectivity, network downtime, slow response times, or undetected vulnerabilities can directly affect customer experience, employee productivity, and even your bottom line. So, how can you proactively address these challenges?

Resolving Heroku deployment issues using comprehensive log data

Deploying applications on Heroku offers a streamlined process for developers, but even the most well-optimized setups can encounter deployment issues. To effectively resolve these issues, it's crucial to gain real-time insights into your app’s behavior, traffic, and performance metrics. The solution to resolving Heroku deployment challenges lies in leveraging the power of log management.

10 Kubernetes Monitoring Tools You Can't Miss in 2025

Monitoring a Kubernetes cluster isn’t just about keeping an eye on CPU and memory usage. It’s about understanding system health, detecting anomalies before they cause outages, and ensuring applications run smoothly. With so many tools available, choosing the right one can feel overwhelming. This guide covers the best Kubernetes monitoring tools, their use cases, and key factors to consider.

Find and Fix Performance Bottlenecks with Sentry's Trace Explorer

We’ve all worked on that app that hangs just a little too long in weird places, or had that query we could never get to perform just right. The network waterfall in Chrome DevTools can’t quite show us what’s going on behind the scenes, and tracing with OTel (and honestly, tracing in Sentry) was just… hard. Today that changes.

CLI Operations for InfluxDB 3 Core and Enterprise

This blog covers the nitty-gritty of essential command-line tools and workflows to effectively manage and interact with your InfluxDB 3 Core and Enterprise instances. Whether you’re starting or stopping the server with configurations like memory, file, or object store, this guide will walk you through the process. We’ll also look at creating and writing data into databases using authentication tokens, exploring direct line protocol input versus file-based approaches for tasks like testing.

SSHD Logs 101: Configuration, Security, and Troubleshooting Scenarios

Secure Shell (SSH) is a fundamental tool for remote system administration, and its logs play a critical role in security monitoring, debugging, and compliance. SSHD logs provide insights into authentication attempts, connection successes, failures, and potential intrusions. This guide explores everything you need to know about SSHD logs, including their location, format, analysis, and lesser-known security practices to maximize their effectiveness.

Website Performance Benchmarks: What You Should Aim For [with Examples]

When it comes to your website, speed is everything. A slow site frustrates users, drives up bounce rates, and even impacts your revenue. That’s where website performance benchmarks come in. They help you figure out how well your site is performing, where it needs improvement, and—most importantly—what you can do to make it faster. In this guide, we'll walk you through the key benchmarks, the tools you need, and a few tips that’ll help your site outshine the competition.

Top 11 API Monitoring Tools You Need to Know

APIs are the backbone of modern software, quietly powering everything we interact with. But just because they’re invisible doesn’t mean they can’t run into issues. From response times to uptime, keeping an eye on your APIs is key to making sure everything works smoothly. In this guide, we’ll explore 11 popular API monitoring tools to help you find the one that best fits your needs.

How to Set Up Actually Useful SLOs | Introduction to SLOs | Grafana Labs

Service Level Objectives (SLOs) should be more than just numbers on a dashboard: they should help your team deliver real value to your users. In this video, Jake Swiss from Grafana Labs walks you through three simple steps to create SLOs that align with business goals and drive better decision-making.
Step 1: Understand What Really Matters – Align SLOs with customer expectations
Step 2: Define Clear, Measurable Targets – Use RED metrics (Rate, Errors, Duration) to track meaningful performance
Step 3: Continuously Iterate & Fine-Tune – Adjust SLOs based on historical data and team feedback
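
As a rough illustration of Step 2, the sketch below computes RED metrics from a batch of request records and checks them against targets. The `Request` record, the 99% availability target, and the 300 ms p95 target are hypothetical values chosen for this example, not figures from the video.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # False if the request ended in an error


def red_metrics(requests, window_seconds):
    """Return rate (req/s), error ratio, and p95 duration for one window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile; integer math avoids float rounding.
    rank = (95 * total + 99) // 100
    return {
        "rate": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[rank - 1],
    }


def meets_slo(metrics, availability_target=0.99, p95_target_ms=300.0):
    """Check hypothetical availability and latency targets."""
    availability = 1.0 - metrics["error_ratio"]
    return (availability >= availability_target
            and metrics["p95_ms"] <= p95_target_ms)


# 98 fast successes and 2 slow failures observed over a 60-second window.
sample = [Request(100.0, True)] * 98 + [Request(1000.0, False)] * 2
metrics = red_metrics(sample, window_seconds=60)
print(metrics, meets_slo(metrics))
```

In this sample the latency target is met, but 2% of requests fail, so the availability target is missed and `meets_slo` returns `False`.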

How to Overcome Alert Fatigue in Your Alerting System | Introduction to SLOs | Grafana Labs

Cut Through Alert Noise with SLOs! Tired of endless alerts that don’t reflect real issues? SLOs (Service Level Objectives) help reduce noise by focusing on what truly impacts users. Instead of reacting to every minor spike, set SLOs to trigger alerts only when reliability is at risk.
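
One common way to alert "only when reliability is at risk" is to page on error-budget burn rate rather than on raw error spikes. The sketch below is a minimal version of that idea, assuming a 99.9% SLO and a burn-rate threshold of 14.4 checked over a long and a short window; the function names and the numbers are assumptions for illustration and are not taken from the video.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than allowed the error budget is being spent."""
    budget = 1.0 - slo_target  # e.g. 0.1% of requests may fail at 99.9%
    return error_ratio / budget


def should_page(long_window_errors, short_window_errors,
                slo_target=0.999, threshold=14.4):
    """Page only if both windows burn budget faster than the threshold.

    The long window shows the problem is significant; the short window
    shows it is still happening, so brief spikes do not page anyone.
    """
    long_burn = burn_rate(long_window_errors, slo_target)
    short_burn = burn_rate(short_window_errors, slo_target)
    return long_burn >= threshold and short_burn >= threshold


# Sustained 2-3% errors in both windows: page.
print(should_page(long_window_errors=0.02, short_window_errors=0.03))
# A short spike while the long window is healthy: no page.
print(should_page(long_window_errors=0.0005, short_window_errors=0.05))
```

A burn rate of 14.4 corresponds to spending 2% of a 30-day error budget in one hour, which is why it is a popular starting point; the right threshold for your service may differ.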

Kentik - Cloud Observability

Kentik Cloud provides comprehensive visibility across all major public clouds, offering seamless insight into cloud-to-on-prem network paths and the public internet routes connecting them. Identify latency, loss, jitter, and application-specific traffic while providing deep visibility into cloud networking constructs like ACLs to spot security issues. With powerful analytics, Kentik Cloud enables you to visualize intra-cloud traffic, identify idle resources for optimization, and leverage historical data to uncover trends and seasonal patterns—ensuring optimal cloud performance and cost efficiency.

Kubernetes 101

When you get behind the wheel of your car, one of the first things you see is the dashboard. Your dashboard provides various information about all the different technologies that make the car run smoothly, like helping you control your speed, providing insight into your fuel levels, and offering suggestions for regular maintenance, like oil changes. For developers, Kubernetes acts as that one-glance dashboard to provide insights about container performance, maintenance needs, and storage requirements.

System Center 2025 Unveiled: Insights and Expert Discussion on SCOM, SCORCH, SCSM, and Beyond

The future of IT operations is here! Join us for an exclusive expert panel discussion on Microsoft System Center 2025, where industry leaders will explore the latest advancements and strategies for optimizing enterprise IT environments.

Top 3 tools for reporting Zendesk metrics

Zendesk is a popular choice for customer service and support, offering a range of tools to manage interactions and boost customer satisfaction. However, making sense of all the data it collects requires robust reporting tools. Zendesk Explore, Power BI, and SquaredUp are three powerful tools that can help you unlock valuable insights from your Zendesk data, but each has its unique strengths.

A Complete Guide on Synthetic Monitoring | How to Improve Your Web & App Performance?

Making your brand stand out in digital business is more challenging than it sounds. More than 2.87 million apps are available on the Play Store and other platforms. In fact, as per reports, almost 252,000 new websites are created and launched daily. Competing in this large market without proper monitoring and strategy is a complete waste of time.

9 essential metrics to track for effective IT operations with log management tools

Monitoring the correct metrics is crucial for efficient IT operations, as it ensures the smooth functioning of an organization's infrastructure. One crucial aspect of this process is log management, which empowers IT teams to address critical aspects of IT infrastructure, including performance, availability, security, resource usage, and integration.

How To Configure a PostgreSQL Datasource in Grafana

So, you’ve got a PostgreSQL database packed with juicy data, and you want to turn those raw numbers into slick, interactive Grafana dashboards? Good call! Grafana’s PostgreSQL datasource is like the secret handshake that lets you visualize your data in style—no extra ETL magic required. In this guide, we’ll walk through getting PostgreSQL and Grafana to play nice, covering everything from connection settings to query tuning.

January product updates

It’s been a busy month at StatusGator HQ as we focused on improvements to the status page — one of many features that helps you communicate the status of all your cloud services to your stakeholders. Here’s a quick recap of this month’s updates. Let’s take a look at what we’ve rolled out! As a reminder, you can see all these updates here on the blog as they are released or in our product update sidebar inside StatusGator.

Booking.com's Journey to Enhanced Observability

Since its early startup beginnings in Amsterdam, Booking.com has redefined the travel industry, establishing itself as a premier platform for millions of travelers worldwide. With over 28 million accommodation listings and a staggering 1.5 million room nights booked every day, Booking.com operates on a scale that demands a robust and constantly monitored infrastructure.

The Basics of Log Parsing (Without the Jargon)

Logs are crucial for understanding what's happening in your system, but they can often be hard to make sense of. Log parsing is the key to turning raw, unstructured data into something useful. In this blog, we'll explore the basics of log parsing, its importance, and how it helps you extract valuable insights from your logs without all the clutter.
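
To make "turning raw, unstructured data into something useful" concrete, here is a minimal sketch that parses one log line into named fields with a regular expression. The log format (`timestamp LEVEL service: message`) and the sample line are made up for this example; real logs will need a pattern that matches their own layout.

```python
import re

# Hypothetical format: "<timestamp> <LEVEL> <service>: <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


raw = "2025-02-10T12:34:56Z ERROR checkout-service: payment declined for order 1234"
print(parse_line(raw))
print(parse_line("not a structured line"))
```

Returning `None` for lines that do not match keeps malformed input from crashing the parser, so those lines can be counted or routed elsewhere.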

OpenTelemetry Processors: Workflows, Configuration Tips, and Best Practices

Most developers are familiar with OpenTelemetry’s core components: Traces, Metrics, and Logs. But there’s one part of the OpenTelemetry ecosystem that doesn’t always get the spotlight: processors. These behind-the-scenes operators shape your data pipeline, helping you filter, enrich, and fine-tune telemetry data before it reaches your backend systems. Processors play a key role in making sure your data is cleaner, more useful, and just the way you need it.

Syslog Protocol: A Reference Guide

Syslog was developed in the 1980s by Eric Allman as part of the Sendmail project and adopted by many systems over the years. When looking at Syslog, there are a few protocol options, each with slight differences. In this reference guide, I’ll break down the differences so that you have a guide to see these formats when utilizing this protocol.
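
One thing the common Syslog formats share is the leading priority value (`<PRI>`), which encodes facility and severity as `facility * 8 + severity`. The sketch below decodes it; the `decode_pri` helper is a name invented for this example, and the sample messages are illustrative lines in the older BSD style (RFC 3164) and the newer RFC 5424 style.

```python
import re

# Severity names in numeric order 0-7, as listed in RFC 5424.
SEVERITIES = [
    "emergency", "alert", "critical", "error",
    "warning", "notice", "informational", "debug",
]

PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def decode_pri(message):
    """Return (facility number, severity name) from a syslog message."""
    match = PRI_PATTERN.match(message)
    if match is None:
        raise ValueError("message does not start with <PRI>")
    pri = int(match.group(1))
    if pri > 191:  # 23 facilities * 8 + 7 is the largest valid value
        raise ValueError("PRI out of range")
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity]


bsd_style = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed"
rfc5424_style = "<34>1 2003-10-11T22:14:15.003Z mymachine su - ID47 - 'su root' failed"
print(decode_pri(bsd_style))
print(decode_pri(rfc5424_style))
```

Both lines carry the same priority, so they decode to the same facility and severity even though the rest of the header differs between the two formats.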

How CXOs can simplify compliance in high-regulation sectors

How do businesses in highly regulated sectors ensure network compliance while still fostering innovation and maintaining operational efficiency? As regulatory pressure and operational complexities increase, along with the growing divide between external demands and internal capabilities, traditional approaches to compliance are becoming outdated and insufficient for the future.

DOES Cache Rule Everything Around Me? - Using Compression for our Prometheus Cache

Checkly is a key part of a professional developer’s workflow, making it easy to know if your service is up or down, and measure performance. As we integrate with almost any development workflow, we also have Prometheus endpoints to let you use the popular Grafana stack to keep track of your site checks’ status. As large enterprise users grew in usage, their check performance data grew in parallel, and our endpoint started returning occasional 429 status codes.