Sponsored Post

How to Reduce Continuous Monitoring Costs

Continuous monitoring is a crucial practice in the fields of DevOps, cybersecurity, and compliance. It involves the proactive and ongoing process of observing, assessing, and collecting data from various systems, applications, and infrastructure components in real-time or near real-time. Continuous monitoring is closely related to observability, which goes beyond simple monitoring to provide a deep understanding of complex and dynamic systems.

How Data Ingestion Works in Elasticsearch (Quick Guide)

Before you can search, analyze, or visualize anything in Elasticsearch, you need data ingestion. In this quick guide, we explain how data moves from raw logs, metrics, or JSON into an index using tools like Logstash, Beats, or language clients. Learn why consistency matters more than perfection and how once data is ingested, it’s ready for search, analysis, and insight.
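As a rough illustration of what a language client sends during ingestion, the sketch below builds a newline-delimited JSON body of the kind Elasticsearch's `_bulk` endpoint accepts: one action line followed by one document line per event. The index name `app-logs` and the event fields are made up for the example, and nothing is sent over the network.

```python
import json


def build_bulk_body(index_name, documents):
    """Serialize documents into NDJSON: an action line, then the document itself."""
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # A bulk body has to end with a trailing newline.
    return "\n".join(lines) + "\n"


# Hypothetical events; keeping field names consistent across documents
# is what keeps the resulting index mapping predictable.
events = [
    {"@timestamp": "2025-01-01T00:00:00Z", "level": "error", "message": "disk full"},
    {"@timestamp": "2025-01-01T00:00:05Z", "level": "info", "message": "cleanup done"},
]

body = build_bulk_body("app-logs", events)
print(body)
```

Tools like Logstash or Beats typically handle this batching for you; the sketch only shows the shape of the data on its way into an index.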

How to Monitor OTP-Protected Web Applications

If you’ve ever used an online banking application to complete a transaction or gone through a checkout on an e-commerce platform, chances are you’ve interacted with an OTP-protected application. One-Time Passwords (OTPs) are at the center of most multi-factor authentication (MFA) systems. OTPs are temporary codes delivered by SMS, email, authenticator apps, push notifications, and similar channels.
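For the authenticator-app flavor of OTP, a synthetic monitor usually has to compute the code itself before it can log in. The sketch below is a minimal time-based OTP (TOTP) generator in the style of RFC 6238, using HMAC-SHA1 and a 30-second step. The shared secret shown is the RFC's published test key, not a real credential, and how a given monitoring product stores or supplies the secret is outside what the teaser covers.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute a time-based one-time password (HMAC-SHA1 variant)."""
    counter = unix_time // step  # number of time steps since the epoch
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble picks the offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


# RFC 6238 test key and timestamp (59 seconds after the epoch).
print(totp(b"12345678901234567890", 59, digits=8))
```

In a real check the monitor would pass the current time (for example `int(time.time())`) and submit the resulting code as the second factor.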

95% of AI Pilots Fail - Here's How to Be the 5%

When MIT released research showing that 95% of enterprise AI pilots fail to deliver measurable business impact, it made headlines for a reason. After years of heavy investment in artificial intelligence, the vast majority of organizations still haven’t moved beyond pilots that promise much but deliver little. This doesn’t mean AI itself is broken. In most cases, the technology performs as intended.

(ServiceNow + Kentik) From Reactive to Proactive: The Rise of Agentic Networks

Agentic AI is not just hype—it’s a force multiplier that enables infrastructure and operations teams to do more, with less effort, in less time. Importantly, it helps IT teams compress time to resolution and even proactively detect and respond to issues, before they escalate.

Nonsense Networking: Tech Talk #8

Ever feel like getting simple data from your network is way harder than it should be? You're not alone. With so many devices, the amount of data can be overwhelming, making it tough to see what's actually happening. In this stream, we're breaking down the common frustrations with network monitoring. We'll cover: The SNMP Problem: We'll start with why the "standard" method, SNMP, is often a pain. We'll look at the challenge of finding the right MIBs and OIDs just to get tools like Telegraf or Prometheus to work.

AI, IT and HR: Strategy, Risks, and the Future of Work (w/ Ben Eubanks)

Tim and Tom are joined by Ben Eubanks, Chief Research Officer at Lighthouse Research & Advisory, bestselling author of Artificial Intelligence for HR, and a leading thinker on the intersection of people, technology, and the future of work. Together they explore how AI is reshaping HR — not only in how the function operates day-to-day, but also in how it redefines HR’s outward-facing strategic role in organizations.

ALL NEW PARALLEL: S1E4 - The Virtual Murder

Back AGAIN by popular demand! Here's episode 4 of Tim Flower's brand new IT mystery series, Parallel. The team at Zentech has been hard at work investigating a murder. This time it's actually time that's the victim, and everyone is a suspect - especially the VDI team. "The Foundation" makes another appearance, and the team gets closer to uncovering the source of many of their recurring IT issues, with the help of an AI assistant and a very persistent DEX team.

Set up Splunk AI Assistant for SPL in Enterprise environments with Cloud Connected Integration

Unlock the power of the Splunk AI Assistant for SPL in your enterprise environment! In this quick tutorial, we'll walk you through the entire process, from downloading the app on Splunkbase, accepting the license agreement, and installing it in your environment, to completing the cloud-connected configuration which now allows you to use the AI Assistant in even more environments!

Serverless Applications: Why Monitoring is Essential for Speed and Reliability

Serverless applications are becoming the go-to architecture for modern developers. Startups and enterprises are building serverless applications because they offer scalability, cost-efficiency, and flexibility. However, these advantages come with unique challenges, especially when it comes to monitoring serverless applications. Traditional server monitoring tools fail to capture short-lived functions, making serverless application monitoring essential for maintaining performance and reliability.

Why Do SSL Certificates Fail in Multi-Cloud Environments (AWS, Azure, GCP)?

SSL certificates keep websites and apps secure, but in AWS, Azure, and Google Cloud Platform (GCP), misconfigurations or expirations can still cause services to go offline. Why do these failures happen, and how can you prevent them?
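Expiration is the easiest of these failure modes to check ahead of time. The sketch below turns a certificate's `notAfter` string (the format Python's `ssl` module reports from `getpeercert()`) into days remaining. The 14-day warning threshold and the sample date are assumptions for illustration, and fetching the certificate from a live endpoint is left out.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and the certificate's notAfter time."""
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires_at - now.timestamp()) / 86400


# Hypothetical values: a fixed "now" keeps the example reproducible.
now = datetime(2018, 1, 1, tzinfo=timezone.utc)
remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT", now)

WARN_DAYS = 14  # assumed alerting threshold
if remaining < WARN_DAYS:
    print(f"certificate expires in {remaining:.1f} days")
```

Running the same check against every endpoint in each cloud, rather than only the ones a single provider's console lists, is one way to catch certificates that fall between teams.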

A Practical Guide to Python Application Performance Monitoring (APM)

When your Python app starts slowing down (maybe queries are taking longer, memory keeps creeping up, or API calls are lagging), basic server metrics won’t tell you why. You need to see what’s happening inside the application itself. That’s the role of Application Performance Monitoring (APM). It gives you a breakdown of database queries, external API calls, memory usage, error rates, and more, so you can connect the dots between code and performance.

High Availability by Design | WhatsUp Gold

As IT environments grow more distributed and resilient, the Progress WhatsUp Gold network monitoring solution is evolving to meet the moment. Starting in early 2026, Progress will officially retire the legacy Failover Manager and usher in a new era of high availability (HA) by design. This modern, scalable approach aligns with today’s best practices in infrastructure. Find more information on High Availability by Design.

DORA Compliance Software Options And Use Cases

DORA entered into application on January 17, 2025, and since then, DORA compliance software, such as Spektion, has become an essential part of many DORA-compliant workflows. However, in this article, we go beyond just one software solution and round up the most common DORA compliance software categories that covered entities are currently using. We also examine what they excel at and how they come together in the context of DORA compliance.

Top tips to keep calm when everything is needed ASAP

Top tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we’re looking at how to keep your cool when everything lands on your desk with an ASAP tag. There’s always that day at work. Meetings stacked back-to-back, emails piling faster than you can open them, and just when you think you’ve got a handle on things, your boss drops the golden line: Can you get this done today?

Getting started with Jira dashboards

Jira is an industry favorite when it comes to managing software projects, yet its native dashboards can sometimes leave teams wanting more insight. The default views give a general update, but often lack the connection to the day-to-day activity happening in other parts of your workflow. As organizations use a wide mix of modern tools, from code repositories and cloud services to spreadsheets and reporting apps, it’s easy for critical details to get scattered or overlooked.

Real-time Alerting for Data Center Networks

Kentik’s Phil Gervasi shows how modern data centers—especially those powering AI workloads—can spot and fix problems before they impact performance or budgets. See how Kentik’s Data Explorer helps you identify disruptive flows, reclaim wasted network capacity, and turn insights into real-time alerts. With monitor-only mode and integrations with systems like PagerDuty and ServiceNow, your network becomes its own early warning system—driving uptime, cost savings, and better AI performance.

Reality Bytes #62: Digital Overload - The Distraction Episode

In this episode of Reality Bytes, Sean, Oriana, Dina, Tim, and Tom dive into the ever-present challenge of digital distraction in the workplace. From smartphones and smartwatches to endless Teams notifications, our panelists share their personal "kryptonite" when it comes to staying focused.

How should Prometheus handle OpenTelemetry resource attributes?

Note: A version of this post originally appeared on the OpenTelemetry blog. Victoria Nduka is a user experience designer and open source contributor making her way into the cloud native space. She writes about design, accessibility, and open source with the same curiosity she brings to her work. On May 29, I wrapped up my mentorship with Prometheus through the Linux Foundation mentorship program.

The core KPIs of LLM performance (and how to track them)

A few months ago, I built an MCP server for Toronto’s Open Data portal so an agent could fetch datasets relevant to a user’s question. I threw the first version together, skimmed the code, and everything looked fine. Then I asked Claude: “What are all the traffic-related data sources for the city of Toronto?” The tool call fired. I got relevant results. And then I hit an error: “Conversation is too long, please start a new conversation.” I had only asked one question.

Understanding Incident Response vs Incident Remediation

At a high level, incident remediation is a part of the incident response process. An incident response plan manages the incident lifecycle across planning, detection, investigation, and recovery. Meanwhile, incident remediation focuses on identifying root causes and implementing measures to prevent future occurrences.

Data Sovereignty vs Data Residency vs Data Localization

Awareness of data sovereignty is increasing within organizations. Geo-political situations and recent news stories are causing many to formally evaluate their data management strategies and policies. This means that organizations are also looking at the tools and platforms they use to run and maintain key IT infrastructure and undertake tasks such as monitoring and management. SaaS and cloud first/only tooling can often present data sovereignty challenges and complications.

Built for Scale: Why Enterprises, GSIs, and MSPs Choose ScienceLogic for Intelligent Operations

As companies shift towards a digital-first strategy and enterprise architectures become more complex to support it, the demands on IT operations platforms have evolved significantly. Today’s global enterprises, system integrators (GSIs), and managed service providers (MSPs) require more than traditional observability tools. They need scalable, intelligent platforms that can manage sprawling environments with consistency, speed, and precision.

Updated Guide: Using Tracealyzer with IAR Embedded Workbench for Arm

Using IAR Embedded Workbench for Arm with an IAR I-jet probe? Did you know this provides an excellent data channel for Tracealyzer trace streaming? We have just updated Percepio Application Note PA-023 with a simpler setup for trace streaming over ITM/SWO, enabled by improvements in IAR’s ITM logging support. This makes it easier than ever to combine IAR’s powerful debugging with Tracealyzer’s RTOS-level insight. Read the updated guide here.

The vendor trap: why your next outage won't be your fault but will be your problem

Today’s enterprises don’t run on singular self-contained systems—they’re intricate webs of interdependence: cloud services, APIs, CI/CD tools, DNS, CDNs, SASE vendors, identity management providers, cloud interconnects, ISPs, SaaS applications, application components, microservices, etc. A recent industry survey found that 84% of organizations suffered operational disruption from third-party risk incidents, with 66% facing adverse financial impact.

The Right Tool for the Right Job: How to Bring CSV Data into InfluxDB 3

Comma-separated value (CSV) files are one of the simplest formats for structured data and remain widely used across industries. From machine exports to business reports, CSVs are easy to create, edit, and share. They serve as a backbone for data management, ensuring teams can exchange information quickly and consistently. However, CSVs alone are static. When ingested into a time series database, they shift from flat files to part of a living data pipeline.
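To make the "flat file to pipeline" step concrete, here is a minimal sketch that converts CSV rows into InfluxDB line protocol (`measurement,tags fields timestamp`). The column names, the `machine_temp` measurement, and the assumption that the timestamp column already holds nanoseconds are all invented for the example. It only handles float fields and does not write to a database.

```python
import csv
import io


def escape_tag(value: str) -> str:
    """Escape the characters that are special inside line protocol tag values."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def csv_to_line_protocol(text, measurement, tag_cols, field_cols, time_col):
    """Turn each CSV row into one line: measurement[,tags] fields timestamp."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        tags = ",".join(f"{col}={escape_tag(row[col])}" for col in tag_cols)
        fields = ",".join(f"{col}={float(row[col])}" for col in field_cols)
        head = f"{measurement},{tags}" if tags else measurement
        lines.append(f"{head} {fields} {int(row[time_col])}")
    return lines


# Hypothetical machine export with a nanosecond timestamp column.
sample = "time_ns,machine,temp_c\n1700000000000000000,press 1,71.5\n"
lines = csv_to_line_protocol(sample, "machine_temp", ["machine"], ["temp_c"], "time_ns")
print("\n".join(lines))
```

Choosing which columns become tags (indexed metadata) and which become fields (measured values) is the main design decision when mapping a CSV into a time series schema.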

Windows Security Event Collection for Microsoft Sentinel with Datastream

Collecting Windows Security Events has always been a necessary but difficult job. Traditional methods depend on third-party collectors that must be installed, configured, and constantly maintained. They break, they lag behind updates, and they create unnecessary operational work. At the same time, they often flood Microsoft Sentinel with redundant or irrelevant data, driving up costs and slowing down investigations.

What is Database Monitoring

Database monitoring transforms from a reactive troubleshooting exercise into a proactive optimization strategy when you have the right tools and approaches in place. This blog shares practical ways to choose monitoring solutions, set up observability for different database platforms, and design workflows that scale in modern distributed systems.

OpenTelemetry Deep Dive: Resilience & High Availability in the OTel Collector

Missed it live? Catch the full recording of OpenTelemetry Deep Dive: Resilience & High Availability in the OTel Collector — a 1-hour workshop on building telemetry pipelines that never drop a signal. We’ll show you why resilience matters, how to design high-availability architectures, and how to configure the OpenTelemetry Collector with retries, batching, and persistent queues. Plus, you’ll see live demos in both Docker and Kubernetes — including scaling Gateway collectors with an HPA — and how Bindplane makes large-scale management seamless.

Advances in Furnace Repair Through Modern Technology

Heating systems have long been a cornerstone of comfortable living, keeping homes and workplaces warm through the coldest months. Over time, furnace repair has shifted from manual inspection and guesswork to a field guided by technological precision. These advances not only improve repair accuracy but also reduce downtime, energy costs, and long-term maintenance burdens. Modern tools, smart diagnostics, and digital platforms have shaped an environment where technicians can provide faster, safer, and more effective care for heating systems.

Stay ahead of downtime: OpManager's new mobile widgets redefine on-the-go network monitoring

In today’s mobile-first IT world, the difference between reacting late and staying ahead lies in how quickly you access critical data. For network admins who are always on their toes, waiting to open the app to check for alarms or down devices can be a bottleneck. That’s where OpManager’s latest mobile app upgrade comes in, with interactive home screen widgets designed to deliver instant visibility and control—without even launching the app.

What's New in InfluxDB 3.4: Simpler Cache Management, Provisioned Tokens, and More

Today, we’re releasing InfluxDB 3.4 for Core and Enterprise, as well as our 1.2 update for the Explorer UI. This release focuses on developer efficiency, operational automation, and targeted security enhancements, giving teams faster setup, smoother workflows, and stronger guardrails for production use. InfluxDB 3 Core is free and open source, optimized for recent data, and licensed under MIT and Apache 2.

The business impact of Elasticsearch logsdb index mode and TSDS

The Elasticsearch storage engine team has made significant strides in improving storage efficiency and performance in Elasticsearch 8.19 and 9.1. Now that these changes are available, what impact can they have on your business? And how do you make the most of them?

The inadequate guide to Rails security

If you're like me, you got into this business because you love building awesome apps. If you've been in the development space long enough, you'll eventually have to do work on those awesome apps that doesn't feel so awesome. Security can be one of those things. Taking Rails security seriously is important, even though the Rails framework does much of the heavy lifting. Before we get too deep into the details of Ruby on Rails security, let's take a second to reflect on the good times. ...

Smarter Network Monitoring: Reduce Alert Noise for MSPs & IT Teams

If you’ve ever worked in a loud office, you know the drill: A co-worker’s on a call, someone’s talking about the next Taylor Swift album in the break room, another’s constantly clearing their throat, and the HVAC sounds like a jet engine. It’s loud. Your brain tries to filter it all out, but it’s no use. Then you put on noise-canceling headphones… and suddenly, you can think again.

Optimize application performance at the network layer: introducing HTTP Performance Insights in Frontend Observability

Imagine you’re a frontend engineer monitoring the user experience for an e-commerce app. You notice your checkout flow has a 15% abandonment rate. Your API responses are inconsistent. Your users are frustrated, and you’re drowning in data and complex queries trying to figure out why. Sound familiar? You can use real user monitoring (RUM) to determine what has happened, looking at page load times, error counts, user sessions, etc.

Full-Circle Observability: Using SigNoz to monitor a LangChain agent that queries SigNoz MCP

In Part 1 of this series, we explored how to instrument a LangChain trip planner agent with OpenTelemetry and send telemetry data to SigNoz. By tracing each step of the planning process: LLM reasoning, tool calls for flights, hotels, weather, and activities, and the final itinerary response, we saw how observability turns a black-box agent workflow into a transparent, debuggable system.

LangChain Observability: How to Monitor LLM Apps with OpenTelemetry (With Demo App)

LangChain has become one of the most popular frameworks for building LLM-powered applications, making it easier to create agents that can reason, plan, and take actions. But like any production-grade AI app, LangChain agents can run into performance bottlenecks, hallucinations, or tool call failures. And without proper LangChain observability, it’s hard to know where things break down.

EP #2: Valkey, Vector, Redis, and the History of Databases - The Open Source Observability Podcast

In this episode we learn how Valkey, the lightning-speed open source key-value datastore, can help improve your observability toolstack. Dive in to learn what differentiates a NoSQL data store from a relational database, more about data structures such as HyperLogLog and Bloom Filter, and all about the history of how data is stored.

You don't control most of the infrastructure your digital services rely on.

However, your customers still expect a flawless experience, every time. The complexity of modern architectures (CDNs, DNS, APIs, cloud platforms) means that even “simple” applications can break in ways you don’t see coming. So how do you stay ahead of issues you don’t even own? By monitoring the digital delivery chain as your users experience it, across networks, geographies, and third-party dependencies, and catching performance degradations before they become business problems.

Eliminate cloud waste across AWS, Azure, and Google Cloud with Cloud Cost Recommendations

As organizations increasingly adopt multi-cloud strategies, identifying areas to reduce cloud spend has become highly complex and time consuming. While there are many reasons that organizations choose to run their infrastructure in a multi-cloud environment, many do so to comply with regional data requirements, take advantage of best-of-breed offerings, or avoid vendor lock-in.

Reduce cloud waste with Datadog Cost Recommendations

Struggling to optimize your cloud spend across AWS, Azure, and Google Cloud? Datadog Cloud Cost Management highlights underutilized or legacy resources and lets engineers take immediate action using Datadog Workflows. Eliminate waste and drive savings with recommendations that your teams can trust.

Optimize Kubernetes and Container Costs with Datadog Cloud Cost Management

Struggling to understand the true cost of your Kubernetes workloads? With Datadog Cloud Cost Management, you can automatically allocate container costs by team, product, and service down to the pod. Instantly identify idle resources, surface optimization opportunities, and act with confidence. All in one unified platform.

How to surface misconfigured resources by defining policies | Datadog Tips & Tricks

Misconfigured infrastructure resources can be easy to miss, especially in multi-account or multi-cloud environments. From EKS clusters running on deprecated versions to RDS engines on extended support, these issues can disrupt services or drive up costs if left unchecked. In this video, we show you how to surface these misconfigurations by defining policies. By centralizing policies, you’ll gain a clear view of where to focus your remediation efforts.

Tech Talk - Mastering Data Pipelines Unlocking value with Splunk

Join this Tech Talk to learn how Splunk can help you unlock the value of your security and observability data by building an effective data management strategy. Understand how Splunk’s approach to federated data management can help you maximize the value of data. Build effective pipelines using our latest SPL2-powered data processing capabilities to collect, transform, and route data based on your business needs. Run effective searches on data in Amazon S3 without having to ingest or index data into Splunk.

Tech Talk - Aligning Observability Costs with Business Value Practical Strategies

Learn how to tackle the challenges of growing telemetry data and optimize your observability model to maximize value while minimizing costs. This session will explore strategies to reduce log ingestion, centralize pipeline management, and gain visibility into metric usage to identify waste.

AI That Knows Networking: Selector vs. Generic GPT Integrations

The hype around generative AI has led many IT teams to experiment with plugging generic GPT models into their workflows. On paper, this is the beginning of true AI networking, featuring conversational interfaces, instant summaries, and faster troubleshooting. However, as we discussed in the previous post, “Why Your IT Copilot Needs Context, Not Just Data,” copilots are only as effective as the intelligence behind them.

Monitoring websites from the United States

Monitoring your websites from the US region is critical for serving users from the US as it helps you improve website performance and compliance-related practices, ensure business continuity, and offer a better customer experience. Just a few milliseconds of added lag specific to US connections can impact bounce rates and conversions, making localized monitoring essential.

Observability Without Limits - Uptrace Pricing Explained

Welcome to Uptrace, the modern observability platform. Our pricing is simple: pay only for the data you ingest. Plans include unlimited users, services, and hosts; spans and logs are billed per uncompressed GB; metrics are billed by active timeseries; and volume discounts apply automatically as your usage grows. The free trial includes 1 TB of spans and logs and 100,000 timeseries, with no credit card required.

Catch core banking issues (before they impact customers and compliance)

APAC customers have high expectations around instant payments, open banking, and mobile-first experiences. In March 2025, India’s real-time payment system, UPI, went down for five hours. Millions experienced payment failures, failed fund transfers, and login errors, and many vented their frustrations on social media. With banking and payment disruptions on the rise, regulators are calling for proof of resilience.

Targeting hosts and services in Icinga 2 API requests

Today, we are going to take a look at the Icinga 2 API and the various ways targets can be specified for different actions, such as querying information or scheduling downtimes. This post focuses on the API request payloads themselves and assumes some familiarity with sending requests to the Icinga 2 API. Please refer to our documentation for the missing details if you want to try the requests yourself. In general, specifying the objects to which an action applies works the same way for all actions.
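As a small illustration of targeting by filter, the sketch below assembles a request payload of the kind the Icinga 2 API accepts for scheduling a downtime: an object `type`, a `filter` expression, and `filter_vars` that bind values into the expression. The host variable `os`, the timestamps, the author, and the comment are placeholder values, and the payload is only printed, not sent.

```python
import json


def downtime_payload(object_type, filter_expr, filter_vars, start, end, author, comment):
    """Build the JSON body for a schedule-downtime action targeting objects by filter."""
    return {
        "type": object_type,         # which object type the filter is evaluated against
        "filter": filter_expr,       # expression selecting the target objects
        "filter_vars": filter_vars,  # values referenced by name inside the filter
        "start_time": start,
        "end_time": end,
        "author": author,
        "comment": comment,
    }


# Placeholder values: every Linux host, for a one-hour window.
payload = downtime_payload(
    "Host",
    "host.vars.os == os",
    {"os": "Linux"},
    1760000000,
    1760003600,
    "icingaadmin",
    "Kernel patching",
)
print(json.dumps(payload, indent=2))
```

Passing values through `filter_vars` instead of concatenating them into the filter string avoids quoting problems when the values come from user input.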

Will custobots drive $98 trillion in payments by 2027?

It starts like this. You wake up groggily and stumble into the kitchen. The coffee machine is already brewing your favourite blend. But that’s not the surprise. It's the message on your phone: "Coffee beans restocked. $10 paid. Delivery by noon." You didn’t place an order. You didn't lift a finger. Your machine did it for you. Welcome to 2025, where your devices aren't just smart, they're economically independent. These are machine customers (custobots). They negotiate. They pay.

5 DevOps Team Structures (Plus Actionable Strategies for Automation, Monitoring & Culture Change)

An effective DevOps team is about creating the right structure, culture, and processes that enable collaboration across traditionally siloed departments. The right DevOps team structure can dramatically improve software delivery speed, reliability, and overall customer satisfaction. But what exactly makes a great DevOps team? And how can you build one that works for your organization?

How to measure and fix latency with edge deployments and Sentry

A 2017 study by Google, researchers found: That was over 8 years ago. And let’s be honest, it’s not likely users have found any additional patience in that time. Web Vitals are a set of performance metrics defined by Google that measure user experience. They focus on things like LCP (how long the main content takes to load), INP (how quickly the page responds to input), and CLS (how visually stable the app is, meaning whether content shifts unexpectedly).
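To show how these three metrics are usually read, the sketch below rates a measurement against Google's published Web Vitals thresholds (LCP 2.5 s / 4 s, INP 200 ms / 500 ms, CLS 0.1 / 0.25). The function name and the sample values are illustrative, and the code is independent of any particular monitoring SDK.

```python
# (good_upper_bound, needs_improvement_upper_bound) per metric.
THRESHOLDS = {
    "LCP": (2500, 4000),  # milliseconds
    "INP": (200, 500),    # milliseconds
    "CLS": (0.1, 0.25),   # unitless layout-shift score
}


def rate(metric: str, value: float) -> str:
    """Classify a Web Vitals measurement as good, needs improvement, or poor."""
    good, needs_improvement = THRESHOLDS[metric]
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs improvement"
    return "poor"


# Hypothetical measurements, for example 75th-percentile values per page.
for metric, value in [("LCP", 2400), ("INP", 350), ("CLS", 0.3)]:
    print(metric, value, rate(metric, value))
```

Comparing these ratings before and after moving a route to the edge gives a simple way to tell whether a deployment change actually helped users.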

Put Cloud Costs in Front of Engineers with Datadog Cloud Cost Management

Tired of surprises on your cloud bills? With Datadog Cloud Cost Management integrated into the Software Catalog, engineers see cost, performance, and reliability side by side—no context switching required. Give every service owner the visibility they need to make cost-aware decisions.

Monitor Apple Silicon GPU on macOS with macmon + Hosted Graphite

Your Mac’s GPU is a massively parallel processor that handles anything from animating the UI to heavy lifting in video editors, 3D tools, games, and on-device machine learning models. Think Final Cut Pro exports, Blender renders, Stable Diffusion, WebGPU demos, or shader builds in Xcode, all of which lean heavily on the GPU.

Introducing ping and TCP port monitoring (and lots of other improvements)

A couple months ago, we sent out a survey to all our users asking what they like about Oh Dear, how they use it, and how we could improve our service. One of the things that was asked a lot was ping and TCP port monitoring. The past few months we worked hard to add this kind of monitoring to our service. And while building it, we touched upon other parts of our service and improved lots of little things. And I'm proud to share that we now have shipped it all! Let's go through it!

Track Cloud Unit Economics with Datadog Cloud Cost Management

Do you know the true cost per user, API call, or checkout? Datadog Cloud Cost Management lets you break down spend by combining cost, observability, and custom business metrics—all in one place. Track cost per transaction, alert on changes, and align engineering and finance with real-time unit economics.

We vibe coded a path tracer: Here's how we used static and dynamic analysis to fix it

When developing software, the longer you intend to keep a system around, the more important it becomes to prioritize its code quality. But as more organizations move toward microservice architectures and adopt agentic AI and LLMs into their development workflows, many engineering teams have increased their emphasis on accelerating developer velocity, often at the expense of code quality. This can often result in code that fails to meet standards for performance, reliability, and security.

Tech Talk - Holistic Visibility and Effective Alerting Across IT and OT Assets

Join this Tech Talk to learn how to gain complete visibility into all hosts and their potential vulnerabilities, misconfigurations, and unpatched components in a single analytics platform, how adding Tenable asset and exposure risk context improves alert prioritization, and how joint customers use Splunk for centralized reporting.

ManageEngine recognized as a Customers' Choice in the 2025 Gartner Peer Insights Voice of the Customer for Network Management Tools

We are thrilled to share that ManageEngine has been recognized as a Customers’ Choice in the 2025 Gartner Peer Insights Voice of the Customer for Network Management Tools. We are even more excited to be the only vendor positioned in the Customers' Choice quadrant for this category! This recognition is especially meaningful because it's completely based on reviews and feedback from our customers.

Visualize Logs Alongside Metrics: Complete Observability for Slow PostgreSQL Queries

When latency creeps into your app, metrics tell you that performance regressed, but logs tell you why. PostgreSQL’s slow-query logging gives you the exact statement, duration, user, and database, which is perfect for hunting down missing indexes, inefficient filters, or N+1 patterns.
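A minimal sketch of pulling those fields out of a slow-query log line follows. It assumes the server logs statements via `log_min_duration_statement` and uses a `log_line_prefix` that produces `timestamp [pid] user@database` before the message. The prefix layout and the sample line are assumptions, since the actual format depends on server configuration.

```python
import re

# Matches "... [pid] user@database LOG:  duration: N ms  statement: SQL".
SLOW_QUERY = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<user>\S+)@(?P<database>\S+)\s+LOG:\s+"
    r"duration:\s+(?P<duration_ms>[\d.]+) ms\s+statement:\s+(?P<statement>.*)"
)


def parse_slow_query(line: str):
    """Return a dict with pid, user, database, duration_ms, and statement, or None."""
    match = SLOW_QUERY.search(line)
    if match is None:
        return None
    record = match.groupdict()
    record["duration_ms"] = float(record["duration_ms"])
    return record


# Hypothetical log line under the assumed prefix.
sample_line = (
    "2025-09-01 12:00:00.123 UTC [4242] app@shop LOG:  "
    "duration: 1532.417 ms  statement: SELECT * FROM orders WHERE customer_id = 42"
)
print(parse_slow_query(sample_line))
```

Once durations are numeric and tagged with user and database, they can be charted next to query-latency metrics to see which statements line up with a regression.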

Evaluate and Improve Your Site's Web Performance With Honeycomb for Frontend Observability

As an engineer on Honeycomb’s frontend platform team, I’m constantly trying to understand and improve our web performance. And I have a whole lot of questions. I tried answering these types of questions without Honeycomb in the past, and it was difficult and time consuming. It used to take me days to identify performance issues and their causes, let alone fix them and confirm that they improved web performance for some subset of users.

Manage your dashboards and monitors at scale

In the early stages of building a system, a few well-placed dashboards and monitors can provide sufficient visibility into service health and performance. However, as infrastructure scales and teams grow, so does the complexity of the monitoring landscape. In organizations where individual teams manage their own services but rely on a central platform or observability team for tooling and guidance, this complexity can quickly multiply.

Exploring our new PHP SDK, built using Saloon

Today, next to Ping and TCP monitoring, we've also launched a new PHP SDK package, which has been rebuilt from scratch using the wonderful Saloon library. Using our new SDK, you can easily use the entire Oh Dear API. In this blog post, I'd like to show you how you can use the new SDK and how it works under the hood.

The Complete Angular Error Handling Guide for Production-Ready Apps

Your Angular app just crashed in production with ‘ERROR Error: Uncaught (in promise): ’. Sound familiar? After debugging countless production fires, I’ve learned that proper error handling isn’t optional—it’s the difference between sleeping through the night and getting paged at 3 AM.

What's new for scheduling and resource management in Kubernetes v1.34?

Kubernetes v1.34, which is scheduled for release August 27, 2025, focuses on improved scheduler visibility, deeper life cycle observability, and enhanced resource management. As always, the list of changes and improvements in the official changelog is extensive, and cluster operators may be wondering which changes are most important. If you're operating a monitoring platform or depend on deep Kubernetes observability, here's how a number of new features will affect your workflows.

Caddy Webserver Data in Graylog

If you’re running Caddy Webserver on Ubuntu, Graylog now has a new way to make your access logs more actionable without tedious parsing or manual setup. The new Caddy Webserver Content Pack, available in Illuminate 6.4 with a Graylog Enterprise or Graylog Security license, delivers ready-to-use parsing rules, streams, and dashboards so you can quickly turn raw logs into structured, searchable insights.

Real User Experiences: How Auvik Network Management Transforms Remote Support

When distributed teams need network support, traditional approaches often fall short. The difference between a quick remote fix and hours of on-site troubleshooting can make or break productivity for organizations with dispersed infrastructure. Based on feedback from real users on PeerSpot, an enterprise technology buying intelligence platform, Auvik Network Management is changing how IT teams deliver remote support by eliminating common barriers and reducing resolution times.

How Auvik Network Management Optimizes Network Performance: Real User Insights

Network performance challenges can cripple business operations, leaving IT teams scrambling to identify bottlenecks while users experience frustrating slowdowns. Without proper visibility into bandwidth utilization, latency issues, packet loss, and network availability, organizations risk reactive troubleshooting that costs time and productivity.

OpenTelemetry API vs SDK: Understanding the Architecture

When you're instrumenting applications with OpenTelemetry, you'll encounter two core components: the API and the SDK. The API defines what telemetry data looks like and how it is created, while the SDK handles how that data is processed and exported. Understanding this split helps you build more maintainable observability and avoid tight coupling between your business logic and telemetry infrastructure.
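
The split described above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not the real OpenTelemetry packages: `Tracer`, `NoOpTracer`, `RecordingTracer`, `get_tracer`, `set_tracer`, and `checkout` are all hypothetical names made up for this example.

```python
from abc import ABC, abstractmethod
from contextlib import contextmanager
import time


# "API" layer (hypothetical): interfaces plus a no-op default.
class Tracer(ABC):
    @abstractmethod
    def start_span(self, name: str):
        """Return a context manager that wraps one unit of work."""


class NoOpTracer(Tracer):
    @contextmanager
    def start_span(self, name: str):
        yield  # records nothing, so uninstrumented deployments pay almost no cost


_tracer: Tracer = NoOpTracer()


def get_tracer() -> Tracer:
    return _tracer


def set_tracer(tracer: Tracer) -> None:
    global _tracer
    _tracer = tracer


# "SDK" layer (hypothetical): decides how spans are processed and exported.
class RecordingTracer(Tracer):
    def __init__(self, exporter):
        self._exporter = exporter  # any callable that accepts a span dict

    @contextmanager
    def start_span(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self._exporter({"name": name, "duration_s": time.perf_counter() - start})


# Business logic depends only on the API layer.
def checkout(cart):
    with get_tracer().start_span("checkout"):
        return sum(cart)


exported = []
first_total = checkout([1, 2])                 # no-op tracer: nothing is exported
set_tracer(RecordingTracer(exported.append))   # wire in the "SDK" at startup
second_total = checkout([5, 10])
print(first_total, second_total, [span["name"] for span in exported])
```

Because `checkout` only ever calls `get_tracer()`, swapping the exporter or removing telemetry entirely does not touch the business code, which is the decoupling the excerpt refers to.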

Raising the bar in observability and security: Coralogix extensions at scale

In today’s high-velocity digital ecosystem, visibility isn’t enough. SREs and engineering leaders need real-time insights, actionable signals, and automated workflows to operate at scale. As systems grow more distributed and cloud-native, the demand for intelligent observability and security has never been higher. Extensions are solutions to get instant observability with prepackaged parsing rules, alerts, dashboards and more.

Grafana Campfire - Using the Drilldown Apps (Grafana Community Call - August 2025)

In this Campfire Community call, we will discuss the new Grafana Drilldown Apps and how they differ from Explore. We will discuss how they have been continuously evolving to become a core part of Grafana OSS, enabling users to access data easily.

Identify slowdowns across your entire network with Datadog Network Path

As modern infrastructure becomes increasingly distributed across on-premises data centers, multi-cloud environments, ISPs, and remote offices, understanding how traffic flows across your network is critical to delivering reliable performance and great user experiences. But pinpointing the source of network slowdowns remains one of the most persistent challenges for operations, network, and IT teams.

How to Prove DNS Monitoring ROI to Clients (Without Getting Technical)

Most clients don’t care how DNS works—until it breaks. But as an MSP, you know the damage a single DNS misconfiguration or unnoticed change can cause. So how do you prove the ROI of DNS monitoring to clients who don't speak in TTLs or CNAMEs? Here’s how to bridge the gap between technical benefits and business value—so your clients understand exactly why they’re paying for DNS protection.

How Tipalti mastered Elasticsearch performance with AutoOps

From manual monitoring to proactive optimization, learn how Tipalti used AutoOps to save 10% in annual costs. For a global payables automation leader like Tipalti, where financial transactions are the lifeblood of the business, infrastructure performance isn't just a technical goal; it's a core business requirement. Managing a complex ecosystem of databases, including Postgres, SQL Server, MongoDB, Kafka, and Elasticsearch, with a lean team of four engineers demands efficiency and powerful tooling.

High Availability by Design: WhatsUp Gold Strategic Shift from Failover

As IT environments grow more distributed and resilient, the Progress WhatsUp Gold network monitoring solution is evolving to meet the moment. Starting in early 2026, Progress will officially retire the legacy Failover Manager and usher in a new era of high availability (HA) by design. This modern, scalable approach aligns with today’s best practices in infrastructure.

The Convergence of ITSM and EAM: Why Unified Operations Matter More Than Ever

The need to differentiate IT Service Management (ITSM) and Enterprise Asset Management (EAM) has now become impractical in an era of immense technological complexity and unlimited demands for operational efficiency. Organizations today increasingly rely on both digital services and physical assets to derive value; however, siloed processes and disparate data repositories lead to slow incident resolution, uncoordinated change initiatives, and hidden risks.

How to Monitor WiFi Access Points: Best Practices for Business WiFi

WiFi Access points (APs) are the foundation of business WiFi. They’re the devices making sure laptops, smartphones, and even IoT gadgets connect reliably without cables. If an access point fails or becomes overloaded, the entire wireless experience can collapse, no matter how strong your Internet connection is. By keeping a close eye on your APs with the right WiFi access point monitoring software, you can catch issues before users even notice them.

The Outage You Didn't See Coming: How to Discover and Monitor Certificates Proactively

Progress WhatsUp Gold Certificate Discovery and Monitoring is a seamless capability included out of the box. It’s a proactive safeguard designed to help you spot certificate issues before they escalate into business problems.

Grafana Mimir: 3 reasons to run the TSDB for Prometheus on bare metal

Wilfried Roset is an engineering manager who leads an SRE team, and he is a Grafana Champion. Wilfried currently works at OVHcloud where he focuses on prioritizing sustainability, resilience, and industrialization to guarantee customer satisfaction. Whether it’s for efficient resource allocation, flexibility, high availability, or scalability, it makes a lot of sense to run Grafana Mimir on Kubernetes—but it’s not the only way to deploy Mimir.

Alerting Best Practices

A firing alert is like someone ringing your doorbell - it demands your immediate attention, interrupting whatever else you’re doing. It requires focus and a quick response. But imagine trying to live in an apartment where the doorbell never stops ringing. You could put in earplugs to block the noise, but that only masks the problem - it doesn’t solve it. On the other hand, disconnecting the doorbell entirely isn’t a solution either.

When Milliseconds become Make-or-Break, Fragile Ops are a Brand Liability

A major studio drops its new episode at midnight. Millions are queued to watch. Push notifications hit, the app surges in traffic, and then: timeout. Spinning wheels. Frozen screens. Social media lights up. Customers don’t just notice; they remember. For today’s communications, media, and information (CMI) brands, digital reliability is the product. Viewers, subscribers, and enterprise users aren’t comparing your uptime to industry benchmarks.

Why Your IT Copilot Needs Context, Not Just Data

In the rush to adopt AI in IT operations, many organizations focus on feeding copilots as much data as possible. But here’s the problem: data without context is just noise. An IT copilot that can’t distinguish what matters from what doesn’t won’t reduce alert fatigue or accelerate troubleshooting.

Debugging Slow PHP Applications with APM Tools

A slow PHP application in production is not just a performance issue; it poses a significant risk to business operations and user satisfaction. Slow page loads frustrate users, increase bounce rates, and directly impact revenue. For developers, the bigger challenge is that these slowdowns often hide deep in the code, database queries, or external dependencies, making them hard to find.

Instrument your Azure Container Apps workloads with the new Datadog Agent sidecar

Modern application development is evolving rapidly, with serverless containers and microservices becoming the standard for scalable, resilient architectures. Azure Container Apps is at the forefront of this movement, enabling developers to deploy containerized applications without having to manage infrastructure.
Sponsored Post

Atlassian Bitbucket Monitoring on Microsoft SCOM

As part of a customer project, we developed a custom Bitbucket Management Pack for Microsoft System Center Operations Manager (SCOM). This tailored solution enables IT operations teams to monitor key performance and health metrics of Bitbucket environments, ensuring planning and bug-tracking platforms remain available and performant. With this Use Case paper, we aim to share our knowledge with the SCOM community, highlighting the possibilities of advanced monitoring on Microsoft SCOM and helping teams improve their day-to-day tasks.

Common Issues in PHP Applications and How Monitoring Tools Help

PHP has been powering the web for over two decades and continues to be a dominant server-side scripting language. From small business websites to massive enterprise applications, PHP sits at the heart of many critical digital experiences. But with great popularity come great responsibility and challenges. Performance bottlenecks, security vulnerabilities, and inefficient coding practices can cripple applications, frustrate end-users, and burn out engineering teams.

Why SaaS Startups Need PHP Application Monitoring for Scalability

For SaaS startups, speed and reliability are everything. A few seconds of downtime or slow performance can turn away users, impact sign-ups, and directly affect revenue. Unlike traditional apps, SaaS platforms operate on an always-on model, which means performance and scalability must be built in from day one. PHP remains one of the most popular choices for startups due to its flexibility, cost-effectiveness, and fast development cycle.

How To Visualize Your Sales Data: Salesforce Enterprise Data Source for Grafana

Learn how to monitor your organization's sales performance by connecting Salesforce with Grafana! In this quick-start tutorial, Shawn Pitts walks you through everything — from setting up your Salesforce connection to visualizing real-time data in Grafana. Whether you’re on a free Grafana Cloud plan, a paid tier, or running Grafana Enterprise on-prem, you’ll see exactly how to unlock powerful dashboards for your team.

Your Help Desk Can Be a Powerful Ally in Maintaining HIPAA Compliance

Each industry has standards and regulatory compliance concerns. The health care industry arguably has the most well-known, thanks to the Health Insurance Portability and Accountability Act (HIPAA) and its efforts to keep electronic protected health information (ePHI) safe. HIPAA compliance is essential for organizations that store, maintain or transmit ePHI, and staying on top of HIPAA regulations can be challenging.

Fix It Fast: Tips, Tricks & Tools for Sumo Logic Success -- Customer Brown Bag -- August 21st, 2025

Led by Sumo Logic experts Andrei and Austin, this session dives into troubleshooting dashboards, silent failure scenarios, and missing collector data—helping your team spot blind spots, catch incidents you never knew you missed, close visibility gaps, and ensure dashboards reflect the full picture for faster resolution.

What Is Vector Search? Difference Between Vector & Semantic Search Explained [Quick Question Ep. 5]

What is vector search? In this breakdown, learn how vector search leverages machine learning to capture the meaning and context of unstructured data by transforming it into a numeric representation that is stored in a vector database. This video also explains the difference between sparse and dense embeddings, and how vector search differs from semantic search and lexical search.

How to Reduce Downtime by 90% with Proactive Monitoring Strategies

Downtime costs businesses an average of $5,600 per minute according to Gartner research. For many organizations, even a few hours of unplanned outages can mean lost revenue, damaged reputation, and frustrated customers. The good news? You can reduce downtime by up to 90% by implementing the right proactive monitoring strategies.

How to go from ingestion to insights in 10 minutes

When assessing SaaS observability solutions, customers often explore features that are built into the platform, but there is also a whole collection of deployable libraries across all SaaS vendors. In Coralogix, we lead the way in deployable assets, with 4400+ alerts, dashboards, parsing rules, metric generation rules and more. But why should you care about these deployable assets, and why do they accelerate insight generation so profoundly?

The Smartest Member of Your Developer Ecosystem: Introducing the Mezmo MCP Server

Building a great developer experience is about more than just the code. It’s about creating a unified ecosystem where your tools work together seamlessly. That’s been the vision behind our work on the Mezmo MCP Server, and I’m excited to share it with you. At its core, the MCP Server is a universal remote for your data pipeline.

Announcing Monitor Grouping in UptimeRobot

We’re excited to introduce Monitor Grouping, a new way to organize your monitors directly from the UptimeRobot dashboard. This feature makes it easier to keep track of large sets of monitors and quickly see the health of related services at a glance. Monitor Grouping is available on Solo, Team, and Enterprise plans starting today.

APM Logs: How to Get Started for Faster Debugging

When application performance monitoring detects a spike in latency or error rates, the immediate challenge is determining the underlying cause. APM logs address this by correlating performance metrics with the specific log events that occurred at the same time. Instead of switching between monitoring dashboards and manually searching through log files, APM log correlation consolidates both views.

Anomaly detection explained: Why your monitoring needs it

Anomaly detection goes beyond fixed thresholds to catch the issues your monitoring might miss—like unusual latency spikes, sudden drops in traffic, or odd system behavior that doesn’t throw an error. In this video, we explain: With Site24x7’s AI-powered monitoring, anomaly detection is built-in—helping DevOps teams move from reactive fixes to proactive observability.
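
To make the contrast with fixed thresholds concrete, here is a minimal sketch that compares each point against a rolling window of recent values. It uses a simple z-score rule chosen for illustration; it is not Site24x7's method, and the function names, window size, limits, and sample traffic numbers are all assumptions.

```python
from statistics import mean, stdev


def fixed_threshold_alerts(values, limit):
    """Indexes where the value exceeds a static upper limit."""
    return [i for i, v in enumerate(values) if v > limit]


def zscore_anomalies(values, window=5, z_limit=3.0):
    """Indexes where a value deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all is unusual.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_limit:
            anomalies.append(i)
    return anomalies


# Hypothetical requests-per-minute series with a sudden drop at index 6.
requests_per_min = [500, 510, 495, 505, 498, 502, 120, 499]

static_hits = fixed_threshold_alerts(requests_per_min, limit=1000)
rolling_hits = zscore_anomalies(requests_per_min)
print(static_hits, rolling_hits)
```

The static rule only watches for values above 1000, so the drop never triggers it, while the rolling comparison flags index 6 because 120 is far outside what the previous five minutes looked like.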

Monitor your mobile apps with Site24x7 Mobile real user monitoring (RUM)

Get end-to-end visibility into how your apps perform in the real world. Quickly detect app crashes, start-up delays, slow API calls, and performance issues after updates. Drill down by device, OS, network, or geography to troubleshoot faster and deliver seamless user experiences. Key highlights in this video: Stay ahead of performance issues and keep your users happy with Site24x7 Mobile RUM.

How Product Managers Can Benefit From Honeycomb

Observability tools like Honeycomb are built for engineers, not PM teams… but that doesn’t mean there’s no benefit to having your PMs in Honeycomb. Whether it’s debugging a weird customer issue or tracking how a feature is used in the wild, observability gives PMs something traditional product tools can’t: real-time answers with full context, down to a single user.

Visualize Salesforce data in Grafana: flexible query options, powerful data correlations, and more

As part of our big tent philosophy at Grafana Labs, we think you should be able to dig into your data and find meaningful insights — wherever that data happens to live. For many of our users, that data lives in Salesforce, the cloud-based customer relationship management (CRM) platform. In this post, we’ll take a closer look at how you can use the Salesforce Enterprise data source for Grafana to quickly and easily visualize your Salesforce data using Grafana dashboards.

Creating & Scheduling SLA Reports

Learn how to create an SLA report on Uptime.com to track uptime and performance. This guide walks you through configuring the report and selecting the right checks and date range. Get detailed metrics on uptime and response times, helping you meet service goals and client expectations with ease. Scheduled SLA reports in Uptime.com let you automatically send PDF or XLS reports to up to 100 recipients, including both Uptime.com users and external email addresses. You can schedule reports for daily, weekly, monthly, quarterly, or yearly delivery.

Reduce PHI Risk Exposure With a Strategy That Supports HIPAA Compliance

Health Insurance Portability and Accountability Act (HIPAA) compliance is about more than firewalls and passwords. Your file-sharing solutions could be the weakest link in protecting sensitive patient data. When we think about healthcare cybersecurity, we tend to focus on large systems: electronic health records, databases, and billing platforms. But one everyday workflow that’s just as vulnerable – and often overlooked – is file transfer.

How Much Time Could You Save with Network Config Automation?

If you’re a network admin reading this, you already know the feeling. You’ve probably lost track of how many hours you spend doing the same repetitive tasks week after week, month after month. Backing up configs manually. Rolling back failed changes at 2 AM. Hunting down that one switch that somehow lost its configuration. Compiling compliance reports that should take minutes but somehow eat up your entire afternoon. Yet all those “quick” tasks add up.

Energy Monitoring and Targeting: Saving Costs Through Proactive Billing Software

In today's energy-conscious world, businesses and utility providers alike are seeking smarter ways to manage costs, improve efficiency, and promote sustainability. Energy monitoring and targeting (M&T) has emerged as one of the most effective strategies to achieve these goals. By combining accurate monitoring with actionable insights, organizations can identify inefficiencies, reduce waste, and lower utility expenses.
Sponsored Post

Status Page Aggregator: How To Stay Ahead of Outages in 2025

Outages happen, and they often catch us off guard. If your team relies on multiple status pages to track cloud infrastructure, SaaS tools, or distributed systems, staying ahead of outages is essential. It's far better to know about issues with your services or dependencies before your users do, so you can act fast and stay in control. That's where a status page aggregator like StatusGator comes in.

Don't Just Monitor SLAs - Validate Them Automatically

Service level agreements (SLAs) are the contractual backbone between customers and technology vendors, outlining expected service availability, performance metrics, and remedies like service credits when service providers fail to meet agreed-upon service levels. This service agreement assures both the technical quality as well as the service quality of the services provided, and underpins the value perspective of the client.

Log Files Explained: Types, Uses, and Best Practices for IT Teams

Every system in your environment—cloud, on-prem, or hybrid—generates log files. They capture everything from user actions to system failures, security events, and performance issues. But with so many log types and so much raw data, it’s easy to get buried in noise and miss what matters.

How we saved $1.5 million per year with Cloud Cost Management

In collecting and analyzing trillions of events each day, Datadog ingests a massive amount of data. We spend substantially to process and store this data in the cloud, and teams across the organization are committed to optimizing the return on this investment. To this end, our FinOps analysts have always tracked the costs of delivering our services and identified opportunities for savings.

How our engineers use AI for coding (and where they refuse to)

Okay, picture this: if you drew a Venn diagram of folks in tech right now, it'd probably look something like this: You'll probably find yourself in one of those circles, right? I’m guilty of falling in the intersection! Because let's be real, the 'will AI replace developers by 20xx?' debate is everywhere – Reddit, Hacker News, team Slack and even your local cafe. Well, we decided to go straight to the source.

Datadog governance 101: From chaos to consistency

As your organization scales, managing observability resources and usage becomes increasingly important. More users and teams mean more dashboards, tags, API keys, and costs to manage. The job of keeping track of these resources and ensuring that they’re compliant can quickly grow in complexity.

Discover Infrastructure: Kubernetes & Hosts - Launch Week / Day 03

Stop debugging infrastructure issues across multiple dashboards. See how Last9's Discover Infrastructure monitors K8s pods and traditional hosts together—with resource analysis, pod-level debugging, and AI that correlates app problems to infrastructure root causes. One setup (K8s + host monitoring) → Complete infrastructure visibility that connects to your services and jobs. No more blind spots between application performance and underlying resources.

A Detailed Guide to Azure Kubernetes Service Monitoring

Azure Kubernetes Service (AKS) continuously generates a high volume of telemetry, ranging from node-level CPU and memory usage to request latencies and error rates within individual pods and services. Without a structured monitoring strategy, this flood of metrics can easily become noise, leaving teams blind to early warning signs. Effective monitoring in AKS is about identifying the right signals, correlating them across layers, and acting before they impact application performance or cluster stability.

Your Apps Are Green. Your Infrastructure Is Dying.

Launch Week Day 3: Introducing Discover Infrastructure Your dashboard looks perfect. APIs responding in 80ms, background jobs processing smoothly, error rates at 0.02%. Everything's green. Then production breaks. "Why is checkout so slow?" "The payment service keeps timing out!" You run kubectl get pods and discover payment-service pods restarting every 3 minutes due to OOM kills. Then you check your database host—CPU at 98% because someone forgot the new ML training job runs there too.

Grafana Cloud updates: onboard teams with new AI-powered tooling, secrets management for enhanced security, and more

We consistently roll out helpful updates and fun features in Grafana Cloud, our fully managed observability platform powered by the open source Grafana LGTM Stack (Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics). In case you missed them, here’s our monthly round-up of the latest and greatest Grafana Cloud updates. You can also read about all the features we add to Grafana Cloud in our What’s New in Grafana Cloud documentation.

Secure credential storage for your observability stack: Introducing secrets management in Grafana Cloud

The more your infrastructure grows, the more likely you are to face a familiar challenge: where to safely store the API keys, passwords, and tokens that power your observability stack. Unfortunately, a common response to this dilemma is to scatter credentials across configurations, making security and management of secrets increasingly complex.

React Native performance tactics: Modern strategies and tools

This is a guest post by Simon Grimm, founder of Galaxies.dev, a platform dedicated to helping developers master React Native through hands-on courses, expert guidance, and personal support. React Native performance matters more in 2025 than ever before. With the New Architecture now stable and apps competing against lightning-fast native experiences, users expect sub-second load times and buttery-smooth 60fps interactions.

Extending Unit-Testing on Icinga2

Obviously nobody is disagreeing with this. It’s just that during ongoing development, while the focus is on features and bug fixes, testing often falls behind in priority. Teams can be especially hesitant to invest the time when developers would need to write tests for existing or legacy code. C++ applications have to run on a diverse set of target environments, varying in OS, compilers, C/C++ standard libraries, and dependency versions.

Supercharge your Android app

In today’s technological landscape, mobile applications are on the rise, boosting efficiency, portability and accessibility in daily life, across a spectrum of industries, from financial services to food delivery. As mobile apps become more essential, the quality of their features, performance, and user experience is critical.

vmanomaly Deep Dive: Smarter Alerting with AI (Tech Talk Companion)

I was thrilled to host our latest tech talk, where we got to do a deep dive into vmanomaly with the best possible guests: Fred Navruzov, the actual team lead for the product, and Co-Host, Matthias Palmersheim. We covered a ton of ground, from high-level concepts to the nitty-gritty of configuration. For everyone who couldn’t make it, I wanted to share my personal recap of the most important technical takeaways from our conversation.

Proactive Observability - Predictive Analytics Models and Algorithms for IT Systems and Metrics

Predictive Analytics Models and Algorithms are an important component of eG Enterprise’s AIOps engine for proactive observability. eG Enterprise collects and analyses metrics, events, logs and traces. This data, including real usage data, is used to make intelligent predictions that forecast future system behavior and IT resource metric levels.

What's Hiding in Your Wiring Closets?

Let's be provocative for a moment. You probably don't know what is actually on your network. You have the CMDB, spreadsheets, diagrams from the last big refresh, and the institutional knowledge of your veteran engineers. But is this information accurate? Is it complete? Answering that question with absolute certainty can be difficult for many who manage complex IT environments.

Every second of digital downtime has a cost.

When a site disruption hits, businesses face immediate and visible fallout: customer churn spikes, and revenue takes a direct hit. If customers can’t transact, your bottom line suffers, plain and simple. This insight comes from a recent Forrester survey commissioned by Catchpoint, where respondents revealed the real business impacts of Internet disruptions.

Nginx Logs & Performance Monitoring with Loki and Telegraf | MetricFire

When a web service slows down or errors spike, metrics can tell you what changed (active connections rise, error rate increases), but the root cause can sometimes be found in your logs (which IPs are hammering POST endpoints, 4XX/5XX occurrences). Put the two together and you get the full observability picture: time-series metric trends to spot incidents, and line-level details to fix them fast.

The Real Cost of Choosing the Wrong Database

Data is more than a record of what happened—it shapes what happens next. Across industries, connected devices continuously stream time-stamped data that reflects the current state of machines, environments, and systems. This steady flow gives organizations a live view of operations and the ability to catch issues early, adjust quickly, and operate more efficiently. However, capturing data alone does not create value.

Incident post-mortems: the complete, blameless guide

Most companies run post-mortems like autopsies. They dissect the corpse, assign blame, and file it away. The body count keeps rising. Here's what actually works: post-mortems as learning machines. Systems thinking over finger-pointing. Patterns over pain. What you'll get: A copy-paste template, real metrics that matter, and the mindset shift that turns outages into intelligence. Who this is for: SRE leads tired of repeating incidents. Engineering managers who want learning over theater.

Part Two - Event Intelligence vs. AIOps: Key Differences, When to Use Each and Why

The IT environments of large enterprises have become so complex that operational teams have turned to two solution categories in particular to help them improve visibility, speed up incident response, automate, and make more effective decisions.

How PHP Monitoring Helps Prevent Bugs in Production?

When a PHP application hits production, the stakes are high. Even a minor bug can escalate into downtime, data loss, or frustrated customers. For developers, DevOps teams, and SREs, the real challenge is not just writing efficient code but ensuring that the application continues to run flawlessly in production. This is where PHP monitoring tools play a critical role.

A Practical Guide for Developers: Preventing PHP Mistakes with Performance Monitoring

Performance is one of the most critical aspects of any PHP application. A few seconds of delay or an unnoticed bottleneck can cause users to leave your site, increase bounce rates, and reduce business conversions. For developers, ensuring top performance is not always easy. Small coding mistakes and inefficient queries can accumulate into major problems over time. Without visibility into what’s happening inside the application, it becomes difficult to identify the root cause of slowdowns or failures.

Your APIs Are Green. Your Background Jobs Are Dying.

Launch Week Day 2: Introducing Discover Jobs Your dashboard looks perfect. APIs responding in 80ms. Error rates at 0.02%. Kubernetes pods healthy. Everything's green. Then Slack explodes: "Why didn't my invoice generate?" "Where's my password reset email?" "The data export I requested yesterday is still processing?" You check your job queue. Sidekiq dashboard shows 47,000 jobs processed today. Redis looks fine. Workers are running. But somehow, your business logic is silently falling apart.

Kafka Performance Crisis: How We Scaled OpenTelemetry Log Ingestion by 150%

When your telemetry pipeline starts falling behind, the countdown to production impact has already begun. One Bindplane customer operating a large-scale log ingestion pipeline built on the OpenTelemetry Collector and Kafka hit that breaking point. Instead of keeping pace with incoming data, their pipeline was ingesting just 12,000 events per second (EPS) per partition/collector—and this Kafka topic had 16 partitions. In aggregate, that was roughly 192K EPS.
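
The aggregate figure follows directly from the per-partition rate and the partition count quoted above. A short check, assuming load is spread evenly across partitions (the function name is made up for this example):

```python
# Figures quoted in the excerpt above.
EPS_PER_PARTITION = 12_000
PARTITIONS = 16


def aggregate_eps(eps_per_partition: int, partitions: int) -> int:
    """Total events per second across all partitions, assuming an even load."""
    return eps_per_partition * partitions


total = aggregate_eps(EPS_PER_PARTITION, PARTITIONS)
print(f"{total:,} EPS")  # 192,000 EPS
```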

Early Warning Signals now available via Webhooks

We’re excited to announce that Early Warning Signals — proactive alerts that notify you of potential service issues before official acknowledgment—are now fully supported in StatusGator Webhooks. With Early Warning Signals delivered through your webhook integrations, you can detect early signs of trouble and act before a full incident is posted. This means more time to prepare, fewer surprises, and better uptime for your customers.

Pioneering DEX Agents and Benchmarks

At Nexthink, our focus is Digital Employee Experience (DEX). It’s all we do, and all we aim to be the very best at. Today, we have a unique opportunity to deliver the world’s most advanced DEX models and agents, fine-tuned and trained specifically on real DEX use cases from our thousands of users. This matters because, in our vision, most IT operations will eventually be fully automated by AI and technology.

Elastic Powers GitHub's Seamless Developer Experience

David Tippet, Search Engineer at GitHub, shares how Elastic powers GitHub’s massive search platform and enables a seamless developer experience. He explains how GitHub balances AI-driven semantic search with traditional keyword search, ensuring accuracy for millions of diverse users, from engineers to security researchers.

IT Security and Compliance Guide

This guide provides a comprehensive overview of IT compliance and the part it plays in IT security. It will also help you choose the right compliance reports tool for your company. As you get started, SolarWinds Security Event Manager (SEM) comes highly recommended as a near-automated IT security compliance solution that enables you to verify IT compliance and helps you perform many compliance-related IT operations.

10 Best PCI Compliance Software and PCI DSS Tools

PCI DSS is an industry security standard existing primarily to minimize the risk of debit and credit card data being lost. This is in the interest of both the customer and the merchant, because if data is lost or misused, the merchant could be subject to legal action. To protect yourself and your customers, you first need to understand the six PCI DSS control objectives and how to meet them.

Ultimate Guide to PCI DSS Compliance Requirements

When you make a credit card transaction, the last thing you want to think about is your data getting stolen. Fortunately, credit card companies put several measures in place to make sure this doesn’t happen. For businesses dealing with customer payments, PCI DSS compliance measures are a simple and necessary step in making sure customer credit card data is well protected. Ensuring PCI compliance can be a complex undertaking.

Why Alert Fatigue is a Major Challenge in Observability (2025 Survey Insights) | Grafana Labs

Over 1,200 engineers, leaders, and teams shared their biggest observability challenges in our third annual Observability Survey, and the results are in. In this video, Marc Chipouras (Head of Emerging Products, Grafana Labs) breaks down the top insights.

The Observability Problem Isn't Data Volume Anymore-It's Context

For years, the observability industry has been obsessed with one thing: data volume. We've built incredible pipelines, optimized agents, and scaled storage to handle petabytes of logs, metrics, and traces. The promise was simple: collect more data, get more visibility. But we've hit a wall.

How to monitor Claude usage and costs: introducing the Anthropic integration for Grafana Cloud

Generative AI is becoming a core part of modern applications, making it essential to monitor and manage how these services are used. That’s why, today, we’re excited to introduce the Anthropic integration for Grafana Cloud, a new solution that lets you connect directly to the Anthropic Usage and Cost API from within Grafana Cloud.

Improving the Developer Experience by Monitoring Third-Party Outages

The role of third-party SaaS and cloud services in the modern software development stack needs no explanation. Primarily due to the ease of setting up and hooking them together, they make the software development lifecycle (SDLC) much easier than it was 10 years ago. No more managing the overhead of installing, configuring, maintaining, backing up, and scaling of source code repos, virtual machines, and CI/CD systems. Some services don't have any in-house options, e.g. payment gateways.

From SEO to AEO: Why Web Performance Is the Key to AI Search Success

Search isn’t what it used to be. The way people discover information online is shifting. Instead of clicking through search results, many now ask AI answer engines like ChatGPT and Perplexity to do the research for them. In March 2025, 13.1% of Google desktop searches featured AI Overviews, doubling from over 6% in January, according to a Semrush analysis of 10+ million queries.

What is Real User Monitoring

Real User Monitoring (RUM) measures how real users interact with your application in production. Unlike synthetic monitoring, which relies on scripted tests, RUM collects data from actual sessions. This means performance is observed across different devices, networks, and usage patterns. The result is a clear view of how the application behaves under real conditions, where latency is introduced, which features take longer to load, and at what points users drop off.

AI-Driven Application Monitoring with Checkly and Claude Code

In this webinar, Stefan Judis (Developer Relations at Checkly) and Dan Giordano (VP of Marketing at Checkly) dive into how LLMs and AI tools can be used with application monitoring. You’ll see live demos of integrating Claude Code, Playwright MCP, and Checkly’s Monitoring as Code.

Why (Enriched) Flow Data Belongs in Every Network Operator's Daily Toolbox

Flow data has always held immense potential, but was often inaccessible because it lacked context and speed. Kentik removes that friction by automatically enriching flow with human-readable context, making it a daily driver for everyone, not just specialists.

How ScienceLogic Supports Zero Trust and FedRAMP-Secure Operations

Cybersecurity leaders across the public sector are facing a moment of reckoning. Whether at the Department of Defense, a federal agency, or a public university, IT teams are under pressure to defend sprawling infrastructure, detect fast-moving threats, and prove compliance across multiple frameworks—all with fewer resources and tighter timelines. This challenge has accelerated interest in Zero Trust Architecture (ZTA), a paradigm shift in how we think about security.

Tracking Errors in Absinthe for Elixir with AppSignal

GraphQL provides a powerful approach to building APIs, and Absinthe is the leading GraphQL implementation for Elixir applications. While GraphQL offers many benefits, it can introduce a set of errors and performance bottlenecks that might be challenging to track and debug. In this article, you’ll learn how to use AppSignal to monitor, debug, and resolve errors in your Absinthe-based GraphQL API.

How to use AI tools more effectively: Tips from Datadog Engineers

A growing number of engineering organizations have adopted or are trialing agentic AI-based coding tools and LLMs in an effort to increase their teams’ development velocity. If you’re a developer, this means you’ve likely had to try out different agentic tools and models and determine how to best incorporate them into your existing workflows.

How to Build a Strategic Roadmap for Site Reliability Engineering Implementation

Getting your site reliability engineering solutions in place can seriously boost how your systems perform. But implementing site reliability engineering (SRE) isn't a simple flip of a switch; it's a process. If you want to keep your systems running smoothly, with minimal downtime and top-notch performance, you need a solid, strategic plan. This roadmap should guide you step-by-step, from setting clear goals to constantly improving your processes.

Major Opportunities and Technologies in Business HVAC Operation

Commercial HVAC systems are the backbone of comfort, energy efficiency, and indoor air quality in buildings. Office buildings, manufacturing plants, and many other facilities depend on these systems to maintain suitable environmental conditions. Yet commercial HVAC operations have their challenges as well, and a new wave of technologies is enabling operators to meet them.

Investigate Problems With Mobile Frontend Observability

You can use your mobile tools to debug errors, but are you really looking at the root cause? With end-to-end observability, powered by Honeycomb's Mobile Android and iOS SDKs, you can see everything! We'll show you how to start from a mobile launchpad, view the errors, select a trace, and find that root cause.

Monitor Claude usage and cost data with Datadog Cloud Cost Management

Managing the cost of foundation models is a critical challenge as AI adoption surges, particularly for teams using powerful models like Anthropic's Claude Opus and Claude Sonnet. Growing teams generate larger prompt volumes and escalating model complexity, making it difficult to have clear visibility, accountability, and control of cloud AI spending.

The Starlink Outage and Its Impact on Community Gateways

Last month, Starlink suffered its largest outage in years, arguably its biggest since becoming a major internet provider. In addition to the millions of individual customers around the world, the outage disconnected the Community Gateways, customers of Starlink’s new transit service. In this post, we delve into the outage and its impact on these far-flung networks.

The Service Discovery Problem Every Developer Knows (But Pretends Doesn't Exist)

Launch Week Day 1: Introducing Discover Services. Picture this: It's 2 AM, alerts are firing, and you're staring at a dashboard trying to figure out which service is causing the cascade of failures. Your service map is a six-month-old Miro board, and you have no idea what's actually talking to what in production right now. If you've been there, you're not alone. In fast-moving teams, new services get deployed faster than you can track them.

What is the User Lifecycle & How Can IT Teams Manage It?

It’s Monday morning, and a new hire is walking into the office for their first day. Before they can dive into the work, they need access to email, project management tools, cloud storage, and a dozen other SaaS apps their role depends on. IT has already been hard at work behind the scenes, provisioning accounts, assigning permissions, and making sure everything is ready the moment they sign in.

The 15 Best DevOps Monitoring Tools for Lightning-Fast Incident Response

When incidents strike, every second counts. The difference between a minor hiccup and a major outage often comes down to how quickly your team detects and responds to issues. That's why choosing the best DevOps monitoring tools for incident response can make or break your operational excellence. Modern DevOps teams need more than just basic uptime checks.

What is SNMP (Simple Network Management Protocol)?

The Simple Network Management Protocol (SNMP) sure does pack a punch for something with “simple” in its name, as it literally provides the lifeblood of network monitoring and device communications. Network admins rely heavily on SNMP because nearly every technology manufacturer supports the protocol. And, in turn, it enables them to collect information, configure devices and receive alerts about network performance and issues.

Choosing the Right PHP Monitoring Tools: A Practical Guide

When it comes to building fast, reliable, and user-friendly PHP applications, performance and stability are everything. A small slowdown in load times, a memory leak, or unhandled errors can frustrate users, impact revenue, and harm your brand’s reputation. This is why PHP Application Monitoring has become a necessity for businesses of all sizes.

Honeycomb Launches Integration With the Anthropic Usage and Cost API

If your organization is anything like ours, then you’ve probably embraced using large language models like Claude. Just last week, we gave all Honeycomb employees access to Claude. Now, developers can generate AI-assisted code, product managers can perform analysis on customer usage trends, marketers can test messaging, sales can do customer discovery and we are shipping AI-powered features to improve user experience.

Scale Your Monitoring Solution With the VictoriaMetrics Ecosystem

When it comes down to scaling time series monitoring solutions, things can get messy. That’s one of the reasons why VictoriaMetrics, a Silver member of the Cloud Native Computing Foundation (CNCF), started its journey some years ago. It is a simple, reliable and efficient set of Observability Solutions that's been adopted by many organizations. It's open source, with a strong community behind it, with enterprise and managed (Cloud) options for those who need support. VictoriaMetrics plays well with many standards, including Grafana and OpenTelemetry. Apart from that, in case you didn’t know, VictoriaLogs is the new kid on the block that's seriously outperforming other solutions. In this presentation, we’ll present the VictoriaMetrics Open Source projects and how they differ from other solutions, especially when it comes to scaling from single small setups to massive cluster deployments. Come learn how VictoriaMetrics projects can help to ease Observability!

Run Checkly Monitors Against Multiple Environments

Learn how to run Playwright tests across different environments without rewriting them. This tutorial covers managing environment variables in Checkly for API and browser checks, handling global and group-specific settings, and integrating with CI/CD processes. Discover the best practices for setting up environment variables, duplicating test groups, and customizing alerts to ensure your checks are environment-specific.

Building a K12 IT Command Center: Monitor All Your Educational Services

Managing technology in K-12 schools has become increasingly complex. With dozens of educational platforms, administrative systems, and communication tools running simultaneously, IT teams need a comprehensive K-12 IT monitoring dashboard to maintain visibility across their entire technology ecosystem.

How to Effectively Monitor Kubernetes in 2025

As Kubernetes environments continue to grow in scale and complexity, having a robust monitoring strategy is no longer just good practice; it’s essential for survival. For engineering teams in 2025, effective monitoring and observability are the bedrock of performance, reliability, and cost control. This guide dives into the critical aspects of modern Kubernetes monitoring, from key metrics to the top tools/frameworks and the rising role of AI in managing these complex systems.

Taming Alert Chaos: Modern Incident Alert Management Strategies

Every IT team knows the feeling: your phone buzzes at 3 AM with yet another alert. Is it critical? Can it wait until morning? With dozens of monitoring tools and hundreds of potential failure points, incident alert management has become one of the most challenging aspects of maintaining reliable systems.

Why SSL Certificate Verification Failed: All Causes, Fixes & Prevention

SSL Certificate Verification Failed errors are one of the most common and frustrating issues for developers, DevOps engineers, and system administrators. Whether you're building a Python application, running a Docker container, or managing a web server, this guide will help you.

Integrating Deno and Grafana Cloud: How to observe your JavaScript project with zero added code

Andy Jiang is a JavaScript engineer with nearly 10 years of experience. He’s interested in making JavaScript and TypeScript simpler to use. He currently works at Deno as a product marketing manager. Outside of work, Andy likes cooking, writing, and playing tennis. Observability is essential for modern applications. Metrics, logs, and traces allow you to troubleshoot production issues, monitor performance, and understand usage patterns.

Scale Observability, Streamline Operations with AppNeta Monitoring Policies

In today's sprawling enterprise environments, keeping the network running smoothly isn’t just a technical hurdle—it’s a logistical marathon. Enterprise IT environments are in constant motion. New employees come on board. Contractors rotate in and out. Departments roll out new tools. Corporate offices expand, consolidate, or close. And users demand flawless connectivity from wherever they are.

How ScienceLogic Drives FedRAMP-Authorized Automated IT at Scale

As government agencies modernize IT operations, many are adopting hybrid cloud and multi-tenant environments to drive agility and resilience. But as environments scale, so does complexity, especially when aligning with overlapping frameworks like FedRAMP, NIST, and CMMC. Today’s cybersecurity landscape, with its rising threats, shrinking budgets, and expanding compliance demands, requires more than manual oversight.

All Network Monitoring Tools Are Created Equal, Right?

There’s a question I hear quite often in my conversations about network management: "Aren't all network monitoring tools basically the same?" Honestly, I understand why so many people feel this way. For as long as I remember, the primary role of these tools has been to tell you when something is already broken. Your team gets an alert—a switch is down, an application is slow, a circuit is saturated—and the fire-fighting process begins.

How to Monitor Multiple School Platforms: Google Workspace, Canvas, and PowerSchool from One Dashboard

Managing technology in K12 schools means juggling dozens of critical platforms simultaneously. When Google Workspace goes down during morning classes, Canvas experiences issues during exam submissions, or PowerSchool becomes unavailable during grade entry periods, the impact ripples through entire school communities. The ability to monitor multiple school platforms from a centralized dashboard has become essential for educational IT teams.

How to Adjust Semantic and Lexical Search Weights in Elasticsearch

In this session, we’ll show you how hybrid search using Elastic lets you assign weights to different search types, for example, giving semantic search three times more influence than lexical search. This lets you fine-tune the balance between precise keyword matching and broader, context-aware results.
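
To make the weighting idea concrete, the sketch below blends two relevance scores with a 3:1 weight in favor of semantic search. It is a generic weighted average, not Elastic's API: the function name, the assumption that both scores are already normalized to [0, 1], and the sample scores are all hypothetical.

```python
def hybrid_score(semantic: float, lexical: float,
                 semantic_weight: float = 3.0,
                 lexical_weight: float = 1.0) -> float:
    """Weighted average of two relevance scores assumed to lie in [0, 1]."""
    total_weight = semantic_weight + lexical_weight
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return (semantic_weight * semantic + lexical_weight * lexical) / total_weight


# Hypothetical documents with made-up (semantic, lexical) scores.
docs = {
    "doc_a": (0.9, 0.2),  # strong semantic match, weak keyword match
    "doc_b": (0.3, 1.0),  # weak semantic match, exact keyword match
}

# Rank documents by blended score, highest first.
ranked = sorted(docs, key=lambda name: hybrid_score(*docs[name]), reverse=True)
print(ranked)
```

With the 3:1 weighting, `doc_a` ranks first; with equal weights the order flips, which is the kind of trade-off the weights are meant to control.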

How Elastic Powers Search in Real-Time (Explained in 52 Seconds)

Ever wondered how Wikipedia loads answers instantly? Or how your Uber updates in real time? That’s Elasticsearch working behind the scenes. In this video, I break down how Elastic powers lightning-fast, scalable search for complex data, from ride requests to stock prices.

Visualize Logs Alongside Metrics: A Complete Guide for Monitoring Slow MySQL Queries

When a service slows down, metrics will tell you that it’s happening, but logs tell you why. For MySQL, slow queries can be a silent performance killer, gradually chewing through resources until users start complaining. By enabling MySQL’s slow query log and forwarding it to Loki (via Promtail), you can visualize query-level details right alongside your metrics on Grafana dashboards. This makes it easy to correlate what is slow (metrics) with what is causing the slowdown (logs).
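
As a rough illustration of the query-level detail a slow query log carries, the sketch below pulls the timing fields out of one log header line. The sample line is made up, the function name is hypothetical, and the regular expression assumes the common `# Query_time: ... Lock_time: ... Rows_sent: ... Rows_examined: ...` header layout; it is not the parsing used by Promtail or Loki.

```python
import re
from typing import Dict, Optional

# Matches the timing header that precedes each statement in the slow query log.
HEADER = re.compile(
    r"#\s*Query_time:\s*(?P<query_time>[\d.]+)\s+"
    r"Lock_time:\s*(?P<lock_time>[\d.]+)\s+"
    r"Rows_sent:\s*(?P<rows_sent>\d+)\s+"
    r"Rows_examined:\s*(?P<rows_examined>\d+)"
)


def parse_slow_log_header(line: str) -> Optional[Dict[str, float]]:
    """Return the timing fields from a header line, or None if it is not one."""
    match = HEADER.search(line)
    if match is None:
        return None
    return {
        "query_time": float(match["query_time"]),
        "lock_time": float(match["lock_time"]),
        "rows_sent": int(match["rows_sent"]),
        "rows_examined": int(match["rows_examined"]),
    }


# Made-up sample line in the assumed layout.
sample = "# Query_time: 2.500000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 100000"
print(parse_slow_log_header(sample))
```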

Real-World Use Cases for Natural Language Copilots

Natural language copilots are one of the most exciting developments in AI for network operations. They allow engineers and operators to query complex environments in plain language rather than memorizing obscure CLI commands or digging through multiple dashboards. But here’s the truth: a copilot is only as good as the AI behind it. Without a purpose-built network LLM, a copilot can’t deliver the accuracy, context, and speed that real-world IT operations demand.

Announcing the Winner of the 2025 StatusGator Women in Tech Scholarship: Lara Djukic

Earlier this year, we launched the StatusGator Women in Tech Scholarship to support and empower women pursuing careers in technology. We are thrilled to announce that our 2025 scholarship recipient is Lara Djukic, an inspiring young technologist whose vision blends innovation with a deep commitment to her community. Through the Bold.org scholarship platform, we’ve awarded Lara a $3,100 scholarship.

The first rule of DORA Metrics...

DORA Metrics are widely regarded as the gold standard for measuring the performance of software development teams. The metrics themselves, though, are generic, high-level pointers – they are not an instruction manual. Adopting the DORA approach is the first step down the path to continuous improvement. The next steps are deciding how the measures should be defined in the context of your own organisation's processes and then figuring out how to retrieve (and present) the relevant data.

Top tips: Beating notification fatigue before it beats you

Top tips is a weekly column where we highlight what’s trending in the tech world today and list ways to explore these trends. This week, we’re looking at the rise of notification fatigue and how to manage alerts so they boost productivity instead of draining it. You’re in the middle of a task, fully focused, when ping!—a new email lands. You glance at it, thinking it’ll only take a second, but by the time you get back to your work, you’ve lost your momentum.

What Is an MCP Server?

If you’ve been following AI development lately, you’ve probably heard whispers about “MCP Servers” floating around developer circles. It’s been around a little while now, and I myself have finally gotten round to using it. Boy, do we need to talk about it. MCP (Model Context Protocol) is Anthropic’s open standard that lets AI assistants connect directly to your tools and data sources, not just static documentation or code snippets.

Inside the Coralogix AI Center: Solving AI's Silent Failure Crisis

Observability has always answered one core question: Is it running? But in the era of LLMs, autonomous agents, and AI-powered workflows, that’s no longer enough. We need to ask a harder, scarier question: Is it right? And right now, most teams can’t answer that. Let’s fix it. In our last post, “The AI Monitoring Crisis No One’s Talking About,” we outlined why prompt injection, hallucinations, and context drift create invisible failures.

Real-Time Status Monitoring for 50+ EdTech Tools K12 IT Teams Actually Use

K12 IT departments face a unique challenge: keeping dozens of educational technology platforms running smoothly while teachers conduct lessons and students complete assignments. A single service outage can disrupt hundreds of classrooms simultaneously. That's why implementing a K12 service status dashboard has become essential for school technology teams managing complex digital learning environments.

Introducing the Coralogix Transactions processor

Coralogix Transactions are a trace segmentation strategy, unique to the Coralogix platform. They allow users to analyze the performance, over time, of a collection of related spans, across billions of traces. Coralogix has introduced a transactions processor into the OpenTelemetry contrib image, enabling users to activate this unique feature using nothing more than OpenTelemetry configuration.

LogRocket - The Ultimate Toolkit for Front-End Insight and Performance

When you need to get beyond surface-level metrics and see what users are actually going through when using your web application, LogRocket provides a potent set of tools. It was built with designers, developers, marketers, ecommerce managers, and website owners in mind. In a nutshell, it combines session replay, error tracking, product-level analytics, and AI-driven insight all in one place.

Sentry MCP server monitoring

We just launched MCP server monitoring in beta. You can instrument most server-side JavaScript SDK based MCP servers with one line of instrumentation code within your MCP SDK implementation using: wrapMcpServerWithSentry(McpServer). See details like protocol usage, client usage, traffic, tool usage, and performance across your MCP implementation so you can get visibility into all the sharp edges that your MCP server has: who’s using it and how it’s working (or not). You can also get alerted when things break.

Network Switch Monitoring: How to Monitor Switch Performance with SNMP

If you’ve spent any time managing networks, you know switches are the backbone that keeps everything connected, but it’s easy to take them for granted until something breaks. Monitoring network switches isn’t just “nice to have”; it’s critical if you want to avoid those sudden outages that bring everything to a halt.

AI in observability at Grafana Labs: Making observability easy and accessible for everyone

Did you know that observability has been around for more than six decades? It all goes back to a Hungarian-American inventor named Rudolf Kálmán who thought about how external outputs could measure the internal state of a machine. Kálmán wrote about monitoring single-input single-output systems, but our demands are very different today. We need to observe monoliths, microservices, clusters, pods, regions, and many more.

Early Warning Signals: Now in Microsoft Teams

As promised, we’re continuing to expand our Early Warning Signals coverage. In addition to our recent integrations for Slack, SMS, and Webhooks, we’re excited to announce that Early Warning Signals now works in Microsoft Teams. This is another step toward making early outage alerts accessible wherever your conversations happen.

Mastering Service Configuration in Icinga Director

The Icinga Director configuration tool makes it easy to define monitoring objects through the web UI and deploy them to the Icinga 2 API. In this blog post, I’ll walk you through how to configure services in Icinga Director. If you haven’t used Icinga Director yet, take a look at our introduction. I assume that most of you are already familiar with Icinga 2 and have used the DSL to define objects.

Optimizing PHP-FPM: Tips to Boost Your PHP Application Performance

When your PHP-based application starts attracting thousands of visitors, the way you run PHP becomes critical. A slow-loading page or a server crash during peak hours can cost you revenue, users, and reputation. PHP-FPM (PHP FastCGI Process Manager) is the default way most high-performance websites run PHP. While its default configuration works fine for small to medium workloads, high-traffic applications need custom tuning to handle large volumes of requests efficiently.

Optimize Your E-Commerce Platform with PHP Performance Monitoring

In e-commerce, seconds can mean millions. A one-second delay during checkout can slash conversion rates by 7% and send frustrated customers straight to your competitors. Most modern e-commerce platforms, such as Magento, WooCommerce, and Laravel-based solutions, run on PHP, making PHP application performance monitoring (APM) not just a nice-to-have, but a revenue-critical necessity.

You built the MCP server. Now track every client, tool, and request with Sentry.

TL;DR - Starting today, you can instrument most server-side JavaScript SDK based MCP servers with one line of instrumentation code within your MCP SDK implementation. With this in place, you’ll be able to see details like protocol usage, client usage, traffic, tool usage, and performance across your MCP implementation.

Simplify XML log collection and processing with Observability Pipelines

In Microsoft-based environments, Windows event logs capture critical security events like user logins, privilege escalations, and system changes. These logs are vital for compliance and investigations. However, they’re natively formatted in XML, a verbose and deeply nested structure that is hard to search without preprocessing and inefficient to store.

AI for Grafana onboarding: Get your teams started quicker with Grafana Assistant

Grafana puts a powerful set of observability capabilities right at your fingertips, but onboarding entire teams to the sophisticated platform is often a nontrivial exercise—one that can slow adoption and prevent organizations from getting immediate value. We want to make the process as frictionless as possible, which is why we’re excited to tell you that Grafana Assistant is now available in public preview to all Grafana Cloud users.

REST easy with REST Packs

The countdown to CriblCon 25 is on and we’re giving you an exclusive first look at the expert insights, innovative solutions, and success stories you’ll see on the big stage. REST collector configuration can be painful, requiring navigating to multiple screens and importing multiple configuration files, but it’s about to get a lot easier. Join Cribl experts to preview how easily you can install and build new packs with new enhancements.

Getting Started with Grafana Cloud's AI Assistant for Observability

The pace of software delivery in 2025 is unprecedented: cloud-native apps, microservices, and AI-generated code are shipping in days, not months. But one challenge never changes: ensuring reliability and visibility when systems fail. In this video, we explore how the new Grafana AI Assistant brings true, context-aware observability to your stack. Watch as we deploy an open-source Python service with Kafka, Postgres, Kubernetes, and Prometheus, then use the AI assistant to instantly generate dashboards and alerts and reduce unneeded telemetry volume.

The IT story behind 911 emergency services

At 2:37am on a cold Oregon night, a fire alarm blared at a rural station. Seconds later, the call came in: a structure fire on the outskirts of Rogue Valley. But what if that alarm never reached the station? This isn't a hypothetical. For the IT team at Emergency Communications of Southern Oregon (ECSO 911), it’s the kind of emergency scenario they prepare for every day.

How ELSER Transforms One Keyword into Better Search Results

In this session, we’ll show you how Elastic's ELSER takes a single token like “Terminator” and expands it into semantically related terms such as software, alien, computer technology, and Connor (for John Connor). This makes search results more relevant, even when the exact keyword isn’t used.

How to Monitor NVIDIA GPU Metrics with Cribl Edge & Stream (Complete Tutorial)

If you’re running AI, ML, or data-intensive workloads on GPUs, monitoring their performance is critical. Overheating, under-utilization, or memory bottlenecks can cost you thousands in cloud bills and potential downtime. This guide walks you through collecting real-time GPU telemetry using nvidia-smi, sending it to Cribl Edge, routing it through Cribl Stream, and using Cribl Search to analyze the data—step by step.
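
As a small illustration of the first step, the sketch below parses the CSV that `nvidia-smi` prints when asked for specific fields. The command shown in the comment is not executed here; the sample output is made up, the function name is hypothetical, and how Cribl Edge actually ingests this data is not shown. Fields that `nvidia-smi` reports as unavailable are not handled.

```python
import csv
import io
from typing import Dict, List

FIELDS = ["utilization.gpu", "memory.used", "temperature.gpu"]

# Command that would produce the CSV parsed below (one line per GPU):
#   nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu \
#              --format=csv,noheader,nounits


def parse_gpu_csv(text: str) -> List[Dict[str, float]]:
    """Turn headerless, unitless nvidia-smi CSV output into one dict per GPU."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        rows.append({name: float(value) for name, value in zip(FIELDS, row)})
    return rows


# Made-up sample output for two GPUs: utilization %, memory used (MiB), temperature (C).
sample_output = "45, 2048, 67\n98, 15360, 83\n"
for index, gpu in enumerate(parse_gpu_csv(sample_output)):
    print(index, gpu)
```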

Public vs private status pages [cost analysis, security, compliance, and more]

When your service goes down at 3 AM, how do you communicate with your customers? This question keeps DevOps teams and customer success managers awake at night, and for good reason. The way you handle incident communication can make the difference between retaining customer trust and watching it evaporate. Status pages have become the standard solution for incident communication, but there's a critical decision every organization faces: should your status page be public or private?

Data Center VXLAN Overlay Visibility at Scale

VXLAN overlays bring flexibility to modern data centers, but they also hide what operators most need to see: true host-to-host and service-to-service traffic. Kentik restores that visibility by decoding VXLAN from sFlow, exposing both overlay endpoints and underlay paths in a single view without the cost and complexity of pervasive packet capture. The result: faster troubleshooting, smarter capacity planning, and confident operations at scale.
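
To show what "decoding VXLAN" involves at the header level, here is a minimal sketch that reads the 8-byte VXLAN header defined in RFC 7348 and returns the 24-bit VXLAN Network Identifier (VNI) plus the encapsulated inner frame. It is not Kentik's implementation: the function name and the sample bytes are hypothetical, and extracting the packet sample from sFlow is assumed to have already happened.

```python
import struct
from typing import Tuple

VXLAN_HEADER_LEN = 8
VNI_VALID_FLAG = 0x08  # the "I" flag: set when the VNI field is valid


def parse_vxlan_header(payload: bytes) -> Tuple[int, bytes]:
    """Return (VNI, inner frame) from a UDP payload carrying VXLAN."""
    if len(payload) < VXLAN_HEADER_LEN:
        raise ValueError("payload shorter than a VXLAN header")
    # Word 1: 8 flag bits + 24 reserved bits. Word 2: 24-bit VNI + 8 reserved bits.
    word1, word2 = struct.unpack("!II", payload[:VXLAN_HEADER_LEN])
    flags = word1 >> 24
    if not flags & VNI_VALID_FLAG:
        raise ValueError("VNI-valid flag is not set")
    vni = word2 >> 8
    return vni, payload[VXLAN_HEADER_LEN:]


# Made-up sample: VNI 5000 followed by placeholder bytes standing in for the inner frame.
sample = struct.pack("!II", VNI_VALID_FLAG << 24, 5000 << 8) + b"inner-frame"
vni, inner = parse_vxlan_header(sample)
print(vni, inner)
```

The inner frame is what carries the overlay endpoints; the outer IP and UDP headers, not shown here, describe the underlay path.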

Monitor the Performance of Your Node.js Fastify App with AppSignal

Fastify stands out among Node.js web frameworks for its obsessive focus on performance and boasts impressive benchmarks, with throughput often 2-3x higher than Express and other popular alternatives. But here's the paradox: without proper visibility, even applications with a good foundation will degrade over time as you add features and complexity.

5 PCI DSS File Transfer Requirements You Can Meet With Serv-U

Compliance with the Payment Card Industry Data Security Standard (PCI DSS) is essential for any organization that handles credit card data, and it extends far beyond databases and payment gateways. One area often overlooked is file transfer workflows, which can pose serious risks if not properly secured.

Network Visualization: 4 Ways to Visualize Computer Networks

Network visualization is the process of visually representing networks of connected entities, like devices, data flows, or relationships, using nodes and links. This technique helps in understanding complex data, identifying patterns, and improving network management by providing a clear visual overview of the network’s structure and behavior.

LLM-powered insights into your tracing data: introducing MCP support in Grafana Cloud Traces

Distributed tracing data is a unique and powerful observability signal, allowing you to understand how your services interact and the relationships between them. Sometimes it can be difficult, however, to turn raw tracing data into actionable insights. This is exactly why we introduced Grafana Traces Drilldown, an application that lets you quickly investigate and visualize your tracing data through a simplified, queryless experience.

Elastic wins 2025 Google Cloud DORA Award for Architecting for the Future with AI

Applying DORA principles to improve software delivery and operational performance with Google Cloud. We’re thrilled to announce that Elastic has been honored with the 2025 Google Cloud DORA Award for Architecting for the Future with AI. Google Cloud DORA awards recognize organizations that have demonstrated significant advancements by applying DORA principles to improve their software delivery and operational performance with Google Cloud.

Site Reliability Engineering vs DevOps: Which Approach Fits Your Organization?

Choosing between Site Reliability Engineering (SRE) and DevOps can feel like picking between two similar but distinct philosophies. Both aim to improve software delivery and system reliability, but they take different paths to get there. Understanding these differences helps you make an informed decision about which approach aligns best with your organization's goals, culture, and technical needs.

If you want to monitor reality, you have to monitor your users' perspective.

Not from your data center. Not from your internal network. Not from your controlled environments. Real users are on hotel Wi-Fi, public LTE, spotty networks, global cloud providers. To understand their experience, your monitoring needs to reflect their reality: location, device, network, context.

IT can save the planet

When we think about saving the planet, we usually imagine solar panels, electric cars, or governments making sweeping climate policies. Rarely do we picture rows of blinking servers in a data center or IT admins patching endpoints. Maybe we should. In today's world, the intersection between technology and sustainability is becoming impossible to ignore, and the IT industry is right at the center of it. The truth is, IT is both part of the problem and part of the solution.

Error Analysis in Honeycomb for Frontend Observability Now in Public Beta

You just shipped your latest frontend release. It passed QA, CI ran, and it looked great in pre-production. But now it’s live and users are hitting an unexpected error: TypeError: undefined is not a function in Chrome. Your error tracking tool flags the exception. You get a stack trace, some breadcrumbs, maybe a session replay.

Advanced PHP Monitoring for Enterprise Applications

During critical business periods, enterprise PHP applications can experience significant performance challenges, including slow page loading, workflow delays, and essential integrations timing out. As a result, operational efficiency declines, customer satisfaction decreases, and revenue streams are at risk. Enterprise PHP applications power complex business portals, SaaS platforms, internal tools, and mission-critical workflows.

What Is Log Monitoring (and Why IT Teams Are Shifting to Log Intelligence)

Your infrastructure isn’t confined to a single location anymore. It’s spread across clouds, containers, and on-prem systems, and every layer is spitting out logs: access attempts, performance spikes, error codes, config changes. That data is invaluable if you can find the signal in the noise. But with millions of logs flying by every day, that’s easier said than done.

2025 Buyer's Guide - Choosing Unified Infrastructure Monitoring

Unified infrastructure monitoring delivers a single, enterprise-grade platform to oversee hybrid environments, providing real-time insights and proactive health monitoring across on-premises, cloud, and edge systems. As 2025 brings new challenges with artificial intelligence (AI), edge computing, and hybrid complexity, SolarWinds stands out as a thought leader in unified infrastructure monitoring for enterprises.

Why AIOps Isn't Optional Anymore: The Metrics That Prove It

The CFO slides a single sheet of paper across the conference table, without saying a word. It’s not a budget approval or strategic roadmap—it’s a simple question written in red ink: “What’s our ROI on IT operations?” For too many IT leaders, this moment represents a reckoning. After years of investing in monitoring tools, staffing up operations teams, and implementing “best practices,” the measurable business impact remains frustratingly unclear.

Grafana Tempo: Performance Moonshots & MCP Server (Community Call August 2025)

We'll have Marty talking about Grafana Tempo Performance Moonshots and Joe will update us with what's new with the MCP Server! Have questions? Please bring them! Can't comment in the chat? You may need to create a channel. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more.

Inside a Cybersecurity War Room - SolarWinds TechPod 101

It's CSOC o'clock! In this episode, we dive into the high-stakes world of cyber defense with the manager of cybersecurity operations at a critical infrastructure organization. From ransomware threats and zero-day exploits to the rise of nation-state-backed Advanced Persistent Threats (APTs), our guest reveals how security teams manage 24/7 threats, the mindset it takes to thrive in cybersecurity, and why community collaboration is becoming essential in cyber warfare.

SNMP Device Monitoring: Feature Highlight - Obkio

Tired of noisy alerts and overcomplicated SNMP monitoring tools? Learn how Obkio’s SNMP Device Monitoring blends simplicity and intelligence, giving you fewer alerts, better insights and faster troubleshooting so you can resolve network router, switch and firewall issues in minutes. It always starts the same way. You’re managing your network; maybe it’s five devices, maybe it’s five hundred, and everything should be simple. But instead, you’re caught between two extremes.

Beyond the Pipeline: Data Isn't Oil, It's Power.

Originally published on Medium, this piece by Winston Hearn dives into a philosophical discussion on why the "data is oil" metaphor is no longer serving the tech industry. Hearn argues that by reframing our thinking to "data is power," we can better understand and manage today's complex data systems. For more than a decade, we in the tech industry have referenced a common metaphor: data is the new oil. It’s a concept that’s easy to grasp.

Build secure and scalable Azure serverless applications with the Well-Architected Framework

Serverless platforms like Azure Functions and Azure Container Apps make it easier to scale your applications without managing infrastructure. But successful serverless apps require thoughtful planning. They must be designed to account for cold starts, unpredictable scaling behavior, and ephemeral compute lifecycles, all while ensuring secure data handling and end-to-end observability across highly distributed components.

Why Visibility Is the #1 IT Priority in 2025: Tackling Shadow AI and Emerging Risks

AI adoption is progressing at a rapid pace. What started as a trickle of generative tools is now a flood of autonomous agents, custom copilots, and AI-powered SaaS, most of it entering the workplace faster than IT can keep track of.

What is Shadow AI & What Can You do About It?

Artificial intelligence (AI) is now embedded in everyday professional workflows — so much so that 46% of employees say they would continue using AI tools even if their organization banned them. The productivity gains are undeniable, but this widespread, unmonitored use of AI also introduces growing risks around data security, compliance, and governance.

Observability trends in Brazil: insights from our localized survey

Organizations in Brazil are eager to adopt some of the latest observability trends and technologies as they look to keep their software running as smoothly as possible, according to analysis of a micro survey recently conducted by Grafana Labs. Observability is an evolving space, and this is the first time Grafana Labs has run a Brazilian version of our annual Observability Survey.

Grafana Pyroscope: New eBPF profiler in Alloy & Source Code Integration (Community Call August 2025)

Christian is going to talk about the new eBPF profiler in Grafana Alloy as well as new Grafana Pyroscope Source Code Integration UI updates. Have questions? Please bring them! Can't comment in the chat? You may need to create a channel. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, and traces. Our forever-free tier includes access to 10k metrics, 50GB logs, 50GB traces and more.

PHP Performance Monitoring with Atatus PHP APM

PHP is used by millions of websites and applications around the world because it’s easy to work with and very flexible. But like any technology, PHP apps can run into problems like slow performance or errors that affect users and your business. Atatus PHP APM provides developers, DevOps engineers, and SREs with clear insights into what is happening inside PHP applications, helping them find and fix issues faster, improve performance, and keep things running smoothly.

APM vs observability: why your definitions are broken

Recently I was asked to offer my opinions on Application Performance Management (APM) and Observability (o11y) - how they overlap, compete, and conflict. I was just one of several folks whose ideas were solicited, so (understandably) some of my thoughts were left out of the original article. HOWEVER, I'm never one to let good words (or at least a lot of words) go to waste, so I thought I'd pull them together here.

Best Practices for Managing Multiple Vendor Dependencies

Modern businesses rely on dozens of third-party services to operate efficiently. From payment processors and cloud providers to analytics tools and communication platforms, these vendor dependencies form the backbone of your technology stack. When one fails, it can trigger a cascade of issues across your entire operation. Managing multiple vendor dependencies requires a strategic approach that combines proactive monitoring, clear documentation, and well-defined response procedures.

HTTP status codes? Here's a cheat sheet

Whenever you visit a website or click on a link, there’s a whole conversation happening behind the scenes between your browser and the web server. That conversation includes something called HTTP status codes, and knowing what they mean can help you make a diagnosis, so to speak. Usually, everything goes smoothly (like a 200 OK), but sometimes things break (looking at you, 404 and 500).
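
As a rough illustration of reading those codes, the sketch below maps a status code to its class (the first digit) and its standard reason phrase using Python's built-in `http.HTTPStatus`. The helper name `describe_status` is made up for this example and is not taken from the cheat sheet itself.

```python
from http import HTTPStatus

# The first digit of a status code identifies its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return the code, its standard reason phrase, and its class."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not a code the standard library knows about.
        phrase = "Unregistered code"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 500):
    print(describe_status(code))
# 200 OK (success)
# 404 Not Found (client error)
# 500 Internal Server Error (server error)
```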

Top 7 Application Performance Monitoring Tools

Your application is under constant pressure to deliver low latency and high reliability, and a smooth user experience isn’t optional. When performance drops, every second matters. Application Performance Monitoring (APM) gives you the visibility to spot issues before your users feel the impact. It also helps you understand what’s happening inside your stack, so you can track resource usage, pinpoint bottlenecks, and keep things running at peak performance.

Introducing Checkly Uptime Monitoring: A Fast and Affordable Way to Detect Infrastructure Downtime

Learn more about Checkly, the application reliability platform designed for modern engineering teams! Discover how Checkly enables you to quickly detect, communicate, and resolve production issues and explore the newly added uptime monitoring features, including URL, TCP, and heartbeat monitors. Configure and manage your entire monitoring setup using monitoring as code!

Introducing Logz.io Open 360 AI: The Next Generation of Observability Is Here

Traditional observability tools can’t keep up with modern complexity. Dashboard and alert-based approaches still rely heavily on manual processes, resulting in longer troubleshooting cycles, slower decisions, and higher MTTR. Engineering teams need something better. Today we’re launching Open 360 AI, the first observability platform designed for both humans and AI agents working together.

How To Use Alloy and Hosted Graphite's Loki to Store and Visualize Logs

In a modern DevOps environment, having just metrics or just logs is like trying to navigate with half a map because you’re missing important context that makes decisions faster and smarter. Metrics tell you what is happening (CPU spikes, request rates, failed logins) but logs tell you why it’s happening, with the timestamps to prove it.

Your APIs are up, but did the payment go through?

If your challenger bank is built on composable core platforms like Mambu or Temenos, this one’s for you. Composable platforms enable API-first integration with modular services, letting you launch, adapt, and grow products quickly. That makes API health a top priority — and it shows in our State of API Reliability Report 2025 (we’ve pulled out the key fintech findings for APAC below).

RUM measurements: Start with the data, discover the story

When something breaks in your application, a slow page, a spike in errors, or a drop in engagement, the typical response is to chase the symptoms. But what if we flipped that process? What if we started not from user complaints, but from actual performance measurements, collected from real sessions in real time? That’s exactly the idea behind Coralogix RUM Measurements.

Learn OpenTelemetry tracing through a grand strategy game: introducing Game of Traces

A trace always remembers! Okay, okay. I will try to keep my Game of Thrones references to a minimum throughout this post, but there is a lot of truth to that statement. In observability, a trace is the “when” and “where” of telemetry signals, allowing us to track the state of interactions between services within a microservice architecture. This makes traces the ideal observability signal for discovering bottlenecks and interconnection issues.

What Is a Telemetry Pipeline and Why It Matters in Modern IT

A practical guide for IT professionals, DevOps, security teams, platform engineers, and anyone who’s dealing with logs. In contemporary distributed systems, telemetry data—logs, metrics, traces, and events—serves as the primary mechanism for understanding internal system behavior. However, as system complexity increases, so does the volume and heterogeneity of telemetry.

Why MikroTik VPS Is a Smart Choice for Network Monitoring and Management

Managing complex, distributed networks is no longer optional; it's essential for business success. Such networks often span remote offices and IoT deployments, and managing them without the right toolkit is a heavy burden, since you need to secure uptime, security, and scalability without overspending. If you buy a MikroTik VPS, you may be surprised at how easily these constant headache-causing tasks can be handled, with minimal effort, thanks to the features this technology offers.

How IT Leaders Can Successfully Adopt and Manage SaaS Solutions

In recent months, there has been growing discussion among business and IT leaders around the rapid expansion of SaaS solutions. McKinsey’s recent report on the current state of SaaS notes that while the industry has experienced a slowdown, largely driven by economic factors such as rising interest rates and reduced IT spending by enterprises, it has seen a decade of rapid growth, with the market being valued at approximately $3 trillion in 2022.

GPT-OSS: A Secure Step Forward, But Not a Free Pass

The release of OpenAI’s new open-source model, GPT-OSS, has sparked a wave of excitement across the AI community. And rightly so. For organizations that want the benefits of generative AI without sending data out to the web, this is a compelling option. Running locally, GPT-OSS offers a level of privacy, control, and cost-efficiency that’s hard to ignore. It’s fast, lean, and, at least in its early benchmarks, surprisingly capable in coding, math, and STEM-heavy workloads.

The Ultimate Guide to Incident Management Tools in 2025

Incident management tools play a key role in helping organizations to effectively handle service outages. With so many incident management tools around with different feature sets, it's often difficult to find the one that is right for your needs. In this article, we attempt to make a list of incident management software available in 2025 with their features to help you arrive at the right one. We have focused on tools that have incident management capabilities.

Observing LlamaIndex Apps with OpenTelemetry + SigNoz

LlamaIndex has become a popular choice for building Retrieval-Augmented Generation (RAG) applications, helping developers seamlessly connect large language models with private or domain-specific data. But RAG workflows can be complex: slow retrieval times, irrelevant or inconsistent responses, and silent failures in the data pipeline can all degrade the user experience. That’s why observability is essential.

What Makes PHP Application Monitoring Tools Essential for Leading Industries?

PHP is one of the most widely used scripting languages for web development. From e-commerce platforms to government portals, PHP powers a large share of the web. However, as web applications grow in complexity, user expectations also rise. Slow page loads, broken features, or unresponsive sites can lead to lost revenue, lower engagement, and frustrated users.

What are PHP memory leaks? How can you detect and resolve them with APM?

According to the 2025 PHP Trends Report, 31% of developers cited performance bottlenecks as a recurring issue and PHP memory leaks were among the top culprits identified by DevOps teams working with high-traffic applications. Imagine you're shipping an app that’s humming along smoothly during QA. But weeks after going live, you start noticing creeping latency and irregular job failures. You dig into the logs, tweak some queries, but the issue persists.

VictoriaLogs Practical Ingestion Guide for Message, Time and Streams

This VictoriaLogs article serves as a quick way to grasp the core concepts of VictoriaLogs. It covers only the most important information from the documentation, along with common cases identified after troubleshooting many real-world scenarios. If you’re just getting started with VictoriaLogs, this is a great place to begin. For more in-depth or advanced details, refer to the official documentation.

How to use SQL to learn more about your Grafana usage

Wilfried Roset is an engineering manager who leads an SRE team, and he is also a Grafana Champion. Wilfried currently works at OVHcloud, where he focuses on prioritizing sustainability, resilience, and industrialization to guarantee customer satisfaction. Grafana needs a database to store all its objects, such as users, dashboards, or even data sources. Each time a user creates a dashboard, it results in a new row created in the database.

What Is Network Jitter and How It Affects Your Connection: Causes, Tests and Solutions

Streaming movies and series, VoIP, video conferencing, remote work, competitive gaming… the network shoulders ever more pieces of modern life, and it had better not fail, or we end up like Michael Douglas in *Falling Down*. One of the issues that can make it fail is network jitter, which we’ll cover in depth here.

Migrating to Citrix Cloud Without Breaking the Business

Not every migration is about rushing to the cloud. More often, it’s about timing, precision, and ensuring that end-users remain unaware of any underlying change. The goal isn’t just to modernize. It’s to do so without disrupting what’s already working. One of our customers, a global feed company with over 1,000 daily Citrix users, reached out to us for guidance.

Top tips: The secret to a better workday? It's in the little things

Top tips is a weekly column where we highlight what’s trending in the tech world and list ways to explore these trends. This week, we’ll see how fixing small inconveniences at work can make things easier and help us get more done. “It is often the small steps, not the giant leaps, that bring about the most lasting change.” – Queen Elizabeth II. It's the little changes in life that bring lasting effects. Small, incremental improvements often add up to meaningful comfort over time.

Icinga DB Web Automation

Icinga DB Web Automation allows you to automate monitoring tasks and integrate them directly into your systems and workflows. It is possible to issue command actions without a browser. To do so, a form needs to be submitted by a tool such as cURL. Every request you send follows the same permission rules and access restrictions defined in the web interface, so security and user roles still apply. Want to target specific hosts or services? Simply add filter parameters to the URL.

Visualizing Logs Alongside Metrics: A Practical Use Case

Security threats aren’t always loud and don’t always crash systems or trigger alarms. Sometimes they creep in quietly as a steady stream of unauthorized login attempts, slow brute-force probes, or unknown IPs scanning your server for vulnerabilities. These behaviors often show up in logs before they surface in metrics but if you're only watching logs or only tracking metrics, you're missing part of the story.

Introducing our new notification logs

One of the core features of Oh Dear is that we can notify you whenever we detect problems with one of your sites. Our notification system is quite powerful. We support many different channels (like email, Slack, Telegram, ... and a whole bunch more), and have fine-grained control over which events should trigger a notification. Today, we've added notification logs.

Get the Full Picture: AppSignal Adds OpenTelemetry Support

We're excited to officially launch our OpenTelemetry instrumentation. AppSignal is now able to expand our observability to a dozen popular languages, frameworks, and tools, giving customers the deep insights they need to monitor their entire stack. In this article, we'll show you how you can use AppSignal and OpenTelemetry to proactively monitor your app.

What is Network Management?

International businesses and near-citywide college campuses require effective network management solutions to minimize downtime, optimize performance and strengthen cybersecurity. In summary, network management helps maintain the efficiency, reliability and security of a local and/or cloud-based network. However, developing a viable network management strategy requires an understanding beyond its actions.

How DX NetOps Topology Streamlines and Optimizes Triage

Every network operator knows the feeling: a critical alert fires, and suddenly it’s all hands on deck. But instead of jumping straight to resolution, you find yourself sifting through irrelevant alerts, flipping between tools, and trying to assemble a puzzle with missing pieces. In today’s high-stakes, hybrid environments, that kind of delay isn’t just frustrating—it’s costly. When issues arise, fast, intelligent triage is a must.

Keep an eye on remote access to your Kubernetes infrastructure with Datadog Workload Protection

To improve efficiency and reduce cloud spending, teams frequently schedule pods on Kubernetes nodes dynamically, based on available resources. However, this practice has also introduced a new security challenge: The workloads maintained by a development team are now spread between Kubernetes nodes, exposing more hosts and increasing the blast radius when user credentials are compromised.

Getting started with Freshdesk dashboards

Freshdesk is a popular incident management system known for its ease of use, robust ticketing system, and powerful automation capabilities as part of the Freshworks suite of tools. While Freshdesk comes with native reporting and dashboards, they can be limited in terms of customization and data correlation across different sources. Additionally, building complex visualizations in Freshdesk often requires more advanced knowledge of their reporting tools. This is where SquaredUp comes in!

Using GreptimeDB as Prometheus Data Lake in Coroot

Coroot is excited to feature an editorial from the open source observability database GreptimeDB as an Open Source Spotlight. We hope to improve the work of our global community of SREs and DevOps professionals by sharing exciting projects like GreptimeDB, which make innovation accessible for everyone through the freedom of open source.

AI-driven alert triage and root cause analysis (RCA) that proactively responds to production alerts

Watch AI transform alert management in real-time. This technical demonstration compares manual alert investigation with AI alert investigation. It shows how AI agents automatically investigate production alerts, correlate telemetry across distributed systems, and identify root cause, faster and with more insights than manual processes. Watch and learn how to shift your team from reactive firefighting to proactive system reliability management with agentic AI.

What Your SD-WAN Isn't Telling You

Your SD-WAN is constantly making decisions. It assesses path quality based on metrics like packet loss, latency, and jitter, and steers traffic for your most critical applications accordingly. For this, it is an indispensable technology. But have you ever paused to ask a fundamental question: Is the path it chooses truly the best one available, or just the best one it can see from its limited vantage point?

Common Unity errors and how to fix them

Unity has a reputation for handing out surprises: the play-mode freeze just after a hot-reload, the sudden sea of pink materials, or the stack trace that politely reminds you your transform was null all along. Rather than letting those moments derail the rest of your sprint, this post rounds up four of the most common runtime offenders, and shows you exactly how to trigger, spot, and fix each one.

Tracing asynchronous systems in your event-driven architecture: When to use parent-child vs. span links

Asynchronous communication patterns are commonly used in distributed systems, especially in those that rely on events or messages to coordinate activity. Rather than responding to direct API calls like in a traditional request-response architecture, services in an asynchronous system produce, route, or consume events and messages independently.

How to build reliable and accurate synthetic tests for your mobile apps

Mobile applications offer increased flexibility to both users and developers. Users can access content on a wide range of devices, operating systems, and network types, while developers can leverage touch screens and orientation-based layouts to create more responsive features. However, all of these factors create new testing challenges. To ensure a good user experience (UX), developers have to test their apps across many device models and platforms, which can become costly and time-consuming.

Deletion protection in Grafana Cloud: a simple way to safeguard your observability stack

We’ve all had that “uh-oh” moment. You press Enter and your blood runs cold, as you realize you just deleted something critical. For engineering teams, this type of disaster takes many forms. For example, maybe you used a DELETE statement without a WHERE clause to delete a row in a database, and accidentally deleted all of them instead. To protect you from the accidental deletion of critical resources in Grafana Cloud, we’re introducing a feature called deletion protection.

Powering What's Next: ScienceLogic's Vision for Intelligent, Outcome-Driven IT

The observability market is changing rapidly. The days of simply collecting logs, metrics, and traces are giving way to something bigger: delivering actionable intelligence that actually connects IT operations with business goals. Organizations don’t just want to know what’s happening anymore; they need to understand why it’s happening, what actions to take, and whether their systems can respond independently.

Introducing the Coralogix SLO Center

Are you struggling to define reliability targets? Teams nowadays are turning to Service Level Objectives (SLOs), reliability targets that can be used to define how much you can play around with your systems before users are affected too much. While they're a great way of defining reliability targets, they are difficult to manage. That's why we built the SLO Center. One place to define, track, zoom into, and stay on top of all your reliability targets and error budgets - so you can be sure when you can experiment, and when it's best to stay safe.
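
To make the idea of an error budget concrete, here is a minimal sketch of the arithmetic for an availability SLO: the budget is the fraction of the time window the service is allowed to be unavailable. The function names and the 30-day window are assumptions chosen for illustration; they do not describe how the Coralogix SLO Center computes anything.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% target over 30 days allows about 43.2 minutes of downtime.
print(f"{error_budget_minutes(0.999, 30):.1f} minutes")
# After 10.8 minutes of downtime, about 75% of the budget is left.
print(f"{budget_remaining(0.999, 30, 10.8):.0%} remaining")
```

When the remaining fraction is comfortably positive there is room to experiment; as it approaches zero, it is safer to hold back risky changes.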

Nothing about today's Internet stays in one place... so why does your monitoring?

Users are mobile. Apps are elastic. Traffic shifts constantly across clouds, ISPs, and geographies. Monitoring needs to adapt to that reality. You need visibility that moves with your users and your applications, wherever they go, however they connect. The Internet is now your application fabric. And your monitoring strategy should reflect that!

AI Replay Summaries in Sentry Arrive!

Replays in Sentry are awesome. With one property in your Sentry config you can start capturing video-like replays of user interactions with your application, but the problem is... you still have to watch them... but not anymore! AI replay summaries take your replays and run the events through an LLM to summarize the events that happened in them. They are broken up into chapters, with the breadcrumb sequences embedded in, so you can quickly get context of what's happening in every replay.

3 Signs You've Outgrown Scripts and Spreadsheets for Network Configs

In the early days of any IT operation, pragmatism rules. Most network teams start with what’s readily available—custom scripts, Excel spreadsheets, shared network drives, and tribal knowledge. It’s cost-effective and familiar. But as your organization grows, so does the complexity of your network. Devices multiply, configurations diversify, and the operational risk of keeping everything “stitched together” with manual methods increases exponentially.

Weaponized AI vs. AI Driven Security Posture Management: Why the Battle Starts in Misconfigurations

On August 5, 2025, at Black Hat 2025 in Las Vegas, Abnormal AI officially launched its Security Posture Management for Microsoft 365. This release marks a critical turning point. In an era where attackers weaponize AI to uncover and exploit misconfigured cloud environments at machine speed, reactive security simply can’t keep pace. Threat actors are now leveraging automated AI to scan systems, identify configuration drift, escalate privileges, and deploy zero‑day exploits in seconds.

Size-capped telemetry storage with ClickHouse and Coroot

Cloud platforms make it incredibly easy to store data. Object storage feels endless, and block volumes can be resized anytime. That’s great, until you check the cost. In some cases, like financial transactions, storage costs are tiny compared to the value of the data. But observability is a different story. Logs, traces, and profiles can be extremely detailed and often take up more space than the actual business data. Yes, there are situations where logs need to be kept for compliance reasons.

Boosting Session Replay performance on iOS with View Renderer V2

After making Session Replay GA for Mobile, the adoption rose quickly and more feedback reached us. In less great news, our Apple SDK users reported that the performance overhead of Session Replay on older iOS devices made their apps unusable. So we went on the journey to find the culprit and found a solution that yielded 4-5x better performance in our benchmarks.

Balancing Speed and Safety with Continuous Delivery

The benefits of continuous delivery are well known these days: rapid feedback, speed of innovation, reduced fault recovery time, and increased confidence in release processes. Along the same lines, those who release less frequently are likely to encounter more stress. Continuous delivery is a spectrum; it doesn’t have to mean blasting every commit to all production environments at once. So, how do we strike a balance between speed and safety?

Log Format Standards: JSON, XML, and Key-Value Explained

Your log format defines how your application records events. The structure you choose shapes how logs get parsed, indexed, and queried. It affects how quickly you can debug issues, build alerts, or control storage usage. In this guide, we'll take a look at the log formats developers typically use, the essential fields to include, and what trade-offs to consider before locking down a format for your system.

Manual vs. AI-Driven Alert Triage and RCA: Who Will Win?

Curious to see how AI actually performs in a real-world production scenario? Watch the webinar “AI-Driven Alert Triage and RCA” with Logz.io Customer Success Engineer, Seth King. Below, we also share the main highlights of the webinar. AI claims to make engineers more efficient and agile by shortening processes and surfacing insights that help drive decisions.

A guide to cloud unit economics

As you analyze your organization's cloud spending, you'll often find that stakeholders have different perceptions of what that spending brings you. This is especially true when overall costs are rising and it's hard to distinguish waste from valuable investments in growth. But when finance, engineering, and product teams can all connect cloud spending to specific business outcomes, you gain the ability to make data-driven decisions about how to maximize the value of that spending.

Network Visualization Tools: Key Features and Top 6 Tools in 2025

Network visualization tools are software applications that allow users to represent, explore, and analyze network structures graphically. These networks can include computer and telecommunication infrastructure, as well as social, biological, and organizational networks. Visualization is achieved by displaying nodes (entities) and edges (relationships), making complex datasets easier to interpret and manage.

Prevent cloud misconfigurations from reaching production with Datadog IaC Security

Modern infrastructure is built and deployed faster than ever, but increased speed can elevate risk. Developers who work on cloud-native applications often use infrastructure as code (IaC) to define cloud resources in configuration files, which are then shared across teams and deployed automatically. Although this approach is efficient, undetected misconfigurations in IaC can quickly introduce security risks into production environments.

Can External Data Predict System Failures?

Something critical just went down. Again. So you troubleshoot and find out everything's clean - logs, metrics, nothing seems out of the ordinary. You didn't think to look out the window, right? Let's rewind a couple of hours. The temperature spiked 15 degrees outside, the humidity was at 90% and a storm came out of nowhere. Meanwhile, your edge device is sitting in a box on a pole somewhere; it never stood a chance.
Sponsored Post

AI realism (part one)

Emotions are running high about AI technologies. In this 2-parter, I do my best to make a rational case on the reality of AI, and how we can respond to it. This is part one; part two next week. We seem to be struggling to have pragmatic discussions about advancements in Artificial Intelligence. It's hard to hear calmer voices over the detractors and breathless enthusiasts. Today, I want to make a reasoned, evidence-based case for the potential of this technology, glance at present and future applications, and offer some practical examples for implementing AI within an organization.

Pinpointing Logon Duration Issues with Precision: Game-Changing Enhancements in MetrixInsight for Citrix VAD/DaaS

At GripMatix, we’re committed to giving IT teams deep, actionable visibility into their Citrix environments, going well beyond what’s available in native tools like Citrix Director or Monitor. With our latest update to the Citrix User Experience (UEX) Analyzer in MetrixInsight for Citrix VAD/DaaS, we’ve taken diagnostics and troubleshooting to the next level by introducing powerful new metrics and insights.

PostgreSQL Performance: Faster Queries and Better Throughput

A PostgreSQL setup that performed well with 10,000 users starts to show strain at 100,000. Queries that once returned in under 50ms now take over 2 seconds. The connection pool regularly hits its limit during peak usage, leading to timeouts and degraded performance. This blog focuses on practical ways to reduce query latency by 50–80% and increase throughput for high-concurrency environments.

Goodput vs Throughput: The Differences and How They Affect Your Network

Two key metrics that often come up in discussions about network performance are throughput and goodput. While these terms may seem similar, they highlight different aspects of your network’s efficiency, and misunderstanding them can lead to poor decisions about how you manage your network and your business’s resources.

Resilience with Zero Data Loss in High-Volume Telemetry Pipelines with OpenTelemetry and Bindplane

This was the problem one Bindplane customer had with processing enormous S3-stored log files. Our engineering team tackled the problem head-on, enhancing the S3 event receiver with offset tracking and chaos testing methodologies.

Secure by Design: IT Modernization for Government

As government agencies modernize IT infrastructure, many are shifting to hybrid and multicloud environments. But this evolution brings heightened exposure to cyber threats. For the public sector, where data protection is tied to national security and public trust, compliance is more than a box to check—it’s the front line of defense. FedRAMP (Federal Risk and Authorization Management Program) provides a standardized framework for securing cloud services used by U.S. agencies.

Ten Minute Troubleshooting: Meet (and Monitor) Users Where They Are

What do you do if your monitoring, APM, and synthetic tools tell you an application is up, but the users say it’s not? A good first step is to ask where your monitoring tools are located relative to both the users and the application itself. In this episode Mursi helps Leon identify his “red light, green light” issue and adjust his monitoring to do a better job showing the REAL user’s experience.

Behind the Dashboard - Catchpoint Traceroute

Behind the Dashboard is an ongoing series where we look under the hood of a specific Catchpoint feature. Each episode breaks down the technology itself, what’s challenging about using it for monitoring, and how we removed friction and toil to make it a valuable part of the Catchpoint platform. In this episode Leon, Brandon, and Sergey take a look at “traceroute” tests – a feature that may seem humble and unassuming, but has unexpected power and utility when it comes to identifying performance issues with your site, service, or application.

Coralogix SLO Center & SLO Alerts are now available

Coralogix has released a new flagship service management product, the SLO Center. The SLO Center allows customers to define service level objectives (SLOs) for their teams. SLOs can be defined across multiple services or metric streams. Powered by the Coralogix Streama engine, this unlocks full coverage SLOs for every team, regardless of volume and with very high cardinality limits.

Coralogix becomes first observability vendor to earn ISO/IEC 42001:2023 certification for responsible AI

We’re proud to announce that Coralogix is now officially ISO/IEC 42001:2023 certified, becoming the first observability vendor to achieve this globally recognized standard for responsible AI management. ISO/IEC 42001:2023 is the world’s first international standard for Artificial Intelligence Management Systems (AIMS). It provides a comprehensive framework for how organizations should govern AI, focusing on transparency, ethical use, accountability, and regulatory compliance.

New Feature - Vulnerable System Drivers Monitoring

Vulnerable system drivers continue to be a vector exploited by attackers to compromise systems. In eG Enterprise version 7.5 we added a number of periodic security checks to help administrators proactively identify weaknesses, including vulnerable system drivers monitoring. This new capability is supported for Windows OS when using a VM agent for inside view monitoring and/or when monitoring an Azure Virtual Desktop session host.

Leaning into AI, ML, and observability to manage your ever-growing infrastructure

The complexity and scale of modern infrastructure require an equally intelligent set of observability tools to effectively monitor it. Remember when scaling meant ordering new servers and racking them in a data center? Remember when cloud providers first offered access to seemingly infinite virtual machines at the click of a button? Remember when Kubernetes made it trivial for infrastructure to automatically scale itself based on demand?

New in Grafana Alerting: a faster, more scalable way to manage your alerts in Grafana

Effective alerting is the backbone of any observability strategy. But as your systems grow, managing hundreds or even thousands of rules can become a significant challenge. And when something goes wrong, the last thing you want is to fight with your tooling. That’s why we’re thrilled to announce the launch of our brand new alert rules list page, which we built to provide a faster, more intuitive, and scalable experience for teams of all sizes!

Getting started with MongoDB dashboards

MongoDB is a popular NoSQL database used by many modern web applications. Once your web application is up and running, you might find you need to monitor the application data for operational purposes. For example, you may need to report on user sign-ups, or monitor for problems like invalid data. SquaredUp is an easy-to-use dashboard that plugs directly into your MongoDB database to visualize and monitor your data.

Patterns for safe and efficient cache purging in CI/CD pipelines

"There are only two hard things in Computer Science: cache invalidation and naming things."—Phil Karlton In the age of increasingly frequent deploys, edge caching, and Jamstack adoption, caching plays a key role across the software delivery life cycle. In build and CI pipelines, caching compiled assets or dependencies helps reduce compute costs, speed up job runtimes, and lower the environmental impact (regarding energy usage) of repeated builds.

What's the easiest way to check my website's uptime?

Whether you're keeping a personal blog or managing a corporate site or online storefront, website downtime can cost money and damage your reputation, all the more so when you're maintaining a bunch of different client sites. And while downtime can't always be prevented, it's really easy to at least keep track of things, and diagnose potential issues from there. So, let’s start with the easy part.

What are Application Metrics?

Application metrics are structured, quantifiable signals that reflect how your software behaves in production. They capture key aspects of performance, response times, error rates, throughput, and resource usage, giving you a real-time view into the health of your system. Tracking the right metrics helps detect regressions early, surface latent issues before they impact users, and guide optimization decisions based on hard data, not guesswork.

Top 5 EdTech outages detected by StatusGator in July 2025

July 2025 saw several significant service disruptions affecting the education technology (EdTech) ecosystem. From online learning platforms to creative tools used by teachers and students, these outages caused widespread frustration. StatusGator monitored and detected these incidents, providing early alerts to help schools and organizations stay informed.

Introducing Cribl Guard

Does sensitive data flowing through your network feel like a ticking time bomb? Well, it just might be. Legal mandates, security frameworks, and customer expectations have made the stakes higher than ever. One leaked spreadsheet of personally identifiable information (PII) can wipe out years of customer trust, rack up regulatory fines, and invite ransomware actors to your doorstep.

The Outage You Can't Afford: Why CMI/CME Providers Need Autonomous Operations Now

Imagine if degrading network performance—not just bad code—disrupted your live stream during a high-profile event. Customers start flooding support lines. Social media lights up. Your NOC teams scramble to identify the root cause amid fragmented systems. The outage impacts not only your broadcast, but also subscriber logins, ad delivery, and mobile apps. Advertisers want refunds. Executives ask, “Why didn’t we see this coming?”

Domain Expiry and Its Impact on SEO: How to Monitor and Prevent Lapses

Your domain name is your digital real estate. It is how customers find you, search engines rank you, and your brand builds trust in the digital world. Whether you run a small blog, an e-commerce store, or a large business, your domain is the foundation of your online activities. But what happens if you forget to renew it? A domain expiry can cause your site to go offline. It can also hurt your SEO rankings and affect your website traffic.

Building on the foundation of OpenTelemetry eBPF Instrumentation: what's new in Grafana Beyla 2.5

Earlier this year, Grafana Labs donated Grafana Beyla — our open source eBPF-based, zero-code instrumentation tool — to OpenTelemetry under the new project name OpenTelemetry eBPF Instrumentation. In addition to reflecting our deep and long-standing commitment to the OpenTelemetry project, the donation marked a significant milestone in the evolution of zero-code eBPF instrumentation within the open source community at large.

The Platform Engineer's Playbook: Mastering OpenTelemetry & Compliance with Mezmo and Dynatrace

The rise of platform engineering has put a new team at the center of the developer experience. These teams are tasked with building the "paved road" for developers, which includes providing a robust, self-service observability stack. However, they face a dual mandate: provide a great developer experience and manage the ever-growing costs and complexity of the tools involved.

The MSP's DNS Security Checklist

DNS is one of the most important and most overlooked layers in your client’s infrastructure. As an MSP, you’re often the one who gets blamed when something breaks—whether you control the DNS or not. And while many DNS problems are silent, their consequences are loud: email failures, website outages, and frustrated clients. This DNS security checklist will help you proactively identify and fix DNS risks across all your client domains.

Save Hours on Troubleshooting with Automated Investigations

How many times has your team stared at a dashboard, pointed to a spike, and asked a question that charts alone can’t answer? “What was the real impact of that deployment?” “Why are our Kubernetes pods in the us-east-1 cluster suddenly crashing?” “Are we wasting money on overprovisioned servers?” Answering these questions is the real work of operations and SRE.

Tracking Safety: The Role of Mobile Monitoring in Protecting Vulnerable Family Members

It's never been easier to stay connected with the people you care about. Thanks to smartphones and GPS technology, families now have powerful tools to protect their loved ones, whether they're across town or across the country. But these same tools raise important questions: how much should we monitor, and when is it necessary? Let's explore how mobile tracking can help safeguard the most vulnerable members of our families, from kids to grandparents, and how to use it responsibly.

How We Think About "Developer Marketing" at SigNoz

“Developers hate marketing.” Do they, really? I often hear this thrown around on podcasts about DevTools marketing, and while it’s true that developers don’t respond to the same old marketing tactics, they do respond to genuine communication. The reason developers are hard to “market” to is that they are also the builders of the stuff you want to sell.

Netdata Now Troubleshoots Your Alerts for You

The 2 AM pager alert. For anyone in Ops, SRE, or IT administration, those words trigger a familiar sense of dread. An alert has fired. Is it a real fire, or another false alarm waking you from a dead sleep? The pressure is on. Every minute of downtime costs money and reputation, but troubleshooting a complex system when you’re sleep-deprived is a Herculean task.

Incident Commander Role: Responsibilities and Best Practices

When a critical system goes down at 3 AM, the difference between a quick resolution and hours of costly downtime often comes down to one role: the incident commander. This person serves as the central coordinator during IT incidents, making crucial decisions that can save thousands of dollars per minute.

Applying AI/ML in Observability - Tech Talk #7

Ready to master anomaly detection? Join us for Part 2 of our "Applying AI/ML in Observability" series, where we do a deep dive into vmanomaly! In this live stream, Mathis and Marc will be joined by a very special guest: Fred Navruzov, the lead developer and mastermind behind VictoriaMetrics' vmanomaly. If you want to move beyond the basics and unlock the full potential of AI-driven observability, this is a session you can't afford to miss.

Automated Seer in Under 2 Minutes

What if you had 5 errors, and instead of coming back to 5 issues in your feed, you got 5 pull requests fixing them? Seer is Sentry's new AI debugging agent. It's able to stitch together all the context from your logs, stack traces, distributed tracing, codebase, and issues and figure out what broke, where, and how to fix it. Seer automation lets you automate that flow - and end up with a nice PR waiting for you to merge if it looks good. Check it out!

Explore the NiCE MariaDB Management Pack in Action 2025Q3

If you’re running critical MariaDB workloads and need reliable, performance-focused monitoring, this session is for you. You’ll get a live walkthrough of the Management Pack, learn how it integrates seamlessly into SCOM, and explore real-world use cases to improve your database monitoring strategy.

Selector MCP and the Future of Modular Automation

In the first two parts of this series, we explored why modern network operations demand intelligent automation and how AI agents can reason, act, and collaborate to solve complex problems. We examined the frameworks – such as ReAct, LangGraph, and Pydantic – that power these agents, and how the Model Context Protocol (MCP) facilitates seamless integration with tools and services. But theory alone doesn’t improve network uptime or reduce manual toil.

What Are Packet Bursts: Causes, Fixes & How to Find Them

Have you ever been in the middle of an important video call, only for it to glitch or freeze out of nowhere? Or has an application suddenly slowed down right when you needed it most? These frustrating moments can often be caused by something hidden in the background: packet bursts. But what exactly are packet bursts, and why do these sudden surges in data traffic catch you off guard when your network seems steady? Are they just random spikes in the data flow, or is there something deeper causing them?

SLF4J and Log4j - Understanding the Differences

Good logging isn’t optional when building Java applications—it’s critical. Logs are often the first place we turn to when something breaks and are essential for performance tuning, security audits, and long-term maintainability. Two names come up in the Java logging conversation: Simple Logging Facade for Java (SLF4J) and Log for Java (Log4j). They sound similar and often work together, but they serve distinct roles.

Librato on Heroku is Going Away and Hosted Graphite Is the Better Next Step

Librato (a SolarWinds product) is being sunset in the summer of 2025, and that directly affects Heroku teams who’ve relied on the Librato add-on for “good enough” visibility into dynos, routers, and Postgres. If you’re in that group, you’ll need a replacement monitoring add-on that keeps you covered on Heroku and lets you grow beyond it without re-architecting how you ship metrics.

Jaeger Monitoring: Essential Metrics and Alerting for Production Tracing Systems

Your Jaeger setup is running. Traces are coming in, and the UI is helping you spot slow services or debug broken flows. But just like any part of your observability stack, Jaeger needs some basic monitoring to stay reliable. If the collector starts queueing spans or the agent runs out of buffer, it can lead to dropped traces, sometimes without any obvious sign in the UI. This blog focuses on the operational side of Jaeger.

Securing the Invisible: Why Ambient AI Needs Next-Gen Security

If, like me, you’re continuously striving to keep pace with the ever-evolving world of artificial intelligence, you’re probably hearing a lot about how Ambient AI is poised to dominate discussions and developments throughout the second half of 2025. Ambient AI refers to artificial intelligence systems that operate unobtrusively in the background of our daily environments, constantly sensing, analyzing, and responding to various inputs without explicit human interaction.