
Notes from the Field: Understanding "Lost connection" LAS activations in Citrix Virtual Apps and Desktops

With the transition from file-based licensing to the License Activation Service now complete, many Citrix administrators are spending more time in the Citrix Cloud licensing portal. As organizations continue to operate and troubleshoot LAS-based Citrix Virtual Apps and Desktops environments, it becomes increasingly important to understand what the licensing dashboard is actually showing.

Beyond polling: Why enterprises are exploring network telemetry

Polling has been the go-to approach for network monitoring for years, and it still plays an important role in keeping networks healthy. But as networks become more distributed, application-driven, and data-intensive, simply polling devices more often isn't always the most efficient way to gain deeper operational insights. That's where network telemetry comes in.

What's New in InfluxDB 3 Explorer 1.9: Flux-to-SQL Conversion, InfluxQL Support, and More

InfluxDB 3 Explorer 1.9 makes it easier to work with your existing queries. Whether you’re migrating Flux queries to SQL or you’ve been writing in InfluxQL for years, this release helps bring your existing queries forward instead of starting from scratch. For teams moving to v3 from earlier versions of InfluxDB, query migration is often one of the last major hurdles.

The Journey to Achieving Hyperscale Availability with AI-Driven Prediction

At hyperscale, a regional cloud outage is not merely a technical disruption—for Samsung Account, which serves 2.1 billion users across three global regions, it is an immediate global service crisis. Fragmented, region-siloed monitoring creates blind spots that make early detection nearly impossible, leaving SRE teams perpetually reactive rather than predictive. The path to proactive reliability requires both a philosophical shift and a foundational change in how observability data is collected, unified, and reasoned over.

Debug and evaluate your AI app from your coding agent with Datadog Agent Observability

Coding agents like Claude Code, Cursor, and Codex CLI handle the coding parts of building an AI application well. The harder work comes after: understanding why a response went wrong, building eval sets that reflect real production behavior, and keeping up with an application that changes faster than any one-off script can. Teams spend 60–80% of their time on evaluation and error analysis, and much of that work needs to be redone every time the stack shifts.

5 pitfalls to avoid when measuring DevEx in the AI era

Developer experience, commonly known as DevEx, describes how an organization’s systems, workflows, tools, and culture affect developer productivity. A positive DevEx leads to tangible organizational benefits, including faster releases, increased innovation, and reduced technical debt. Measuring DevEx enables engineering management to quantify their team’s impact and understand where to direct improvement efforts.

Datadog acquires Adaptive ML

Off-the-shelf models are easy to deploy, but they are rarely enough to solve complex, domain-specific challenges in production. The key to sustained AI value is not in the models themselves but in the ability to tune, evaluate, and refine those models against your organization’s real-time signals. We are excited to announce that Adaptive ML is joining Datadog to accelerate this vision by combining our deep observability data with their expertise in building specialized, high-performance AI agents.

Introducing Atatus MCP Server: Connect AI Agents to Your Observability Data

AI coding assistants like Claude, Cursor, Codex, and GitHub Copilot have become standard tools in the modern engineering workflow. Developers use them to write code, generate tests, and review pull requests. But when something breaks in production, these assistants hit a wall: they have no access to your actual system state. They can reason about logs, traces, and metrics. They just can't see yours.

New in Skylar One - Kyoto: Helping IT and Business Teams Focus on What Matters Most

When technology works, businesses thrive. Employees stay productive, customers stay connected, and critical services keep running. But when something goes wrong, the real challenge is not only detecting the issue. It is understanding what it affects, who may feel the impact, and how urgently the business needs to respond. That is the value behind the Kyoto release. The latest Skylar One update helps teams better connect IT health to business impact.

Full-stack observability in Grafana Cloud: How to investigate issues across services and infrastructure

Many times, the hardest part of troubleshooting isn’t fixing the actual problem. It’s figuring out where to start. As engineers, it’s easy to lose count of how many times we’ve opened logs, then 10 metrics tabs, and another 10 tabs with trace queries, only to end up back in the logs trying to find a root cause.

6 Ways to Use the Hyperping MCP Server

When something goes down, the last thing you want is to alt-tab between a monitoring dashboard, your on-call tool, and three Slack threads to figure out what is happening and who owns it. That context is usually all there. It is just scattered. The Hyperping MCP server fixes that by putting your monitoring data inside the AI tools you already work in. Your agent can read monitor state, outage timelines, SLAs, and on-call schedules, and answer the questions you would normally chase across tabs.

What Customers Are Doing With AI and Honeycomb

At O11yCon, we talked to engineering teams across the industry, and the numbers are starting to get genuinely wild: Mixpanel DevOps Engineer Eddie Bracho told us their engineering team is generating 50% more PRs than before AI came into the mix (sorry). That kind of velocity is exciting, but it's also a pressure test for every part of your stack that isn't writing code, including your observability practice. Here's what we're hearing from customers about how that's playing out.

Difference Between Elasticity and Scalability in Cloud Computing

In cloud computing, teams use elasticity and scalability as if they mean the same thing. In reality, the two describe different ways a system handles load, and they solve different problems. Mixing them up can be very expensive. You either pay for capacity that sits idle, or your app buckles the moment traffic spikes, and the bill and the incident report both feel it.

Coralogix vs New Relic: Comparison Guide (2026)

Coralogix and New Relic both cover the full observability surface, but they charge for it and store it in different ways. One prices purely on data ingested and writes telemetry to a bucket you own, while the other combines ingest pricing with per-user licensing and retains data in its own backend. This guide covers how the two platforms compare on core features, pricing structure, AI observability, archiving and retention, security coverage, and support, then shows when each one is the stronger choice.

Coralogix vs Sumo Logic: Support, Pricing, Features & More

Coralogix and Sumo Logic are two different answers to the same observability platform decision. Where Coralogix processes telemetry in flight, stores it in your own Amazon Simple Storage Service (S3) bucket, and prices on data ingested, Sumo Logic keeps data in vendor-managed storage and, under its Flex model, bills for data scanned at query time. Both platforms have introduced pricing and artificial intelligence (AI) changes in the past year, and those changes have widened the difference between them.

The hard part of AI root cause analysis is no longer the model

Every few weeks someone tells me root cause analysis is a solved problem now: pipe your telemetry into an LLM, let it tell you what broke. I wish it were that easy. After years on this, I think "can AI do RCA?" is the wrong question, because doing RCA with an LLM is really two separate jobs, and the answer is different for each. They break in completely different ways, so it's worth pulling them apart.

New Feature: Automatic Snapshots When Latency Spikes

We’ve released an exciting new Lightrun capability: set a duration threshold on your Tic & Toc or Method Duration metrics, and Lightrun will automatically capture a snapshot whenever execution exceeds it. It takes moments to configure, and gives engineers the runtime context they need to understand why unexpected slow executions are occurring.

Connecting Ticketing Systems to Microsoft SCOM

Microsoft SCOM (System Center Operations Manager) remains a widely used enterprise monitoring platform due to its deep integration with Windows, hybrid-cloud support, and extensible management packs. However, the value of SCOM is fully realized only when its alerts seamlessly flow into ITSM or ticketing systems. This ensures incidents are created, routed, and resolved efficiently.
Sponsored Post

Avantra 26: A Breath of Fresh Multi-Tenant AIR

There's a crackle and spark in the air at Avantra lately, and I'm so pleased to be writing this bit on what we've accomplished with the Avantra 26 release. Automated root cause analysis, multi-tenant management support for Cloud ALM, enhanced security operations and financial operations monitoring BTP - it's all there, and more. It's an exciting and innovative release for Avantra!

Configuration drift in enterprise networks: Causes, impact, and management

Network admins want all devices with the same role to behave the same way. But in real environments, that consistency rarely lasts. Imagine two core switches in the same data center. They serve the same function and run the same OS version. One handles traffic without issue, while the other drops packets during peak hours. Logs show nothing obvious. Routing looks correct. The team spends hours checking links, hardware, and traffic paths.

The Frictionless Workplace Isn't What You Think It Is: Beyond the Ticket

For many EUC and digital workplace leaders, the challenge isn't a lack of technology. It's understanding why workplace issues continue to surface despite years of investment in automation, AI, and digital transformation. Support teams are still dealing with high ticket volumes. Rollouts intended to improve employee experience can create new sources of disruption, and IT often struggles to understand what employees are experiencing until problems escalate into complaints, incidents, or support requests.

Unleashing Enterprise Agility: The Power of Portfolio Kanban Flow States

In the world of enterprise Agile, we face a persistent paradox: How do we empower individual teams to establish their own unique processes, while ensuring leadership maintains a clear, consistent view of the entire organization’s progress? For a long time, the answer was a compromise.

Your AI isn't underperforming. Your data foundation is.

New research reveals why Australian businesses are entering the new financial year with bigger AI budgets and the same unsolved problem. One in three Australian businesses exceeded their AI budget last year, yet half of them plan to increase AI spending again this year. Meanwhile, the behaviour that caused those budget overruns remains largely unaddressed.

Logz.io Webinar Recap: A Four-Step Blueprint for Faster Root Cause Analysis

Incident investigations take so long not because the fix is hard, but because finding the right fix is. Most engineers spend 20 to 60 minutes just understanding what’s wrong before they can act, not fixing anything, just trying to see the full picture. The framework that changes this has four steps: Orient, Isolate, Hypothesize, and Verify, and the order matters more than the tools.

When World Cup Traffic Spikes in Mexico, Can You See Where the Internet Breaks?

The World Cup is already proving how quickly digital demand can concentrate across Mexico’s networks, making internet path visibility critical for teams responsible for reliable user experiences. The 2026 FIFA World Cup is already testing Mexico’s networks. Mexico’s June 11 opening match against South Africa drew 7.1 million viewers for an English-language U.S. broadcast and peaked at 9.1 million viewers. That kind of demand puts real pressure on the systems behind digital experiences.

Sentry + GitHub Copilot Agents

Seer, Sentry's AI debugger, analyzes your issues and finds the root cause. Now you can pass that analysis directly to a GitHub Copilot agent, which picks up the context, generates a fix, and opens a pull request. The agent session and PR both live on GitHub, with a link back in Sentry for easy access. This video walks through how the integration works and how to set it up in just a couple of steps.

Next.js already traces your requests. Here's how to export them with OpenTelemetry.

Traces are a goldmine of information that can help you, or your AI, find slow pages and fix them. Next.js comes out of the box with support for tracing. Incoming requests, fetch() calls, middleware, and server-side rendering are all wired up and ready to send traces to any OpenTelemetry-compatible backend. The catch is, unless you configure an exporter, you’ll never see those traces.

What Is Agentic Observability? The Complete Guide for Enterprise Engineering Teams

TL;DR Agentic observability uses AI agents to autonomously investigate incidents, identify root causes, and take action in production environments. Unlike traditional monitoring (which alerts and waits) or AIOps (which assists human analysis), agentic platforms conduct the investigation themselves. Key capabilities include autonomous incident triage, evidence-backed root cause analysis, alert noise reduction, and governed remediation.

Instrumenting AI Agents for the Agent Timeline: A Practical OpenTelemetry Guide

AI agents are nondeterministic, multi-step, and opaque. When one fails in production, "the model said something weird" is the cheapest, most useless line in your incident postmortem. To debug agents the way they actually run, you need telemetry that captures all of it, in order, with enough context to reconstruct what happened. The OpenTelemetry GenAI Semantic Conventions give you a vendor-neutral way to do exactly that.

What's New in Scout Monitoring: June 2026

June was about finishing touches. The fun part. Node.js support, which we previewed in May, is live. Anomaly detection graduated with a rebuilt algorithm, per-monitor controls, and access from the API, CLI, and MCP server. We also kept pulling on the same thread from recent months: Scout data should be reachable from wherever you actually work. The MCP server now covers historical insights, anomaly events, and 30-day metrics. Discord is a notification channel. The CLI has scout anomalies.

Why Observability Isn't Enough for AI Coding Agents

Observability platforms collect pre-instrumented logs, metrics, and distributed traces to monitor production systems and surface failures to human engineers. The adoption of AI into engineering has led observability providers to offer those same signals to agents. This is often packaged as AI observability, but the signals themselves were designed around a human investigation loop. AI coding agents work faster, consume data differently, and need feedback as they work rather than after deployment.

What is Network Monitoring? A Guide for IT Teams

Over 90% of mid-sized and large companies estimate that a single hour of downtime now costs more than $300,000. The clock starts the moment something breaks, whether anyone has noticed it or not. And most outages don't start with alarms. They begin with a small issue inside the network: an overloaded switch, a saturated link, or an unstable interface. Left unnoticed, those small issues grow into user complaints, stalled work, lost revenue, and damaged customer trust.

Teach Your AI Coding Agent to Answer Production Questions | Lightrun Ask Prod AI Skill

Lightrun's Gidi Freud demonstrates Ask Prod, the latest Lightrun AI Skill that teaches AI coding agents how to use Lightrun to answer production questions with live runtime evidence. Watch Codex use the skill to discover runtime sources, collect focused runtime data, adapt its investigation, and return an evidence-backed answer. Compatible with Claude Code, Cursor, GitHub Copilot, and other AI coding agents through the Lightrun MCP.

How to Prevent SEO Issues During Website Migrations

Website migrations are often necessary as businesses grow, modernize their platforms, or rebrand. Whether you're changing domains, redesigning your website, switching content management systems, or moving to a new hosting environment, a migration can improve performance and user experience. However, without proper planning, it can also lead to a significant loss in search engine visibility, organic traffic, and revenue.

Cloud Cost Optimization: 20 Strategies for Enterprises

Cloud cost optimization has become a critical priority in 2026. What starts as a manageable $5,000 monthly cloud bill can quickly grow to $50,000 within a few quarters, often without any major change in workload. If you lead an engineering or infrastructure team, this probably sounds familiar. You may have already seen costs rise faster than expected or struggled to explain sudden spikes in cloud spend. The challenge today goes beyond just rising numbers.

Runtime Aware PR Review: Validate Changes in Live Production

Runtime PR review means validating a code change against live variable state, real execution paths, and downstream service behavior before the merge decision. Not after a checkout regression exposes what the diff missed. As AI coding agents ship PRs faster than any reviewer can mentally simulate execution, static analysis and CI leave a structural gap that only runtime evidence can close. This article explains what that gap looks like, why it recurs, and how to close it with runtime context code review.

Why Is Root Cause Analysis So Hard for IT Teams to Get Right?

In this video, learn what Root Cause Analysis (RCA) is and why it's essential for preventing recurring IT incidents instead of repeatedly fixing the same symptoms. Discover how effective RCA helps IT teams identify the real source of problems, reduce downtime, and improve operational resilience.

From Legacy to AI-Ops: Securing and Scaling Systems for 20M Device Requests with Datadog

Modernizing a legacy system serving 20 million devices without users noticing is like replacing a jet engine mid-flight. In this session, YoungJin Jung and Donggen Hong from LG U+ share their 18-month journey transforming a Telco-scale API Gateway from a rigid, proprietary solution into a high-performance, open-source architecture on AWS, and the operational challenges they solved along the way.

Ship Reliable AI Faster: How to Operate AI Agents with Control and Confidence

Replace "AI shipped on hope" with an operating model that holds up once real users depend on it. AI quality is multi-dimensional, covering accuracy, tone, safety, and faithfulness to user data, and can't be debugged from outputs alone. Without visibility into what their AI actually did in production, teams miss regressions, reverse-engineer chains by hand, and watch a single bad answer erode trust built over hundreds of right ones.

Reduce CDN log costs with searchable archives

Engineering teams that manage high-volume log sources, such as content delivery network (CDN) edges, streaming platforms, and authentication systems, often have to make a difficult retention tradeoff. Indexing every event keeps logs searchable during investigations, audits, and postmortems, but it can make long-term retention expensive.

The AI Engineering Playbook: How to Evaluate & Iterate at Every Phase of Development

AI coding tools are accelerating development velocity, creating a release challenge most teams aren’t equipped for. Without controlled rollout, higher change velocity makes it harder to know which specific release drove the results you’re seeing in production. And when teams use AI to build AI (LLM apps and AI agents), complexity multiplies. Traditional observability can’t ensure AI agent quality, performance, and cost-efficiency at production scale.

Sanctioned Isn't Secured: The AI Audit Logs Your SIEM Never Sees

Your organization has approved AI platforms for development, data science, and productivity. Procurement signed off. Legal reviewed the terms. Employees are using them. The tools are sanctioned. What isn’t sanctioned is invisibility. The administrative layer of every AI platform in your environment — OpenAI, Amazon Bedrock, Google Gemini, Cursor, Databricks, Glean and others — generates security-relevant events that your SIEM has never seen.

Introducing the StatusGator Confluence integration

We’re excited to announce the new StatusGator Confluence integration. When issues happen, teams need information fast. With the StatusGator Confluence integration, you can embed real-time service status directly into Confluence, making operational updates accessible alongside your team’s documentation and knowledge base.

Getting started with Microsoft Defender dashboards

Microsoft Defender does a great job protecting you and your organization from online threats. It is constantly working to detect and collect security data so you don’t have to worry about falling behind on incidents and vulnerabilities. The Defender portal can also provide great insights into that data, but connecting it to the rest of your stack is difficult.

How we saved over $3 million in idle compute costs with Datadog Kubernetes Autoscaling

At Datadog, our broad Kubernetes footprint amplifies the significance of a familiar autoscaling tradeoff: Overprovisioning wastes cloud spend, while underprovisioning threatens reliability. We built Datadog Kubernetes Autoscaling (DKA) to help teams rightsize their workloads by generating intelligent resource recommendations and automating multidimensional workload scaling. Across Datadog, adopting DKA has eliminated more than $3 million in annualized idle compute costs while reducing reliability risks.

Where did all my Claude Code tokens go?

Most teams judge their AI coding agent on two things: the monthly bill and a feeling. The bill tells you what you spent and the feeling tells you whether it seems to be helping, but neither one tells you what the agent actually did. As these tools move into the critical path of how software ships, that gap is starting to matter. I wanted to replace the feeling with something I could measure and understand what shapes of work affect this bill, so I decided to run an experiment on myself.

Designing the Operational Architecture for Continuous SLA Exposure Governance

Organizations seeking to reduce SLA volatility often attempt incremental enhancements to existing monitoring stacks. While additional analytics layers may improve telemetry visibility, exposure governance cannot function effectively when data, service context, and execution capabilities remain fragmented. Treating exposure management as an add-on capability limits its ability to protect across interdependent systems in real time.

The End of Self-Service IT as We Know It

The modern service desk is not short on entry points. In fact, employees can open a portal, search a knowledge base, start a chatbot conversation, or submit a ticket from almost anywhere. In theory, that should mean fewer queues and faster resolution. But if access to IT has improved so dramatically, why has the operational burden behind each interaction barely moved?

What is AIOps? Benefits, Use Cases, and How It Transforms IT Operations

Decades ago, IT operations was relatively simple, with a few components such as clients, servers, and networks running in static environments. IT teams relied on manual analysis to manage these systems. Over time, however, IT operations has evolved significantly, driving the adoption of AIOps technologies.

Full Stack Observability vs Monitoring: Key Differences

Traditional monitoring tracks system health by collecting data such as metrics and logs. This data is checked to see whether a system is behaving as expected, and alerts are raised if errors or anomalous values are found. This works well in stable, predictable environments, but modern IT systems are far more complex and dynamic. In distributed architectures like microservices and cloud-native platforms, predefined alerts usually aren’t enough to explain why a failure is happening.

What's New in Network Observability for Summer 2026

As a network engineer, you likely face two persistent operational challenges every day: manually tracking device lifecycles on spreadsheets, and spending your scheduled maintenance periods troubleshooting software upgrades. Both cost you the time you need to proactively ensure network performance. Over the past six months, we have continued to enhance Network Observability by Broadcom. These latest enhancements directly address the operational challenges outlined above.

Chart Your Team's Analytics Journey with Customizable Dashboards in DX NetOps

DX NetOps now features customizable dashboards that give all users some important new features and capabilities. In addition, with the solution’s new integration capabilities, DX NetOps enables users of current analytics and reporting tools to add standardized dashboards over time.

Overview of AI Evaluation (The Context Window #05)

Can you actually trust an AI agent? In this pre-recorded episode of The Context Window, Nicole van der Hoeven sits down with Yas Ekinci, an engineer on the Grafana AI team, to talk about evals — how Grafana measures the quality and reliability of the AI it ships. They get into the difference between online and offline evals, why reviewing AI-generated code has become the real bottleneck, the "final answer problem" of plausible-but-wrong outputs, and o11y-bench, Grafana's open benchmark for observability agents.

Help Desk or Service Desk: Which Does Your Business Need?

In this video, learn the key differences between a Help Desk and a Service Desk and why choosing the right approach can significantly impact the growth and efficiency of your IT support operations. Discover when a help desk is enough, when a service desk becomes essential, and how modern IT teams can scale support effectively.

High Cardinality in ClickHouse at Scale: What Actually Breaks

ClickHouse swallows high-cardinality telemetry at ingest, then breaks at query time weeks later. Here is what fails, and how we keep it fast in production. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Achieving sovereign and secure AIOps with Ollama and OpManager

Enterprise IT networks power business operations across the world. As businesses scale to catch up with an increasingly demanding user base, networks also grow more complex. IT teams managing these networks have to monitor more data than before, under more stringent SLA terms, with little room for failure. Trying to do this manually across thousands of devices can take a lot of time and effort, and is prone to errors.

Replacing Your Legacy Monitoring Platform? Start with a Plan.

Whether you're using SolarWinds, PRTG, Datadog, or another long-standing monitoring solution, chances are your environment has evolved significantly since the platform was first deployed. New applications have been added. Infrastructure has expanded into cloud environments. Teams have developed custom dashboards, reports, alerts, and workflows. Over time, monitoring becomes deeply woven into daily operations. That's why many organizations continue using tools that no longer meet their needs.

June 24 Global Shopify outage: Timeline and impact

On June 24, 2026, Shopify experienced a widespread service disruption that affected storefronts, admin dashboards, and merchant access across multiple regions. While the outage did not impact every user, reports quickly surfaced from merchants around the world who were unable to access stores, log in to administrative tools, or complete routine operations.

Monitor metrics now available in the v3 API

Monitor metrics are now available through the StatusGator v3 API for both Website Monitors and Ping Monitors. These endpoints provide the same latency and performance data available in the Monitor Metrics tab, making it accessible through the API and MCP server. You can find the endpoints in the API documentation.

The Four Pillars of AI Observability in 90 Seconds

AI applications can behave unpredictably, potentially leading to errors such as hallucinations or data leaks, even when classic monitoring indicates a successful response. To effectively monitor AI systems, four key areas should be focused on. Implementing these pillars can enhance trust in AI deployments, help manage costs, and identify safety issues before they impact users.

How Grafana Cloud Ingests Your Data | Data Sources, Alloy & OTel Explained

Learn the two main ways to get data into Grafana Cloud. In this video, we break down how Grafana Cloud connects to over 150 external data sources (like Salesforce, Postgres, and CloudWatch) where your data stays in place, and how you can send raw telemetry into Grafana’s fully managed databases for logs, metrics, traces, and profiles.

How Git Worktrees Changed My Development Workflow

Since I started using Claude Code more frequently, I kept noticing a “worktree” checkbox popping up whenever I started a session in a Git repository. I had no idea what it meant, so I did what any curious developer would do and started digging. What I found was a Git feature I somehow never came across before: git worktrees.

Network Monitoring, the Netdata Way: Topology, NetFlow, SNMP, and Traps

Interface counters tell you a port is busy. Bytes in, bytes out, errors, drops. That’s enough to know a link is saturated, but not enough to know which conversations are saturating it, which devices are involved, or how a problem propagates across your network. For that you’ve traditionally needed dedicated network performance monitoring tools, usually expensive, usually a separate console from the rest of your monitoring.

Telegraf Enterprise Now Generally Available: Manage Telegraf Fleets at Scale

Telegraf Enterprise is now generally available. It combines Telegraf Controller, a centralized management console for Telegraf, with official support from InfluxData. Open source Telegraf remains unchanged. Telegraf Controller is free to start with built-in limits, while a Telegraf Enterprise license unlocks higher-scale limits, audit logging, LDAP/OIDC integration, and commercial support. Telegraf has become the standard for collecting telemetry across cloud, edge, and physical infrastructure.

10 Best ITSM Tools in 2026 [Reviewed and Compared]

How do you choose the best ITSM tool for your team when 20 vendors all promise the same three things: native AI, ITIL alignment, and a single system to run your whole IT operation? It is the question we hear most from IT managers and service desk leads, and it is a fair one, because the cost of getting it wrong is high. An ITSM platform is a multi-year commitment that your team works inside every day, so a poor fit shows up fast as slow tickets, manual workarounds, and a migration nobody wants to repeat.

Grafana 13.1 release: observability as code updates, extending Grafana Assistant across more data sources, and more

Earlier this year, Grafana 13 laid the groundwork for making it easier and faster than ever to turn your data into actionable insights. With our latest minor release, Grafana 13.1, we're building on that foundation, expanding observability as code, bringing Grafana Assistant to more data sources, and streamlining the everyday workflows teams rely on to visualize, analyze, and act on their data. Below are just some of the highlights from Grafana 13.1.

How Does SNMP Keep an Eye on Every Device on Your Network?

In this video, learn what SNMP (Simple Network Management Protocol) is and why it remains one of the most important technologies for network monitoring. Discover how SNMP helps IT teams collect device health metrics, receive real-time alerts, and monitor thousands of network devices from a single platform.

How to migrate feature flags without breaking production

Feature flag migrations have a reputation problem. Ask anybody who’s been through one before and you’ll hear the stories, usually from someone still a little frustrated about a bad cutover, with a postmortem or two to show for it. The reputation is mostly undeserved. While the risks are real, they’re well understood and easily controlled. Getting a migration right doesn’t require a big coordinated effort.

POPIA Compliance: What It Requires and How Motadata Supports It

If your organization handles the personal information of people in South Africa, POPIA compliance is not optional. The Protection of Personal Information Act has been fully enforceable since 1 July 2021, and the Information Regulator now backs it with administrative fines of up to ZAR 10 million. The requirement your IT and security teams own most directly is security safeguards under Section 19, and it is the first place a regulator looks after a breach.

Observability on Windows, before eBPF is production-ready

No large enterprise runs a single stack. A shiny new Kubernetes cluster sits right next to a Windows Server box that has quietly run the billing system for a decade without missing a beat. Both keep the business running. Both deserve the same visibility. Linux runs most server workloads, and Coroot grew up there. Our open-source node-agent uses eBPF to collect metrics, logs, traces, and profiles, with no code changes. But "most" is not "all".

How High-Performance IT Organizations Prevent SLA Exposure Before It Becomes a Customer Disruption

Over the past decade, significant progress has been made in incident detection and response across enterprise IT environments. Observability platforms, event correlation engines, and AIOps capabilities have measurably reduced mean time to detection and mean time to resolution. Operational teams are better equipped to identify anomalies, triage alerts, and coordinate remediation across increasingly complex architectures.

How network change management could've prevented a costly switch misconfiguration

Unplanned outages often trace back to a simple but overlooked cause: an untracked configuration change. In many organizations, network device configurations are updated manually without approvals, documentation, or rollback plans. This lack of structure can lead to performance issues, downtime, and compliance risks. In this blog, we'll see how a core switch misconfiguration exposed the risks of unmanaged changes.

Who's in Charge? The 4 Key Pillars of AI Governance in 2026

You hire an astute, hard-working, fresh graduate to run things for you. You hand them the keys to everything in your company; that includes every system, every endpoint, every file, and every password, all of it. Your only instruction to them? "Go ahead and improve things!" Then, trusting in their competence, you leave them to it. Doesn't that sound like a recipe for disaster? Yet that's precisely what's happening in IT departments across the world.

Who's Driving Your Data? How to Regain Control of Your Apache Kafka Infrastructure

Apache Kafka often succeeds faster than operational maturity can keep pace. Consumer lag, partition drift, and configuration sprawl create dangerous blind spots. Learn how unified visibility, governance, and automation transform reactive Kafka operations into predictive control.

How AI Is Transforming Production Issue Investigation for Modern DevOps Teams?

Production failures don't announce themselves cleanly. They arrive at 2 AM, buried inside 40 million log lines, spread across a dozen microservices, and disguised as something that looks entirely unrelated to the actual root cause. For years, engineering teams absorbed this pain through process: runbooks, on-call rotations, dashboards, and a deep institutional knowledge that lived in the heads of their most senior engineers.

Platform Confidence Is the Prerequisite for Modernization Speed

Over the last year, one theme has consistently emerged in conversations with customers: organizations want to move faster, but not at the cost of the operational stability their business depends on. Whether the discussion is about modernization initiatives, automation programs, AI adoption, or platform upgrades, the underlying challenge is often the same. IT leaders are under pressure to deliver innovation while maintaining stability.

What is Compliance in ITAM? Regulations, Penalties & Best Practices

Managing IT assets smoothly is not an easy task. Organizations depend more on technology to execute their operations these days. Hence, the requirement for effective IT Asset Management (ITAM) has grown considerably. However, beyond merely managing these assets, ensuring compliance with relevant ITAM regulations and standards matters just as much. And, in this race to keep up with changing regulations, you are not alone. Many organizations face the same challenge.

How Coding Agents are Changing the Traditional Software Development Lifecycle

AI coding assistants are rapidly evolving from passive copilots into active, agentic collaborators capable of planning, executing, and iterating on complex software tasks. This shift has huge ramifications for the software development lifecycle (SDLC), developer productivity, and even the structure of engineering teams.

Fireside Chat with Datadog CPO Yanbing Li and Vercel CPO Tom Occhino

The way we build, ship, and run software is being reshaped by AI. In this fireside chat, Yanbing Li (CPO, Datadog) and Tom Occhino (CPO, Vercel) will discuss their perspectives on the impact AI is having across the industry and what it means for teams navigating this shift today.

Progressing AI Beyond Scaling and Into Deep Reasoning

The breakthroughs in AI today aren’t just coming from bigger datasets and more compute; Reinforcement Learning (RL) has quietly become one of the most powerful forces in modern AI development. RL is teaching models to reason and self-correct, enabling capabilities that make AGI feel less like science fiction and more like an inevitable future.

Builder in the loop: Tony Rogers on stress-testing AURA before production

Builder in the loop is a Mezmo interview series focused on the engineers, product leaders, and operators shaping AURA, an open-source, MCP-native agent harness for production operations. This installment features Tony Rogers, whose work on AURA is less about building new features and more about trying to break them before users can.

What Is a CMDB, and Why Is It Called the Heart of ITSM?

In this video, discover why a Configuration Management Database (CMDB) is considered the heart of IT Service Management (ITSM). Learn how a CMDB helps IT teams understand dependencies, assess change impact, accelerate incident resolution, and build a reliable foundation for service management processes.

Using Evaluation Frameworks with Agent Observability

AI teams have invested heavily in evaluation frameworks, yet getting those frameworks beyond local experimentation remains challenging. Teams using open source libraries like DeepEval and Pydantic Evals gain flexibility and research-grounded metrics, but operationalizing those evaluations still requires brittle custom integration code that doesn’t scale.

The AI bill arrived. Now what?

There was a time when “Opus” meant a classical composition and “Sonnet” was fourteen lines of Shakespeare you definitely did not read before the test. Now they’re model tiers, and every new release rewrites the economics of your engineering org whether you’re ready or not. Currently, your monthly total hides the crucial information you need to control and justify AI spend.

Which AI-Powered Observability Tools Accelerate Root Cause Analysis (RCA)?

TL;DR Choosing the right AI-powered observability platform isn’t about who has the most AI features. It’s about which platform helps your team identify root causes faster and spend less time investigating incidents. Here’s the short version: Logz.io + OrionIQ: Autonomous AI agents investigate incidents, perform root cause analysis, and surface next steps. Open standards, Kubernetes-ready, and deploys in as little as a week.

Vendor Outage Monitoring for MSPs: Per-Client Status Pages and Custom Dashboards

Handling client calls when a third-party vendor has an outage - this will sound familiar if you are a managed service provider (MSP). Your first instinct would be to check if the vendor's status page or social media handle shows anything, or check crowdsourced websites like Downdetector. Or even ask your client to check themselves. These approaches do not scale when you have more than a few clients, many vendor status pages to check, and clients with different stacks.

StatusGator is now available in SharePoint

We’re excited to announce the new StatusGator SharePoint integration. Many organizations use SharePoint as the central hub for company resources, communications, and internal tools. Now, you can add real-time service status directly to your SharePoint pages, helping employees stay informed about outages, maintenance, and service disruptions without leaving the platforms they already use every day.

Monitoring vs. observability: The future of IT operations in 2026

For years, monitoring was the gold standard of infrastructure management. Dashboards. Thresholds. Alerts. If everything on the dashboard was green, you didn't need to worry. If something turned red, you responded. It was a model built on predictability, and for a long time, it worked. But modern infrastructure is no longer predictable.

DataStream 2.0: Faster, Smarter, Built for Scale

June 19, 2026 This is not a regular monthly update. DataStream Version 2.0 is a milestone — the result of relentless building, learning from customers, and pushing the platform toward what enterprise-scale security operations actually demand. The core has been rebuilt, new capabilities have been added across the board, and the platform is now faster, more resilient, and more extensible than ever. Here’s what’s new.

How AI-Powered Monitoring is Transforming IT Operations

Every monitoring vendor on the market now has an AI story. AIOps has moved from category buzzword to standard line-item in IT operations strategy, and the reasoning is sound: as infrastructure spreads across cloud, hybrid, microservices, and virtualized platforms, the volume and velocity of operational data has outrun what human teams can process. AI-powered monitoring is the obvious answer.

Digital Employee Experience Monitoring: Why It Matters for Hybrid Workforces

As enterprises embrace hybrid work models, SaaS-driven technology stacks, and highly distributed digital workplaces, employee experience has become inseparable from business performance. For years, IT investments focused on customer-facing digital journeys, and internal systems were not a priority. That has changed. Today, every employee relies on a complex and interdependent chain of endpoints, networks, cloud services, identity platforms, and business applications.

Integrating Digital Employee Experience (DEX) with ServiceNow: What IT Teams Need to Know

As CTO for Teneo, I get the opportunity to meet with many of our customers to talk about plans for the next few years. I often find we spend a lot of time talking about Digital Employee Experience, but far less time is spent fixing the operational friction that quietly erodes it. Slow devices, degraded application performance, and recurring service desk tickets are common themes in many organizations.

Why Relational Databases Fail Satellite Telemetry

Satellite operations depend on telemetry as the primary interface to systems that teams cannot directly inspect. Once a spacecraft reaches orbit, signals such as battery levels, temperature, signal strength, and fault codes become the foundation for understanding system health and maintaining control. Telemetry streams continuously, so the underlying data system becomes a critical control point that needs to handle a constant, heavy flow of data.

Introducing the New Galileo Website: A Better Resource for IT Visibility, Optimization, and Planning

That's why we've launched a completely redesigned Galileo website. The new site isn't just a fresh look but rather a reflection of our commitment to helping IT teams gain the visibility, insight, and guidance they need to manage modern infrastructure more effectively.

Working as a remote engineer at Cribl | Building the AI Platform for Telemetry

Learn what it’s like to work as an engineer at Cribl, a remote-first company building the AI platform for IT and security data. In this recruiting video, Cribl’s engineering and support leaders share how fully distributed teams collaborate, solve hard data problems, and grow their careers while working from around the world. You’ll hear from managers and leaders in site reliability engineering, security incubation, and technical support about.

KWhy? MSP Webinar

Most MSPs are sitting on a goldmine of data across their tools. The problem isn’t access, it’s knowing what *actually* matters… and how to use it to drive better outcomes. Join Amanda Doucette-Lachapelle and Kyle Christensen (Empath) as they walk through how to use KPIs to make smarter, more confident decisions, with real examples you can apply right away.

Troubleshooting ActiveMQ Producer Flow Control Blocks

The alert comes in at 2 AM: your order processing service is unresponsive. The application is not crashed, threads are running, the JVM is healthy, but no messages are being sent. Your operations team traces it to a blocked send() call on an ActiveMQ connection. Hours later, after restarting the application, someone finds this line in the broker log from 11 PM the previous day.

The Second Edition of Observability Engineering Is Here

IT’S HERE it’s here it’s here it’s here!!!! The second edition of Observability Engineering is available for download, and since Honeycomb is the sponsor, you can now download it from our website (the dead tree version will take another month). This is a strange time to be writing a book.

Agent Timeline Is Now Generally Available

A few weeks ago I wrote about a customer’s refund request that stopped halfway through at 11:47 p.m. on a Tuesday night. That post walked through the 40 minutes it took to work out what happened when an agentic application had a problem: a tool retried against a rate-limited payments API, the error responses filled up the context window, and the agent gave up. The whole reason we built Agent Timeline was to turn that 40 minutes into five. To reduce MTTR. To solve the problem and get back to sleep.

Service Level Agreement (SLA) Templates: Examples, Metrics, and Best Practices

How quickly should your team resolve a critical ticket, and what are the consequences when it misses the target? That is exactly where Service Level Agreements (SLAs) come into play. An SLA turns service expectations into measurable commitments by defining clear response and resolution targets. Rather than starting from scratch, an SLA template provides a structured foundation for establishing those commitments and tracking performance against agreed standards. Why does that matter?

The Data Plane Reality: OTel Scales, While Topology UX Lags

OpenTelemetry won the architectural standards battle. At scale, though, telemetry breaks more like plumbing than code. It breaks quietly, across a graph, with a blast radius you don’t understand until it’s expensive. With over 65% of organizations now running more than 10 collectors in production, hybrid deployments across Kubernetes and VMs are accelerating fast. Telemetry standardization is no longer a project milestone. It is a baseline expectation.

5 Alternatives to Prometheus in 2026

Prometheus is a battle-tested, flexible and, most importantly, free tool that has long been the go-to open-source monitoring solution. Much of its popularity came down to its simplicity. A few years have gone by, though, and the APM space has gotten pretty crowded. Developers are now starting to move away from the complexity of self-hosting, and OpenTelemetry stands out as one of the CNCF’s fastest-expanding projects. In fact, it’s now among the most adopted telemetry frameworks out there.

The Illusion of Control: Why Dashboards Do Not Equal SLA Protection

Modern operations teams work within a constant stream of dashboards, status summaries, and health indicators that turn complex environments into organized visual displays. Large screens show color-coded service conditions. Executive reports quantify uptime. Observability platforms map system dependencies across cloud, hybrid, and distributed architectures. This visual structure creates a sense of order. In environments defined by constant change, that sense of order can feel like control.

Observability for a Privacy-first AI Wearable | Grafana Everywhere

Trust is everything when AI gets personal. Golden Grot Award winner and NeoSapien co-founder and CEO Dhananjay Yadav shares how his team uses Grafana Assistant to ensure the privacy-first AI wearable delivers a seamless, reliable experience without compromising its mission. Because when AI moves closer to our everyday lives, teams need to know what’s happening — and users need to trust that it’s working as intended.

Tapirs, Trainings, and Team Dinners: My First Kentik Meetup

Gavin joined Kentik’s People Ops team less than a year ago, so when April brought his first team offsite and his first HR conference in San Diego, it was a lot of firsts at once. He writes about meeting his colleagues face to face for the first time, what he took away from HRA 26, and his new appreciation for tapirs.

3 Signs Your Network Monitoring Is Failing You

Are users reporting issues before your monitoring tools do? Are critical alerts getting lost in the noise? Does root cause analysis take hours instead of minutes? These are 3 signs your network monitoring is failing. Discover how modern observability helps teams detect issues faster and resolve them with confidence.

From Alerts to Action: How Agentic AI Will Transform ITOps

What if your IT systems could go beyond detecting issues to resolving them autonomously? This white paper explains how Agentic AI enables IT operations to shift from reactive monitoring to intelligent, self-driven execution. Explore use cases, challenges, and how observability data powers AI-driven actions.

From event correlation to autonomous IT: Why observability isn't enough anymore

Most IT war rooms have plenty of data, but not enough time or clarity to find the real answer. Dashboards are crowded, alerts keep piling up, and the real issue gets lost in all the noise. Ever dealt with this situation? You’re not alone, and there’s a simpler way to deal with it. OpManager Nexus closes this gap by moving beyond visibility to help teams actually diagnose and fix problems faster.

Monitoring website that redirects to a different URL

Is it necessary to monitor a website that redirects to a different URL? Imagine a user visits a URL and is automatically redirected to a new main URL without taking any action. This process is called URL redirection. It typically occurs when a web server sends a 3xx HTTP status code and a location header with the new URL. Sometimes there is only one redirect, but in other cases, the request passes through several URLs before reaching the final page.
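The redirect mechanics described above (a 3xx status plus a Location header, possibly repeated over several hops) can be sketched in a few lines of Python. This is a minimal illustration, not any monitoring product's implementation: `trace_redirects`, the injectable `fetch` callable, and the `FAKE_SITE` table are all hypothetical names, and the sketch assumes the header key is spelled exactly `Location`.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urljoin

# Hypothetical fetch signature: one request, redirects NOT followed,
# returning (status_code, response_headers).
Fetch = Callable[[str], Tuple[int, Dict[str, str]]]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def trace_redirects(url: str, fetch: Fetch, max_hops: int = 10) -> List[Tuple[str, int]]:
    """Return every (url, status) visited, ending at the first non-redirect."""
    chain: List[Tuple[str, int]] = []
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            raise RuntimeError(f"redirect loop at {url}")
        seen.add(url)
        status, headers = fetch(url)
        chain.append((url, status))
        location = headers.get("Location")
        if status not in REDIRECT_CODES or not location:
            return chain
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError(f"more than {max_hops} redirects")


# Simulated responses (assumed data) so the sketch runs without a network.
FAKE_SITE = {
    "http://example.com/": (301, {"Location": "https://example.com/"}),
    "https://example.com/": (302, {"Location": "/home"}),
    "https://example.com/home": (200, {}),
}


def fake_fetch(url: str) -> Tuple[int, Dict[str, str]]:
    return FAKE_SITE[url]


chain = trace_redirects("http://example.com/", fake_fetch)
for hop_url, hop_status in chain:
    print(hop_status, hop_url)
```

Recording each hop, and not only the final page, is what lets a check report which URL in the chain returned an unexpected status.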

9 Powerful Log Monitoring Best Practices to Follow in 2026

How many of your last five incidents were already sitting in the logs before anyone noticed? Most teams already collect more than enough log data. The problem starts with what happens next, and the same four gaps show up almost everywhere. This guide covers the log monitoring best practices that close those gaps. It walks through how to collect, structure, correlate, retain, and secure logs, so monitoring becomes a steady process and not a scramble during the next incident.

Why Does Network Topology Decide How Fast Your Network Recovers?

In this video, learn why network topology plays a critical role in network resilience, troubleshooting, and recovery. Discover how understanding network dependencies, eliminating single points of failure, and maintaining clear visibility can help IT teams reduce downtime and accelerate incident response.

Features in Icinga Web 2 Worth Knowing About

When you work closely with Icinga Web 2, developing modules, building dashboards, poking around the internals, you naturally pick up on features that most users never think about. Some are usability improvements that deserve more attention than they get. Others are developer conveniences that turn out to be genuinely useful in the right user situation too. They’re just the kind of thing that rarely makes it into the getting-started guide. Not all of these will apply to your daily workflow.

Telemetry Talks ep. 5 - OpenTelemetry in the AI agents era

Telemetry Talks explores how OpenTelemetry’s CNCF graduation arrives at a pivotal moment for AI-powered development. Together with Alex Marshalov, we dive into vibe coding, AI agents, and the growing need for observability in GenAI systems — from prompts and token usage to reasoning chains and distributed traces — using the VictoriaMetrics stack and OpenTelemetry as the foundation for understanding the next generation of autonomous software.

What Is Your Operating Model Costing Your Business?

The biggest cost in your business may not appear anywhere on your balance sheet because some of the most expensive problems are rarely measured directly. Lost productivity, recurring technology issues, underused applications, and the effort required to manage them all accumulate over time without ever appearing as a line item in a financial report.

ActiveMQ Protocol Comparison: AMQP vs MQTT vs OpenWire vs STOMP

One of ActiveMQ's most powerful and underappreciated capabilities is its protocol polyglotism: a single broker can simultaneously accept Java JMS clients over OpenWire, Python services over AMQP, IoT sensors over MQTT, and Ruby scripts over STOMP, all routing messages between each other without protocol bridges or translation middleware.

Datadog Data Observability: Be the first to know when data fails

Bad data doesn't announce itself. Datadog Data Observability gives you unified visibility across your entire data stack—from source systems and pipelines to dashboards and AI applications—so you catch silent failures before they cascade. Detect data quality and pipeline issues before stakeholders do, pinpoint root causes with end-to-end lineage, and reduce pipeline costs with job, cluster, and query recommendations.

What's New in InfluxDB 3.10: Performance Beta Expanded with New Enterprise Features

In our last release, we introduced a beta of performance updates designed for heavier, more complex time series workloads. InfluxDB 3.10 expands that beta to include enterprise features that give teams more control as they scale and manage larger workloads in InfluxDB 3. This release adds end-to-end backup and restore, row-level deletes, bulk import from Parquet, user management, and an RBAC preview to the previous performance beta.

Reduce Alert Fatigue with Composite Alerting in Hosted Graphite | Tutorial

Tired of noisy alerts waking you up for issues that are not actually impacting your services? In this tutorial, we walk through MetricFire's Composite Alerting capabilities and show how to combine multiple metric conditions into a single high-confidence alert using AND / OR logic. Learn how to: reduce alert fatigue and false positives; create service level alerts in Graphite; combine CPU, latency, and database metrics into meaningful alerts; use conditional logic to improve signal quality; and build smarter observability workflows with Hosted Graphite.
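The idea of combining several metric conditions with AND / OR logic can be sketched generically in Python. This is an illustration of the concept only, not MetricFire's API or configuration format: the helper names (`above`, `all_of`, `any_of`), the metric names, and the thresholds are all assumptions made for the example.

```python
from typing import Callable, Dict

Metrics = Dict[str, float]
Condition = Callable[[Metrics], bool]


def above(name: str, threshold: float) -> Condition:
    """Condition that is true when the named metric exceeds the threshold."""
    return lambda m: m[name] > threshold


def all_of(*conds: Condition) -> Condition:
    """AND: every sub-condition must hold."""
    return lambda m: all(c(m) for c in conds)


def any_of(*conds: Condition) -> Condition:
    """OR: at least one sub-condition must hold."""
    return lambda m: any(c(m) for c in conds)


# Hypothetical composite: page only when users feel it (latency) AND
# there is a plausible cause (CPU OR database connections).
service_degraded = all_of(
    above("latency_p95_ms", 500),
    any_of(above("cpu_percent", 90), above("db_connections", 180)),
)

busy_but_healthy = {"latency_p95_ms": 120, "cpu_percent": 97, "db_connections": 40}
really_degraded = {"latency_p95_ms": 800, "cpu_percent": 97, "db_connections": 40}

print(service_degraded(busy_but_healthy))
print(service_degraded(really_degraded))
```

A CPU spike on its own does not fire here; the alert needs the latency condition as well, which is the kind of noise reduction composite alerting aims for.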

Why AI observability is a critical ITOps priority

See how LogicMonitor helps ITOps teams monitor AI workloads, reduce blind spots, and move toward Autonomous IT. AI has shifted from experimental pilots to everyday business operations. Customers are interacting with AI-powered applications. Engineering teams are building with LLMs, GPUs, APIs, and automation at a much faster pace. That adds to the visibility strain on already overburdened ITOps teams.

Scout MCP Server: Example Prompts, Use Cases, and What's New

The Scout MCP server connects your AI assistant directly to your Scout Monitoring data. Instead of switching between your editor, Scout, and a chat window, your assistant can pull traces, errors, N+1 insights, and endpoint metrics on its own and use that context to suggest or make fixes right in your codebase. This covers how to connect it, what to ask it, how other teams are using it, and what we shipped recently.

When Local Blocks Go Global: The India-Telegram BGP Incident

Yesterday’s leak of a BGP hijack intended to block Telegram in India is the latest routing mishap best described as intentional, but also accidental — a pattern dating back to Pakistan Telecom’s infamous hijack of YouTube in 2008, in which a domestic block escaped containment and disrupted the service worldwide.

How Worker Safety RTLS Creates Safer Industrial Work Environments

Step onto the floor of any heavy stamping plant, automotive fabrication cell, or high-velocity distribution hub, and you see safety treated like an afterthought wrapped in a compliance checklist. You find yellow lines painted across the concrete, warnings stuck to every pillar, and flashing blue strobe lights mounted on the backs of forklifts. Yet close calls, near-misses, and serious floor injuries keep happening. These old-school safety methods fail because they place the entire burden of survival on human vision and split-second reflexes.

New: Save time during incidents with incident templates

Creating incidents often means filling out the same information over and over again. That’s why we’ve added Incident Templates – a faster way to create incidents using pre-configured settings. With templates, you can save commonly used incident details and apply them with a single click whenever you need them.

Why CI/CD Pipelines Miss Runtime Failures

A CI/CD pipeline does four things: it builds code, runs tests against mocked dependencies, lints for style violations, and scans for known vulnerability patterns. What it cannot do is validate how that code behaves under real users, real service responses, and real runtime constraints that staging was never configured to reproduce. That entire class of failure clears every gate cleanly and surfaces only in production.

Analysing Claude Code telemetry with SquaredUp - diving deeper

In our previous article we looked at the basics of: In this article, we are going to take a deeper dive into some of the complexities of configuration as well as some of the nuances of analysing Claude telemetry. Before we dive into the code, let us just remind ourselves that our telemetry pipeline looks like this: That is, we are emitting Claude Code telemetry to an OpenTelemetry Collector. The telemetry is then exported to an Application Insights endpoint and stored in Log Analytics tables.

MSP Summit: Why You Need Effective Documentation & How to Achieve It

Every year, MSP Summit unites some of the brightest minds in managed services. From tackling complex migrations that should have been straightforward to managing thousands of unique client environments, MSPs excel at adapting and rising to challenges. Even as industry trends evolve, though, one theme consistently comes up year after year: documentation.

Introducing Dataspaces and Datasets

Dataspaces and Datasets from Coralogix: the structured data layer teams and AI were waiting for. Turn a single query into a reusable dataset, share it across teams, and keep dashboards fast as your data scales. Dataspaces and Datasets are available now in Coralogix. Whether you're building dashboards, running background queries, or powering AI agents with telemetry data, Dataspaces give your organization a governed, high-performance data architecture that scales with your teams.

Inside the AI Team Weekly: AI Observability workflows and Prometheus exemplars (May 19th, 2026)

The Grafana AI team (Engineers Ivana Huckova and Sonia Aguilar) share what's new in AI Observability this week: a new way to instrument and visualize agent workflows, plus a neat trick for jumping straight from a metric spike to the exact conversation that caused it using Prometheus exemplars. We're showing parts of our team meetings to build in public in some small way and give you a sneak preview of what's to come. But not all features we show may make it to production! You've been warned. :)

Deep AI Investigation for ITOps: What It Is and Why It Matters

Investigation is the most time-consuming and cognitively demanding phase of incident response, and it’s the phase least served by existing tooling. Modern ITOps teams have spent years investing in better detection and alerting. The tools are faster, the dashboards are richer, and anomaly detection keeps improving.

IsDown is joining UptimeRobot

Today I'm sharing some big news: IsDown is joining UptimeRobot. When I started IsDown, the idea was simple. Keeping track of outages across dozens of vendor status pages was painful, and I wanted to make it easy to see, in one place, when the services you depend on go down. Thousands of teams now rely on IsDown to do exactly that. Joining UptimeRobot is the natural next step.

Visibility Isn't Reliability: Why Observability Alone Cannot Protect SLAs

Over the past decade, enterprises have invested heavily in observability platforms designed to deliver comprehensive insight into increasingly complex environments. Modern systems generate continuous telemetry across infrastructure, applications, networks, cloud services, and third-party dependencies. Metrics, logs, traces, and topology maps now provide a level of technical transparency that would have been difficult to imagine only a few years ago.

Eight best practices for a successful cloud migration strategy

Moving to the cloud is one of the most consequential decisions an IT organization makes. A successful cloud migration strategy sets the foundation for how your business scales, innovates, and competes. But too often, cloud migration initiatives stall, underperform, or force organizations to repatriate applications back on-premises because the groundwork wasn’t laid correctly.

Un-observable AI is Un-trustworthy AI

Recently, someone talked Chipotle’s customer support agent into reversing a linked list – a task completely unrelated to burritos in any way. Screenshots circulated, people laughed, but underneath the joke sat a sharper question. If a production support agent will do that on a public channel, what else will it do that nobody is screenshotting? The bug is funny. The trust gap behind it is not.

Use This OTel Processor to Prevent Your Dashboards From Breaking

A semantic-convention rename (http.method → http.request.method) can silently break your RED metrics — no errors, just gaps in dashboards and alerts. The OpenTelemetry Collector's schema processor fixes it: put it first in your pipeline and it normalizes attribute names no matter what each service emits. Migration mode writes BOTH the old and new names, so you get zero-downtime upgrades while queries keep working.
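To make the rename concrete, here is a small Python sketch of the normalization idea: map the old attribute name to the new one, with an optional mode that keeps both during a migration. It illustrates the behavior described above and is not the Collector schema processor's code or configuration; `RENAMES`, `normalize`, and the `keep_old` flag are hypothetical names chosen for this example.

```python
from typing import Dict

# Old name -> new name, taken from the rename mentioned above.
RENAMES = {"http.method": "http.request.method"}


def normalize(attributes: Dict[str, str],
              renames: Dict[str, str] = RENAMES,
              keep_old: bool = False) -> Dict[str, str]:
    """Return a copy of the attributes using the new names.

    With keep_old=True (an assumed 'migration' mode), both the old and the
    new name are present, so queries written against either keep working.
    """
    out = dict(attributes)
    for old, new in renames.items():
        if old in out:
            out.setdefault(new, out[old])  # do not overwrite an existing new value
            if not keep_old:
                del out[old]
        elif keep_old and new in out:
            out[old] = out[new]  # back-fill the old name for legacy queries
    return out


legacy_span = {"http.method": "GET", "http.route": "/orders"}
modern_span = {"http.request.method": "POST"}

print(normalize(legacy_span))
print(normalize(modern_span, keep_old=True))
```

Applying one mapping before any metrics are derived means a dashboard query only needs to know one name, whichever convention each service emits.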

Troubleshooting website response time latency

Your dashboards may be telling a different story than what the customers are experiencing. There's a version of a website problem that nobody talks about enough: the one where everything is technically fine. The site is up. The server is responding. No alerts have fired. And yet, somewhere out there, a user is watching a spinner rotate for the fifth second in a row, quietly losing faith in your product. This is what makes response time latency the most deceptive problem in web operations.

Troubleshooting website connection failures with website monitoring RCA

Every engineer has a story about the outage that came out of nowhere. One moment everything is green. The next, your monitoring dashboard lights up red, your inbox fills faster than you can read it, and somewhere a customer is staring at a blank screen wondering if your business still exists.

Alibaba Cloud monitoring: What changes when scale, speed, and cost collide

Alibaba Cloud monitoring isn't AWS or Azure monitoring with a different logo. The way its services scale, absorb load, and send early warning signals follows its own logic, and if you're watching the wrong things, you'll find out too late. Cloud monitoring conversations often follow patterns set by AWS and Azure. The metrics are familiar, dashboards look the same, and operational playbooks are built around expected infrastructure behavior.

Observability: Are You Measuring What Actually Matters?

Observability has always been important, and much like any core capability in your business, the value needs to be understood. For years, the value of observability was predictable. It was uptime, error rates, MTTR, and likely tool consolidation. That was enough to be able to show progress. These are foundational, table-stakes metrics, and they still matter, but they aren’t enough.

Kubernetes Monitoring: Datadog Alert to Lightrun Root Cause

Datadog Kubernetes monitoring tells an SRE team what failed, which pod failed, and when. It does so within seconds of the alert firing. The investigation then stalls at the same point every time: nothing in the dashboard layer can prove why a specific request behaved the way it did inside a running JVM at the moment of failure. Variable values, feature flag evaluations, and code branches are never captured.

How to create User-Defined Datasets in Coralogix

Learn how to create a user-defined dataset in Coralogix and route telemetry data into it using TCO policies with granular DataPrime expressions. In this walkthrough, you'll learn how to create a new dataset with its own schema, permissions, retention, and cost visibility; configure PBAC settings for governed access control; route data using DataPrime expressions in TCO policies; and fan out events to multiple datasets from a single source.

How to Reduce MTTR: 5 Proven Strategies for Enterprise IT Teams

Every minute of downtime impacts your business. Mean Time to Resolution (MTTR) measures how quickly your team can resolve incidents and restore services. In this video, learn 5 proven ways to reduce MTTR using unified observability, AI-powered alert correlation, automated runbooks, and ITSM integration to resolve incidents faster and minimize downtime.

Product Update - June 2026

IncidentHub's latest product update includes private status ingestion for Microsoft Azure and Microsoft 365, a simpler UI for alerts configuration, an option to disable the public status page, and a better looking status page layout. Plus, support for more vendors (1070+ and counting). As always, I am grateful to all our customers and beta testers who have shared their feedback which has made IncidentHub better.

Overview of Custom Checks

In this video, we’ll walk you through how to set up and configure your Custom Checks in Uptime.com. Learn how to effectively monitor your automations and processes using Uptime.com’s Custom Checks. This tutorial covers Heartbeat and Incoming Webhook checks, ensuring your tasks run smoothly and delivering instant alerts when issues arise. Discover how to set up and configure these checks to maintain optimal performance.

How Zero Trust is Reshaping Federal IT Strategy

Zero trust sparked a paradigm shift for federal agencies, changing the way they approach IT and data management as they "assume breach" from threat actors. Brian Chamberlain, Public Sector Business Development Lead at SolarWinds, explains how starting with observability helps federal agencies lay critical groundwork for meeting zero trust directives.

Find the Lookalike Domains Impersonating Your Brand: A Free Phishing & Typosquatting Scanner

Somewhere out there, a domain that looks almost exactly like yours may already be registered. Maybe it swaps one letter. Maybe it uses a Cyrillic character that is visually identical to a Latin one. Maybe it just adds the word "login" or "secure" to your brand. These lookalike domains are the raw material of phishing, and most companies have no idea how many exist for their brand until something goes wrong.
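The three patterns named above (a swapped letter, a Cyrillic lookalike, an appended keyword) can be enumerated with a short Python sketch. This is not the scanner's actual method; the `lookalikes` function and the tiny `HOMOGLYPHS` table are assumptions made for illustration, and a real tool would use a much larger character table and then check which candidates are actually registered.

```python
import string

# Small illustrative table (Latin letter -> visually similar Cyrillic letter).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}


def lookalikes(name, tld="com"):
    """Generate candidate lookalike domains for a brand label such as 'example'."""
    candidates = set()
    for i, ch in enumerate(name):
        # Pattern 1: swap one letter for another ASCII letter.
        for sub in string.ascii_lowercase:
            if sub != ch:
                candidates.add(name[:i] + sub + name[i + 1:])
        # Pattern 2: swap one letter for a Cyrillic homoglyph.
        if ch in HOMOGLYPHS:
            candidates.add(name[:i] + HOMOGLYPHS[ch] + name[i + 1:])
    # Pattern 3: append a trust-sounding keyword.
    for word in ("login", "secure"):
        candidates.add(f"{name}-{word}")
        candidates.add(f"{name}{word}")
    return sorted(f"{c}.{tld}" for c in candidates)


domains = lookalikes("example")
```

Even this narrow sketch yields well over a hundred candidates for a seven-letter brand, which hints at why the real number of lookalikes is hard to track by hand.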

ClickHouse LowCardinality: When It Helps and When It Hurts

ClickHouse LowCardinality cuts storage and speeds up queries on low-cardinality columns, but backfires on trace IDs. How to tell the difference. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Python Error Tracking for Django, Flask, and FastAPI: A Practical Setup Guide

Your Python app is throwing errors in production right now. Some of them are obvious: a 500 response, an angry Slack message from support. But most are quiet. A background task swallows an exception. A race condition surfaces only under load. A third-party API returns unexpected data and your code handles it by not handling it. If you’re relying on log files and user reports to find these, you’re debugging after the damage is done.

Why Your Agentic Workflow Succeeds and Still Gets It Wrong

Agentic workflows are reshaping how engineering teams operate, fetching context, synthesizing decisions, and shipping results across systems without human intervention. But the same design that makes them powerful adds risk in production. Agents do not crash when they hit bad data; they synthesize around it, substituting a stale value, an empty page, or a missing field for the result they were supposed to capture.

Generate Synthetic Time Series Data in InfluxDB 3

Getting InfluxDB 3 up and running is a pretty lightweight process with the installation script. Getting time series data into it is the next step, and for exploration, basic testing, or scenarios where you don’t have a stream of time series data ready to write, that can be a point of friction. That hurdle is particularly high when you want to test the rest of the system around the data you’d be writing.

Your Monitoring Stack Wasn't Designed. It Was Procured.

The 2am war room hasn’t gone anywhere. Ten years after Gartner coined the term AIOps, the platforms are bought, the licenses are renewed, the dashboards are live — and serious incidents still get resolved by engineers paging across multiple consoles, trying to work out where the fire actually is. MTTR has barely moved. Alert fatigue hasn’t eased. The outcomes the category promised, in most enterprises, have not arrived. Matt Lowe’s recent article on AIOps names the shortfall well.

How to Troubleshoot High CPU Usage on Network Devices

Most network teams only find out their firewall is overloaded after users start complaining. A slow VPN, dropped calls, and random packet loss at 2 pm every day. The usual suspects get blamed first: the ISP, the switch, the application server. The firewall gets a pass because the dashboard says 40% CPU and everything looks fine. Here is the problem with that picture. Standard SNMP monitoring polls every 5 minutes. A CPU spike that peaks at 95% and recovers within 90 seconds never shows up.
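A small Python simulation shows why the spike disappears. The load curve is hypothetical: a 40% baseline with one 90-second burst to 95%, timed (as an assumption) to fall between two polls. The function names `cpu_at` and `poll` are illustrative and do not correspond to any SNMP library.

```python
def cpu_at(t):
    """Hypothetical CPU load in percent at second t: 40% baseline, 90-second spike to 95%."""
    return 95 if 100 <= t < 190 else 40


def poll(interval_s, duration_s=900):
    """Sample cpu_at every interval_s seconds, as a fixed-interval poller would."""
    return [cpu_at(t) for t in range(0, duration_s, interval_s)]


five_minute = poll(300)  # samples at t = 0, 300, 600
ten_second = poll(10)    # samples at t = 0, 10, ..., 890

print(max(five_minute))  # 40: the spike falls between polls
print(max(ten_second))   # 95: the spike is sampled
```

A spike that happens to overlap a poll would still be recorded at the 5-minute interval, so the coarse poller is not always blind; it is simply unreliable for events shorter than the polling period.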

Better, faster, less wrong: Enhancing issue grouping

Sentry’s job is to tell you when your app breaks. To do that, we group individual errors into issues. First by fingerprinting, which lexically matches errors based on their structure, then by an AI fallback: when fingerprinting can’t find a match, an ML model compares the new error’s stacktrace against existing issues and merges it if they’re semantically similar.

How AI is Reshaping IT Operations Management

AI is transforming IT operations through automated incident response, intelligent event correlation, predictive analytics, and agentic AI. But while technology is evolving rapidly, human judgment and strategic decision-making remain essential. In this video, explore what's changing in IT operations, what isn't, and how IT leaders can prepare for an AI-driven future with AIOps, observability, and automation. Learn how Motadata helps organizations build smarter, more proactive IT operations.

Building More Resilient Multi-Cloud Operations

The last post in this series looked at how disconnected alerts can slow incident response and how stronger correlation helps teams investigate issues with more clarity. That same operational context has value beyond triage. It also plays an important role in resilience, service assurance, and the ability to maintain confidence across increasingly complex multi-cloud environments. Resilience depends on more than reacting well during an outage.

Avantra 26 Overview: AI-powered SAP operations across your entire hybrid estate.

Avantra 26 brings AI root cause analysis, SAP Cloud ALM integration, expanded BTP visibility, and next gen automation together in one platform. Avantra AIR investigates incidents the moment they're detected and surfaces a structured diagnosis with next steps, cutting resolution times by 60% and turning hours of expert triage into seconds. As an SAP Cloud ALM Silver Partner, Avantra delivers production-ready, two-way synchronisation of systems and alerts across multiple Cloud ALM tenants.

Avantra 26 next-gen automation: self-service SAP workflows with full guardrails

Avantra 26's next-gen automation experience puts SAP automations in the hands of your users — through guided wizards with scoped permissions, lifecycle notifications, and a full audit trail. Watch this demo of SAP client settings (SCC4) change on a RISE with SAP S/4HANA system: configured in five steps, executed automatically, documented end to end. Avantra customers reduce manual operational effort by up to 70%. Now you're really running.

Avantra + SAP Cloud ALM Demo: Two-way Cloud ALM sync in action across your entire hybrid estate.

An SAP Cloud ALM Silver Partner, Avantra 26 delivers a production-ready SAP Cloud ALM integration — two-way sync of system data and alerts, multi-tenant Cloud ALM visibility, and the ability to act on Cloud ALM systems directly within Avantra. One platform for RISE, hybrid, and everything beyond.

What is Cloud Security - Explained in 5 minutes

Cloud security isn't just about locking things down; it's about staying ahead of threats in fast-moving, dynamic environments. In this video, Kat breaks down what cloud security actually means in 2024 and why traditional approaches don't cut it anymore. Whether you're securing containers, Kubernetes workloads, or multi-cloud infrastructure, this is your foundation. Subscribe for more cloud security explainers, tutorials, and best practices from Sysdig.

The Next Evolution of Infrastructure Observability

Operational visibility is becoming increasingly important as infrastructure teams are asked to support AI initiatives, automation goals, cost accountability, modernization efforts, and growing operational complexity at the same time. Most are expected to do it without expanding headcount, introducing additional risk, or rebuilding the environment from scratch. Those expectations are changing the role of infrastructure operations.

Home Sweet Hybrid - There's more to your transition than core ERP

Most large enterprises elect a hybrid approach to SAP operations, but this is a more recent trend. And, as SAP operations professionals, we are still learning about the impacts of this choice and approach, even though it’s often the most sensible and pragmatic. Years ago, everything ran on-premises.

No SAP Expertise? No Problem. Automation Just Got Easier

Avantra 21 introduced the concept of Automation workflows. By Avantra 23, compatibility with Ansible was added. Along the way, Avantra became the tool SAP teams reached for when they wanted system copies, refreshes, and backup orchestration to just run — predictably, on a schedule, without a senior engineer babysitting them. So, automation isn’t new to Avantra users. Avantra made automation for SAP practical to deploy, predictable in operation, and easier to maintain.

Tencent Cloud: When systems start reacting to themselves

Distributed systems don't just fail. They adapt. Services in Tencent Cloud environments are tightly interconnected. Compute, load balancing, databases, and networking layers continuously respond to each other based on changing conditions. Under normal load, this coordination stays in the background. As pressure builds, the behavior shifts. The system does not degrade in a straight line. Instead, it starts adjusting itself.

Introducing the StatusGator Notion Integration

Many teams use Notion as the central hub for documentation, runbooks, incident response, and operational planning. When an outage occurs, the last thing you want is for responders to jump between multiple tools searching for information about the health of critical vendors and dependencies. That’s why we’re excited to introduce the StatusGator Notion integration.

How to use Postman Visualizer: a step-by-step guide

API responses are often easier to understand when they are displayed visually instead of as raw JSON. While Postman is widely used for testing APIs, many developers overlook one of its most useful features: the Postman Visualizer. While it is not as fully featured as a dedicated dashboarding platform like SquaredUp, it is a great way to quickly visualize API responses during development and debugging.

Federated Search | From Silos to Insight | Azure Blob Schema Discovery with Splunk's Crawler

This walk-through shows how Splunk's Cloud can discover schema and partition keys for Microsoft Azure Blob Storage datasets and create searchable Splunk-managed tables. Once the data is mapped, analysts can use Splunk Federated Search to query Azure Blob data where it lives, bringing cloud-resident logs into security, observability, and operational workflows without re-ingesting the data.

Catch visual regressions with Snapshots, now in beta

Sentry Snapshots diffs screenshots on every commit and blocks the PR if there are any visual changes so you can confirm they’re intentional. Users don’t interact with code, they interact with something they can see and touch. Snapshots gives you a lightweight way to test it. It’s easier than ever to change code. It’s also easier than ever to trade quality for speed. Modern codebases need guardrails to ensure correctness.

Visualising Claude Code telemetry in SquaredUp

Engineering teams are shipping more AI-generated code than ever, but at what cost? Learn how to build a telemetry pipeline to monitor Claude Code usage and costs directly in SquaredUp. It is estimated that 85-90% of engineering teams are now using AI coding assistants such as Claude, Codex and Cursor. This is not just for small-scale pilot projects: around 40% of all code now being shipped is AI-generated, and in start-ups the figure is around 95%. This can result in incredible productivity gains.

How Skylar MCP Gives Agentic Workflows the Operational Context to Act With Confidence

AI models can reason over language, summarize findings, and explain patterns. What they cannot do on their own is see the real-time operational state of your environment. Ask a model about a critical incident and it will answer from whatever context it is given, which means the answer is only as trustworthy as the input. In operations and compliance workflows, an answer is only useful if it is grounded in current service context and governed access to the systems that define reality.

Safeguard Revenue and Brand Trust with Full-Stack Visibility

The quick download: Most observability strategies overlook the internet layer that underpins every user’s digital experience, leaving it almost entirely unmonitored. Most IT teams monitor servers, networks, and applications, yet the infrastructure layer that carries traffic to users remains largely unmonitored.

Satellite Telemetry, ITAR, and Data Residency: Building Architecture for Speed and Control

Satellite mission operators depend on telemetry to understand spacecraft health, ground system performance, and mission status in real-time. Operation signals help teams identify risks, investigate anomalies, and keep operations moving. When a spacecraft enters safe mode or signal strength drops during a contact window, teams need trusted telemetry immediately. But mission data moves quickly across operational systems, and every handoff makes it harder to control.

Finding the Slow Query Killing Your Rails App

Performance problems in Rails applications are sneaky. Generally speaking, nobody opens tickets that say “my application is slower than it was last month (about 20%)”. What you do get instead are vague complaints from team members about a p95 latency that is climbing every week or a background job that used to take 2 seconds now taking 40 seconds to finish.

Monitoring Protocols Compared - Which Standard for What

Modern applications are distributed, ephemeral and built from a dozen moving parts. To keep them reliable, you need real visibility: not just “is the server up?”, but “how is this request behaving, right now, across every component it touches?”. The good news is that the observability world has converged on a handful of open standards.

How to Choose the Right Server Monitoring Tool: A Step By Step Guide for 2026

How do you pick one server monitoring tool when every vendor page promises the same thing? A few years ago, two monitoring vendor websites showed you two different products. Today you can open five and read nearly the same feature list on each one. Real-time dashboards, instant alerts, AI everywhere. That sameness has made evaluation harder than ever. The marketing tells you nothing, and the wrong choice follows your team for years, either as features nobody opens or as the one missed alert at 2 a.m.

ChangeTower User Stories - Turning Public Web Changes into Recruitment Pipeline

For modern business teams, the public web is the single largest source of competitive and market intelligence — and one of the hardest to keep up with. Compliance teams track changes to regulations, policies, and terms. Competitive intelligence teams watch rivals’ pricing, positioning, and personnel. Recruiters and business developers monitor hiring activity that signals new opportunities. In every case, the value lies in noticing a change before anyone else does.

Proactive Alerting with AIOps

Modern IT environments generate huge volumes of telemetry across infrastructure, applications, cloud services, and networks. Teams now have more data than ever, but that does not automatically lead to better decisions. In many organizations, the real problem is no longer visibility alone. It is the ability to identify which signals matter, understand what they mean, and respond before users or business services are affected.

Top New Relic Alternatives in 2026

New Relic is a capable full-stack platform, but its bill is built on two axes that both grow as you scale: data ingested and per-user seats. Full-platform user fees run $49 to $349 per user per month, so a 20-person team can pay $6,980 or more in seats alone before a single gigabyte of telemetry, and the Compute Capacity Unit model adds query and alert charges that spike during the incidents when engineers run the most queries.
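The seat figure can be checked with simple arithmetic. The sketch below uses only the numbers quoted above and assumes, for illustration, that all 20 people are billed at the same per-user rate; actual plans may mix tiers.

```python
TEAM_SIZE = 20
LOW_FEE, HIGH_FEE = 49, 349  # USD per user per month, as quoted above

low = TEAM_SIZE * LOW_FEE    # every seat at the lowest quoted rate
high = TEAM_SIZE * HIGH_FEE  # every seat at the highest quoted rate

print(f"${low:,} to ${high:,} per month")  # $980 to $6,980 per month
```

The $6,980 figure therefore corresponds to all 20 seats at the top of the quoted range, before any data ingest or compute charges.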

Balance AI innovation and governance with Sumo Logic AI and ML apps

AI is changing how teams work. Developers are generating code faster, security teams are automating investigations, and employees across the business are using AI tools to accelerate research, content creation, and decision-making. But this adoption comes with a catch. As usage explodes, it introduces a new set of security risks: a rapidly expanding attack surface, faster attack timelines, potential data exposure, and an alarming lack of visibility into how these tools are being used.

Seven Straight Years of Verified Customer Trust

Seven years ago, our customers started telling the world what the ScienceLogic AI Platform does for their operations. They haven’t stopped. For the seventh consecutive year, that steady stream of verified customer reviews has earned the ScienceLogic AI Platform a TrustRadius Top Rated award, again. Seven years in a row shows that customers keep choosing to share their experience because the platform keeps delivering value. This recognition doesn’t come from us.

G2 Names Auvik Network Management & Monitoring Leader Across Summer 2026 Reports

G2 reports are built around what customers say about the products they use. In the Summer 2026 Reports, that feedback helped Auvik earn top recognition across Grid Reports and Index Reports for Network Management Tools and Network Monitoring.

Automated Network Documentation 101: What You Need to Know to Get Started

Network documentation has a way of becoming everyone’s problem and nobody’s responsibility. Over time, diagrams become outdated, configuration changes go undocumented, and critical knowledge ends up living in the heads of a few senior technicians instead of somewhere the entire team can access it. That’s why organizations are turning to automated network documentation.

Store and search high-volume logs with ClickHouse and Datadog

As teams scale AI and agentic workloads, log volumes can grow fast. That growth can force teams into a difficult trade-off: Keep logs searchable in their existing workflows, or store them cost-effectively for longer periods. For teams that rely on logs during incident response, compliance reviews, and long-running investigations, losing either affordability or searchability can slow down troubleshooting. Datadog and ClickHouse are partnering to help remove that trade-off.

DASH 2026 Keynote

At DASH 2026, Datadog launched 100+ capabilities to help customers drive autonomy and manage growing AI and security complexity. From new Bits AI, log management, and security capabilities, customers have the visibility and autonomous operations they need to detect, investigate and resolve issues across the development loop and data lifecycle. Tune in to the full keynote to catch the highlights.

Getting started with Prometheus dashboards

Prometheus is a wildly popular open source monitoring tool typically used for monitoring Kubernetes environments and containerized workloads. But how do you turn the mountains of metrics into a clear picture of health and performance? SquaredUp plugs directly into your Prometheus database to visualize and monitor your data. What sets SquaredUp apart from other Prometheus visualization options like Grafana and Perseus is just how easy it is to visualize, monitor and share Prometheus dashboards.

The Real Cost of Custom Code: Why Buying a Unified Middleware Management Platform Protects Enterprise IT Budgets

Building custom middleware monitoring appears cost-effective but creates expensive maintenance debt, fragmented visibility, and operational risk. Enterprise teams spend 60-80% of IT budgets on software maintenance while unified platforms deliver immediate, production-ready capabilities.

Grafana Tempo: The distributed tracing journey to 3.0 (June 2026 Community Call)

Our distributed tracing journey from the inception of Tempo to 3.0. Grafana Cloud is the easiest way to get started with Grafana dashboards, metrics, logs, traces, and profiles.

How Managed Digital Employee Experience (DEX) Supports Smarter Device Refresh Decisions

Let’s face it, refreshing devices used to be a guessing game. IT teams would swap out laptops and desktops on a fixed schedule, hoping to keep everyone happy and productive. But in today’s hybrid, cloud-first world, that old approach just doesn’t work. Employees expect a seamless experience, and businesses can’t afford to waste money on unnecessary upgrades or risk productivity dips from outdated tech. That’s where Digital Employee Experience (DEX) comes in.

Graviton5 in Production at Honeycomb: Per-service Results From the m8g to m9g Migration

This is the fourth installment in the Graviton retrospective series we've been writing since 2021. The methodology is the same one I always reach for: hold the workload constant, run both generations on the same Kubernetes namespace concurrently, and let the per-pod numbers speak.

What is Automated Patch Management?

Learn why manual patch management creates unnecessary risk for IT teams and how automated patch management helps organizations improve security, compliance, and operational efficiency. Discover how automation eliminates repetitive tasks, reduces human error, prioritizes critical vulnerabilities, and accelerates patch deployment across the entire IT environment.

Why Your Vendor Monitoring Strategy Has a Blind Spot: The Case for Continuous TPRM

You monitor everything. Network traffic, application performance, authentication events, infrastructure health. If something meaningful changes in your environment, you have a signal for it. That discipline is foundational to how modern IT and security operations work. But there is one part of your stack you almost certainly cannot see in real time: your vendors.

Time to move to the StatusGator v3 API: What v2 users need to know

We launched the StatusGator v3 REST API back in October, and it has only gotten better since. v3 is a ground-up redesign built around organization-level API tokens, a consistent response format, opaque string IDs, pagination, and a large set of write endpoints for managing monitors, incidents, and subscribers. We have kept shipping new capabilities for it, and we will keep doing so. v2, on the other hand, is done.

How to Size Infrastructure When Hardware Delays and Cost Pressure Change the Equation

Sizing infrastructure has always required a balance between performance, capacity, and risk. What has changed is the level of precision required to make those decisions. Hardware timelines are less predictable. Costs are under closer review. Decisions that were once routine now require clear justification. In many cases, the question is no longer just how much capacity is needed, but whether that capacity can be delivered when it is needed and whether the investment will hold up under scrutiny.

Building Enterprise Momentum Across APAC: A Conversation with Dave Patnaik

There’s a lot happening across Asia Pacific right now. Enterprises are moving quickly to modernize operations, adopt AI, and manage growing complexity across increasingly distributed environments, and the opportunity ahead for LogicMonitor in the region continues to grow alongside it. That’s why I’m especially excited to welcome Dave Patnaik to LogicMonitor as our new Vice President of APAC.

Devolutions Makes an Important Investment in Obkio

Obkio is proud to announce an important investment from Devolutions, one of the most respected names in IT security and remote access management. This investment marks a new chapter for Obkio as we accelerate our next phase of growth. This isn't a financial transaction between strangers. It's a partnership between two companies that have spent years building tools for the same people: IT professionals, sysadmins, MSPs, and the network engineers who keep critical network infrastructure running.

Monitor Memory Where Allocations Occur

Kubernetes dashboards often mask a system infrastructure failure. When a critical application crashes, it often points to an Out-of-Memory event, even while standard CPU metrics appear completely healthy. This quick walkthrough shows you how Coralogix integrates continuous memory profiling directly into your production environment. We pair OpenTelemetry trace data with continuous background sampling via the Async Profiler. It helps teams isolate resource-heavy code paths before they trigger system degradation.

Infinite Cardinality Metrics: Custom metrics built for modern systems

Every technology shift adds new context you need to measure. Cloud computing added regions and services. Kubernetes added containers and pods. Multi-tenant applications added users and tenants. AI systems add models, prompts, agents, and execution paths. The result is that metrics are becoming dramatically more dimensional, faster than ever before. Over time, engineers are forced to make tradeoffs.

Native ASIM Ingestion for Microsoft Sentinel, Now in Bindplane

If you're sending security data to Microsoft Sentinel, you now have a faster path. A new ASIM mode lands your logs directly in Sentinel's native ASIM tables: no custom tables to predefine, no schema to design before data flows. We added ASIM mode to the Microsoft Sentinel destination, backed by a new ASIM standardization processor that converts raw logs to ASIM in the pipeline and routes each record to the table it belongs in. Here's how it works, and why we built it this way.

From API to live dashboard - building a SquaredUp plugin with AI

No matter how fast we build, we'll never integrate with every tool. There are too many, new ones appear constantly, and some are too niche to ever reach the top of our roadmap. So if the tool you care about isn't supported yet, your options have been to wait for us to get to it, or build it yourself with our Web API plugin — a powerful, flexible option, though one that asks you to map out the endpoints, authentication and paging yourself.

Building a Predictive Maintenance Plugin with the InfluxDB 3 Processing Engine

Predictive maintenance is one of the most compelling use cases for time series data. Instead of waiting for equipment to fail or servicing it on a fixed calendar regardless of condition, you watch the live sensor data and act when it indicates that a failure is coming. That “watch the data and act” loop is exactly what the InfluxDB 3 Processing Engine was built for.

Oops! All Robots | SolarWinds TechPod

Chrystal Taylor and Sean Sebring explore the fascinating world of robotics, from consumer devices to advanced medical robots, and discuss the future of humanoid and non-humanoid robots in our lives. Chrystal and Sean, with guest Andy Garibay, explore the future of robotics, focusing on form, function, and societal impact. They debate why certain designs, such as dogs, dominate, the potential of automation in construction and security, and the ethical considerations of humanoid robots.

Why Security Teams Spend So Much Time Reconciling Data

Security teams today are managing growing volumes of cybersecurity data across increasingly complex environments. This blog explores the hidden operational cost of disconnected tools, manual data reconciliation, and fragmented reporting, and how Teneo’s Cyber Asset Attack Surface Management (CAASM), powered by ThreatAware, helps organizations create a more unified and trusted view across their security estate. Most organizations are not short of security tools.

Why Most Organizations Still Don't Know What's Protected

Organizations invest heavily in cybersecurity tools, yet many still struggle to confidently understand what is actually protected across their environment. This blog explores how disconnected systems, unknown assets, and inconsistent data create blind spots, and how Teneo’s Cyber Asset Attack Surface Management (CAASM), powered by ThreatAware, helps organizations gain a trusted view of security coverage.

If You Are Building a Startup from a Vibe-Coded App, Don't Skip This #devops #programming #ai

Everyone is vibe coding products right now. But most applications are missing one crucial thing: observability. You can literally start this weekend. If you are turning your vibe-coded app into a real startup, observability should not be an afterthought.

DASH 2026 Operating at Scale: Guide to Datadog's newest announcements

A challenge for many teams continues to be managing cost, governance, and reliability across an ever-larger footprint. This year’s DASH announcements help teams operate efficiently at scale, with new tools to cut cloud and AI spend, eliminate waste automatically, maintain observability during outages, and manage many organizations and agents as a single unit.

Autonomously monitor for impactful degradations with Bits Detection

Monitoring is built around the system a team understands at a point in time. Engineers add endpoints, move dependencies, and change user flows every day. Over time, that creates coverage drift as monitors keep reflecting the system as it used to behave, while changing paths introduce failure modes that teams didn’t yet know to watch for. Bits Detection automatically creates, tunes, and maintains monitors for your services.

Get reliable answers to business questions with Bits Data Analysis

Teams are wiring AI coding agents straight to their warehouse over MCP and asking things like “What was our revenue by channel in Q2?” The agent finds a revenue table, runs a query, and returns a number in seconds, with no waiting on the data team. While the answer initially looks right, the problem is that the number is often wrong.

Turn Datadog findings into automated code fixes with Bits Code

Engineering teams lose hours in the gap between detecting a problem and getting a fix into review. An on-call engineer sees an error spike in Datadog, pivots to traces and logs to isolate the failure, opens the relevant repository, reproduces the issue, writes a fix, adds tests, waits on CI, and finally opens a pull request. Even when the problem is familiar, the workflow pulls engineers across several tools and stretches remediation from minutes into hours or days.

Deleting Em-Dashes : REALITY BYTES AI WORKPLACE SPECIAL ft. HR Leader Gabi Tofani

In this Reality Bytes special, Tom and Oriana welcome Nexthink’s Head of Global Talent Success, Gabi Tofani, to explore how AI is reshaping workplace culture, learning, leadership, and employee experience. From measuring AI adoption and building curiosity-driven cultures to the risks of “AI slop,” homogenized thinking, and performance reviews written by bots (for bots), the conversation examines what organizations might gain (and lose) as AI becomes embedded in daily work.

Best MSP Software in 2026: How to Choose the Right Platform

MSPs already have plenty of tools. The harder problem is getting a clear read on what’s happening across each customer environment, which alerts point to the same issue, and where engineers should start. Choosing the right MSP software is really about choosing the right operating layer for service delivery. MSPs are supporting more customers, more environments, and more alerts, but adding another tool doesn’t always make the work easier.

Automatically discover and remediate root causes with Grafana Assistant Investigations

You can use Grafana Assistant Investigations to automatically discover incidents and help find root causes—and this AI-powered Grafana Cloud feature recently got a major upgrade to give you even more confidence in its findings. You can read more about the behind-the-scenes effort in our new engineering blog Unprompted, where we get into harness engineering, context compaction, benchmarking, and keeping agents alive and working well in long-running sessions.

Scout Monitoring Now Supports Node.js: Express, NestJS, Prisma, and More

We have been getting the same request from teams for a while now: “We use Scout for our Rails app. Can we get the same thing for our Node services?” Today the answer is yes. Scout Monitoring now supports Node.js. If your team runs Express or NestJS in production, you get the same errors-and-traces experience that Ruby, Python, PHP, and Elixir teams have had. Let’s walk through what that means in practice.

What is DNS TTL and How to Choose the Right Value

DNS TTL is one of those settings nobody thinks about until it bites them. Then they think about it a lot. This guide explains what DNS TTL is, how it works in plain language, and how to pick the right value for your records. By the end you will know what to set, when to change it, and why it matters when you migrate to a new server.

The AI Bottleneck: Why Your Modern Models Are Choking on Legacy and Streaming Data Architecture

Enterprise AI struggles not from inadequate models, but from fragmented data architecture. Critical business data remains trapped in legacy systems or lost in streaming complexity. Success requires bridging the gap between modern intelligence layers and underlying systems of record.

Color-coded log monitoring for simplified log analysis

Modern production environments generate massive volumes of logs every day. As systems become more distributed and cloud-native, that volume only increases. The real challenge isn’t collecting logs—it’s identifying what matters fast enough to act using effective log visualization. Most log views fail at this point. Every entry looks the same, forcing engineers to scan them manually and interpret lines under pressure.

Running the OpenTelemetry Collector as a Lambda

The OpenTelemetry Collector is usually deployed as a long-running process: a sidecar, a DaemonSet, an EC2 instance, a docker container on my computer. It sits there listening for telemetry. That's fine when I want to send telemetry all day, but not when telemetry is rare. Like right now, when I have an agent defined on AgentCore, and it runs a few times a week maybe. Or my website that hardly sees any traffic. Can I run the OpenTelemetry Collector as a Lambda function?

It Can Only Goodhart Happen

“When a measure becomes a target, it ceases to be a good measure.” (Charles Goodhart, 1975) You’ve probably read this quote in relation to any number of things over the years. People complain about arbitrary metrics like PRs merged, lines of code produced, and now, token usage. But is the era of tokenmaxxing over before it even began? The rise and death of token leaderboards at companies like Amazon seem to have taken place in less than three months!

What is SRE Observability and Key Pillars You Should Know?

What happens when a critical service slows down, but nothing is technically “broken”? Most teams have monitoring in place. They know when something goes down. But when performance drops or issues spread across services, finding the real cause becomes slow and unclear. Engineering teams end up switching between dashboards, logs, and alerts just to understand what changed. This delays response and increases pressure on on-call teams. This is where SRE observability becomes essential.

Search and act across Datadog to resolve issues faster with Bits Chat

Finding the right information across dashboards, monitors, and telemetry sources takes time, even for experienced engineers. When something breaks, it often means figuring out where to start, rebuilding queries, and jumping between metrics, logs, and traces before you can take action. The challenge isn’t a lack of data but the effort required to surface the right information at the right moment.

Three Years a Leader. Thank You.

Dear Nexthink community, We are excited to be named a Leader in the 2026 Gartner Magic Quadrant for Digital Employee Experience Tools for the third year in a row. I want to share this recognition with our customers, our partners and ecosystem, and every Nexthinker across the world. As a founder, it’s a true honor to work alongside so many talented people. To us, this recognition is also yours.

Works on my machine: how we use AI to reproduce reported bugs

Sentry’s SDK teams maintain and support SDKs for a vast ecosystem of languages and frameworks. See our release registry for a source of truth. We’re currently at 159 published packages across the entire ecosystem. If you use it, we probably support it. All of these SDKs are open source and have their own GitHub repositories that we maintain on a daily basis. And like any other open source project, we get tons of bug reports and issues on these.

Top 10 Prompts for Your Monitoring Tool

You open a monitoring tool, and the data is all there: errors, traces, anomalies, incidents, and countless intricacies. If you want to get the right slice of that data, you need to know exactly which dashboard to open and what filters to apply. But when the poor UI gets in the way, this can take longer than it should. Luckily, this is not the case with AppSignal. MCP (Model Context Protocol) changes the interface entirely.

Introducing the StatusGator browser extension for Chrome and Firefox

We’re excited to announce the launch of the StatusGator browser extension, now available for both Chrome and Firefox. Whether you’re troubleshooting an issue, wondering if a website is down, or looking for more information about an ongoing incident, the extension gives you instant access to service status information with a single click. Simply install the extension and start checking the status of websites and services as you browse.

New: Introducing the StatusGator Chrome extension

We’re excited to announce the launch of the StatusGator Chrome extension, a new way to check the status of websites and online services directly from your browser. Whether you’re troubleshooting an issue, wondering if a website is down, or looking for more information about an ongoing incident, the extension gives you instant access to service status information with a single click. Simply install the extension and start checking the status of websites and services as you browse.

API update: Full board management now available

We’re excited to announce expanded functionality for the StatusGator Boards API. You can now create new boards, update existing boards, and delete boards directly through the API. Previously, the Boards API only supported listing boards and retrieving board details. With these new capabilities, you can automate the complete board lifecycle – from provisioning new boards to managing ownership and cleaning up boards that are no longer needed.

Turning Disconnected Alerts into Actionable Insights

The previous post in this series focused on shared context and why hybrid operations depend on a connected view across cloud, network, and infrastructure. Once that context is in place, the operational benefits become easier to see—especially during incident response, where signal volume and fragmented tooling can slow teams down. Alert noise remains one of the most persistent challenges in hybrid environments. Every layer of the stack can generate its own warnings, anomalies, and service events.

Network Device Monitoring: Topology Maps and NetFlow

Most teams run one tool for SNMP polling, another for topology, and a third for flow analysis, then spend their time stitching the views together. This webinar shows how Netdata brings all three into a single dashboard, with 100+ vendor profiles out of the box, automatic Layer 2 topology mapping, and a flow collector that auto-detects NetFlow, IPFIX, and sFlow on a single port.

Why Engineers Don't Trust Autonomous AI - 4th Annual Observability Survey | Grafana Labs

The 2026 Observability Survey from Grafana Labs heard from over 1,300 engineers and leaders across 76 countries on the real-world role of AI in observability. The data reveals a sharp distinction between intelligence and autonomy — and a critical blind spot most teams have.

Asimov's Zeroth Law of Robotics: testing and observing AI (ExpoQA 2026)

Asimov's Three Laws of Robotics are missing one — and when it comes to testing and observing AI, Nicole van der Hoeven argues that missing rule changes everything: before a robot can avoid harm, obey orders, or protect itself, there has to be a Zeroth Law: a robot must be observable. Because if you can't see what a system is doing, you have no way of knowing whether it's following any rule at all.

11 Incident Management Best Practices Every IT Team Should Follow

A well-defined incident management process can mean the difference between a minor disruption and a major business outage. When critical services fail, every minute of downtime matters. Yet many IT teams still face challenges such as unclear ownership, poor prioritization, communication gaps, alert fatigue, and manual processes that delay resolution. The result is longer outages, missed SLAs, and frustrated users.

Progress Wins at the Network Computing Awards

Progress has been named a winner at this year's Network Computing Awards, earning industry recognition for its ongoing commitment to innovation and delivering real-world value to customers. A standout event in the UK technology calendar, the Network Computing Awards celebrate organizations and solutions that are driving measurable impact across the industry.

Errors, traces, logs, metrics: when to reach for what

When should I reach for a log, a trace, or a metric? I hit that question constantly when I instrument code, and I watch coding agents hit it too. It sounds like it should be obvious. Errors, traces, logs, and metrics are the four kinds of telemetry most apps run on, four tools in one box, and they overlap enough that the honest answer is every developer’s favourite: it depends. You can stuff context into span attributes instead of logging it. You can count log events instead of emitting a metric.

Zero Friction, Zero Tickets, Zero Disruption: The New Operational Mandate for IT

For decades, IT operations have followed a familiar model. Specialized teams manage different parts of the environment, from infrastructure and networks to security and endpoint management. When employees encounter issues, they submit tickets to the service desk, which are then triaged, escalated, and resolved. This structure has endured because it provided a reliable way to maintain system health and respond to problems as they arise.

How Digital Experience Monitoring Protects Your Paid Social ROI

The Australian digital advertising market is experiencing an unprecedented era of growth. Recent industry data shows that internet advertising investments have reached a staggering record of $18.4 billion. Furthermore, over 77 percent of Australians are now regular social media users, spending nearly two hours every single day on various digital platforms. In response to this captive audience, marketing teams spend immense amounts of time and budget crafting the perfect creative, targeting precise demographics, and optimising their ad bids across platforms like Meta, LinkedIn, and TikTok.

Keeping Critical Systems Online Across Dynamic Operational Locations

Keeping critical systems online has always been a technical challenge, but the scale of that challenge shifts considerably when operations span multiple physical locations, none of which are fixed. Field sites, temporary installations, marine vessels, mobile command units, and dispersed industrial assets all place unique demands on the infrastructure designed to keep them running. In these environments, avoiding downtime and maintaining business continuity is not simply a matter of patching software or monitoring a server room.

Escaping the Diderot effect: How to avoid tech-driven spending

Top Tips is a weekly column where we explore emerging trends in technology and share practical ways to stay ahead. This week, we're looking at how technology can nudge us into unnecessary spending—and how to avoid it. Have you ever bought one thing and then felt the need to buy several more to match it? If so, you've experienced what is known as the Diderot Effect. The term comes from the life of Denis Diderot, a famous French philosopher who spent much of his wealth in a matter of months.

Shopify outage affects stores, admin panels, and APIs on June 3, 2026

On June 3, 2026, Shopify experienced a widespread service disruption that affected merchants and customers across multiple regions. Users reported storefront failures, admin dashboard issues, API connectivity problems, and authentication errors that disrupted ecommerce operations for several hours. While the outage did not affect every Shopify customer, reports quickly began arriving from around the world, indicating a significant platform issue.
Sponsored Post

How APM fits into the modern observability stack

Most engineering teams don't have a data problem. They have an interpretation problem. Prometheus is running, logs are shipping to the aggregator, dashboards are green, and then a latency spike hits and the root cause takes 45 minutes to isolate. The data was there but the answer wasn't. That gap is where application performance monitoring (APM) operates. This article explores what APM adds to a modern observability stack, why relying on standalone tools leaves critical blind spots, and how teams can unify infrastructure data with application context for a complete operational picture.

Autonomous IT Is Here. Are You Prepared?

Enterprise IT was built for a more predictable workplace, where support began when an employee reported a problem and IT worked backward from the details they could provide. That model made sense when devices, applications, and ways of working were easier to control. Today, the digital workplace moves too quickly for IT to rely on reported issues alone. By the time a ticket appears, employees may have already lost time, worked around the problem, abandoned the tool, or turned to an unmanaged alternative.

Claude Code Observability at Scale: How We Did It With Bindplane

At Bindplane, we iterate fast. One of the most important tools we've adopted across our organization is Claude Code. It helps every team here build solutions to complex problems with both speed and precision. But speed without visibility is a liability. We needed a reliable way to monitor and audit how Claude Code was being used across our team. Luckily, we build the best platform on the market for data in motion.

How to Build a Cost-Effective Log Retention Strategy

Nearly every home has that drawer or doom corner where you store all those items that you don’t need every day but that you still want to keep for those “just in case moments.” If you’re a document connoisseur, you may have financial documents that go back years because an accountant once warned you that an IRS audit would require seven years of back documentation. In short, you have a lot of documents that you may or may not need taking up a lot of room in your home.

Introducing Bits Agent Builder: Build agentic workflows for alert response and remediation

Building automated workflows that adapt to real-world complexity can be a challenge. As systems scale and scenarios multiply, teams often end up hardcoding endless logic branches just to handle every potential outcome. That’s why we’re introducing Bits Agent Builder, a powerful new tool that lets you create custom AI agents that are fully hosted by Datadog.

AI Observability Deep Dive Demo | Grafana Cloud

Grafana AI Observability is our new database and platform for observing AI Agents. Over the past year at Grafana Labs, we built Agents and needed a way to understand how they perform, what they cost, what their error rate and time to first token are, and how they behave. Grafana Staff Engineer Ivana Hučková provides a deep dive demo on how Grafana AI Observability connects our experience building Agents with our experience building observability systems.

Anomaly Detection and Forecasting That Learns From Every Write in InfluxDB

For many operational time series workloads, machine learning can’t operate in the historical way, where data is compiled once and models are trained offline. Sensor readings, infrastructure metrics, application telemetry, energy data, industrial measurements, and financial ticks all share a basic property: the next datapoint is more useful when the system can respond to it immediately (or at least close to immediately).

Automating Device and OS Compliance in Air-Gapped Networks with Agentic AI

For network operations and security teams, maintaining compliance across device hardware and operating systems is a complex and time-consuming task. At any given moment, your network contains thousands of devices from dozens of different vendors. To keep this infrastructure secure, you must constantly know which devices are approaching end-of-life (EOL) milestones, and which platforms are vulnerable to active common vulnerabilities and exposures (CVEs).

Internet Performance Monitoring: Understand Digital Experience from the User's Perspective

Internet Performance Monitoring (IPM) provides end-to-end visibility into what happens between your infrastructure and your users, across networks and services you don’t own or control. The internet is your network now. Your apps live in the cloud, your users are everywhere, and the systems that deliver your applications and services to them use hundreds of providers, ISPs, and networks beyond your control. In practice, that means infrastructure monitoring is the foundation.

Why Observability Is Essential for Platform Engineers?

Observability is how platform teams stop being the answer to every question and start building platforms that answer those questions themselves. This article explains specifically how observability enables platform engineers to support development teams better: reducing ticket volume, cutting MTTR, enabling SLO ownership, and making microservice debugging something devs can do without escalating to you.

Grafana Assistant Context Offloading

Context Offloading is a pipeline solution for managing Observability with AI Agents. If you are building AI Agents that work with real data, the context window can very easily get filled with bloated context that the Agent does not really need. Sven demonstrates "Context Offloading", a solution that stores the JSON result and sends only the summary of the JSON blob, making the LLM loop performance much quicker and keeping your context window small.

What is Cloud Infrastructure? Everything You Need to Know

Modern businesses need infrastructure that can scale as quickly as their demands change. Yet many organizations still struggle with infrastructure that is costly to maintain, difficult to expand, and slow to adapt to new requirements. As applications, users, and data continue to grow, managing resources efficiently becomes increasingly challenging. Cloud infrastructure provides a more flexible approach.

How to debug REST Collector APIs with Cribl REST Collector Diagnostics

This video introduces the new REST Collector Diagnostics feature in Cribl, which helps you troubleshoot API collection issues faster. It’s designed for observability and data engineers who use REST Collector to pull data from external APIs and need deeper visibility into HTTP requests, responses, and errors.

Observability for Healthcare Systems | Grafana Everywhere

Grafana Assistant is going places you might not expect — including healthcare. Golden Grot winner Oren Lion from TeleTracking reveals how Grafana Cloud supports their systems that help keep patient care moving — and how Assistant enables teams to get from “what happened?” to “here’s why” faster. From moon landings to patient care, Grafana is everywhere. Congratulations to Oren, Chris Johnson, Mark Munson, and the entire TeleTracking team on winning this year's Golden Grot Award for Pioneering AI in Observability!

Speed with Confidence: Managing Delivery Risk in an AI-driven Development World

In the modern development landscape, we are seeing a shift in how work is managed. The rise of AI-assisted development and highly distributed teams means that work is moving faster than ever before. However, this increased velocity often comes with a hidden tax: complexity. We are seeing more parallel work streams, more intricate dependencies, and a constant stream of shifting priorities. In this environment, simply moving fast is not enough to guarantee success.

Getting Started with NinjaOne dashboards

If you manage endpoints for a living, you'll know the problem isn't a lack of data. It's that there's too much of it, scattered across too many places. A modern IT team or MSP might be looking after thousands of devices spread across dozens of customer organizations, each generating a constant stream of alerts, patch results, antivirus events and disk warnings. NinjaOne does a great job of collecting all of that.

The Silent Killer of IBM MQ: How One Leaky App Can Crash Your Entire Estate

A single leaky application can crash your entire IBM MQ estate by consuming OS resources through unclosed connections. Traditional monitoring misses these silent killers. Learn how proactive observability detects OPPROCS anomalies before they trigger infrastructure failures.

Apache ActiveMQ 5.19.7 and 6.2.6

On May 27, the Apache ActiveMQ project shipped two releases on the same day: 5.19.7 and 6.2.6. Look at the changelogs side by side and the story is clear — this isn’t a feature drop. It’s a coordinated security-hardening pass applied to both maintained branches of ActiveMQ Classic at once, with the same fixes deliberately backported so that no supported line is left behind.

Upgrading to ActiveMQ 5.19.7 or 6.2.6

The latest Apache ActiveMQ releases – 5.19.7 and 6.2.6, both from May 27 – are good releases to apply. They close known dependency CVEs and tighten the broker’s default posture. (We covered the full list of changes in our release overview.) But here’s the catch with any “secure-by-default” update: hardening defaults means turning things off.
Sponsored Post

Increase customer retention & stop leaving money in the shopping cart

We all know the pain and frustration associated with broken software. It's no secret that the internet is rife with broken links, slow pages, and broken shopping carts, often feeling like it's being held together with glue and duct tape. These issues don't just frustrate customers; they cost businesses millions. According to the Consortium for Information and Software Quality, poor software quality cost US companies $2.08 trillion in 2020. Every interaction between a customer and your technology is an opportunity to build or destroy trust.

DevEx Talks ep 5 - Accessibility in open source and beyond

In this episode of DevEx Talks, Mike Gifford shares his perspective on accessibility in today’s tech landscape, drawing from his extensive experience in the field. Together, we explore how well accessibility is currently defined, whether the industry is truly meeting the needs of professionals who rely on it, and what gaps still exist. We also discuss the growing importance of accessibility within Developer Relations and where the biggest opportunities lie to create more inclusive tools, communities, and workflows.

How to generate real-world load tests using Grafana Cloud k6 and production telemetry

For many development teams, a load test starts with a set of assumptions. You pick 100 virtual users because it sounds reasonable. You ramp for 30 seconds because that's what the tutorial showed. You set a 500ms threshold because it feels like a good target. The test passes, you ship the release, and production falls over at 6 p.m. on a Tuesday because your synthetic load never resembled how real users interact with your application.

Your AI App Is Lying to You - Here's How to Fix That #devops #observability #programming

You shipped your AI app. But do you have all the answers? Do you actually know which model ran, how many tokens it consumed, or why it stopped? This is what LLM observability gives you, and most AI engineers are skipping it entirely. I built an SOS detection app and used OpenTelemetry to get full visibility into every single call. Token usage, model version, finish reason, and cost per call all in one place, standardised across any provider. Check out the OpenTelemetry GenAI docs in the link below; there is a lot more you can track than you think.

Autonomous Error Remediation in Cursor with Lightrun MCP

Lightrun's Gidi Freud demonstrates how your AI coding agent can now investigate and fix production errors, autonomously. Watch how Cursor, guided by Lightrun's Error Remediation skill, picks up a Sentry error, instruments the live service with a runtime snapshot, captures real evidence, and opens a validated PR for approval.

What Enterprise AI Gets Wrong About Usage

AI is moving out of the experimental phase and into the everyday rhythm of work. Teams are no longer using it occasionally for novelty or quick wins, but instead are exploring more robust use cases to investigate issues, answer questions faster, surface context, and help them move through complex workflows with more confidence. That’s the shift that most organizations’ leadership teams have been asking for.

Cribl Search Pack for Zscaler: Setup & security dashboard walkthrough

Learn how to install and configure the Cribl Search Pack for Zscaler, then walk through prebuilt dashboards for your Zscaler security logs. This video is for security engineers, Zscaler administrators, and SOC/observability teams using Cribl Search to monitor and investigate Zscaler activity. If you need a reminder or want to share feedback on the pack, you can always refer to the README bundled with the pack or reach out to the Cribl team.

The Hidden Cost of Network Blind Spots (and How to Fix It)

Even the smallest gaps in infrastructure visibility can have a major impact on an enterprise. And as modern IT environments become more complex, expectations for uptime keep rising. Our recent webinar, The Hidden Cost of Network Blind Spots and Alert Noise, covered this exact topic. The Progress WhatsUp Gold product experts explored why traditional monitoring falls short and shared best practices for moving toward smarter, more proactive network management.

Best APM for Small Teams Without Dedicated DevOps in 2026

You don’t have an SRE. There’s no platform team. Your “monitoring strategy” is someone checking Slack for error alerts. When production breaks, the same two or three senior devs drop everything to debug. Sound familiar? Most APM tools are built for organizations with dedicated operations staff. They assume someone has time to configure dashboards, tune alert thresholds, and learn a complex query language. That person does not exist on your team.

DNS Spy Now Has an MCP Server. Ask Your AI About Any Domain.

DNS monitoring should be simple. You want to know if something changed. You want to know if a record propagated. You want to know if a phishing site just went live with your brand name in the domain. But in practice it takes work. You log in to a dashboard. You click through menus. You run a check, copy the output, paste it somewhere else. You repeat that process every time someone on the team asks a question. AI assistants like Claude and ChatGPT could help.

Best Error Monitoring for Rails in 2026

You deploy on Friday. Sidekiq starts failing on a job that worked fine in staging. Your error tool shows you a NoMethodError on line 47. But it doesn’t tell you that the job only fails when processing records created after the migration you ran on Thursday. The stack trace is correct and completely useless at the same time. This is the core problem with general-purpose error monitoring on Rails apps. Rails teams deal with N+1 queries that cascade into timeout errors.

IBM Think 2026 Infrastructure Insights for IT Leaders

IBM Think 2026 made one thing clear: infrastructure leaders are being asked to support more AI, more automation, and faster decision-making without adding unnecessary complexity or risk. Held earlier this month in Boston, IBM Think 2026 focused heavily on enterprise AI, hybrid cloud, automation, governance, and operational transformation.

May 2026 product updates

We’ve been busy shipping new features and enhancements to help you monitor critical services more effectively, investigate incidents faster, and customize your StatusGator experience. This month’s updates include historical outage reports, our new Datadog integration, expanded monitoring coverage in Asia Pacific, improved email branding options, and performance upgrades for monitor metrics. We also crossed a major milestone with more than 8,000 services now monitored by StatusGator.

Service Desk Automation: What It Is and How to Get Started

How much of service desk work is problem solving and how much is repeat work that continues every day? Most service desks follow the same pattern daily. Password resets, access requests, software installs, approvals, and routine fixes keep coming in. These tasks are simple on their own, yet together they take most of the team’s time and push important incidents further down the queue. The main challenge is the constant flow of repeat work that reduces time for focused tasks.

How Support Uses Honeycomb to Debug Honeycomb

You'd think that working at an observability company means everyone knows exactly where to find everything in the data. It doesn't. Especially not on the support team. We're the ones who get the tickets. We're in the telemetry every day trying to figure out what went wrong for a customer, and we do that by pointing Honeycomb at itself. Here's how that actually works, and how it's changed.

The Observability Journey: Getty Images and Cribl

I recently sat down with Simon Overbey and Lovepreet Singh, the engineering manager and systems engineer (respectively) at Getty Images, to talk about their experiences implementing Cribl. After getting a rundown of the pre-Cribl environment (described above), I asked to jump straight to the end: the net benefits. If the "before" was a terrifying tidal wave of cost and complexity, what did the "after" look like?

Federated Search | From Silos to Insight | Azure Blob Schema Discovery with Splunk's Crawler

This walk-through shows how Splunk's Cloud can discover schema and partition keys for Microsoft Azure Blob Storage datasets and create searchable Splunk-managed tables. Once the data is mapped, analysts can use Splunk Federated Search to query Azure Blob data where it lives, bringing cloud-resident logs into security, observability, and operational workflows without re-ingesting the data.

DataPrime at ingest (DPXL): See the impact of any routing decision

TCO policies have always been one of the most impactful cost levers in Coralogix. Route business-critical data to High, push monitoring data to Medium, archive compliance logs to Low. With the addition of DataPrime expressions (DPXL) – a subset of the DataPrime query language designed for inline filtering at ingest – that routing became even more precise, matching on any field in the event payload, not just application, subsystem, and severity.

Lightweight Server Monitoring - One Binary, No Stack

Monitoring a single server should not require running four daemons. Yet the default open-source recipe for “I just want to watch this one box” still looks like this: install node_exporter, stand up a Prometheus server to scrape it, add Grafana to draw the graphs, and bolt on Alertmanager so you actually hear about a full disk. That is a lot of moving parts — and a lot of YAML — for one machine. This post shows a lighter path.

You don't need a paid plan to use AI Root Cause Analysis

When an error appears in production, the hardest part often isn’t seeing what broke. It’s understanding why. That’s why we built Root Cause Analysis (RCA). It helps connect the dots between an error and its likely cause, so you can spend less time investigating and more time moving forward. Until now, RCA was only available through plans that included AI credits. Starting today, free plan users can purchase an AI credit subscription and use RCA without changing plans.

Splunk Observability at Cisco Live: Agentic Observability for the AI Era

Observability has always been about seeing clearly under pressure. But the pressure has changed. Applications are more distributed. Kubernetes environments keep expanding. Digital experiences depend on services, APIs, networks, third-party providers, and now AI models and agents that can make decisions faster than a human team can review every signal.

How LivePerson optimized Logstash and Kafka performance on GCP through benchmarking

By benchmarking five GCP machine types across both Logstash and Kafka, LivePerson's observability team found that infrastructure selection (not just pipeline configuration) is one of the highest-leverage cost optimization decisions at scale.

Observability Summit NA 2026: What the Community Is Thinking About

Two days in Minneapolis with the OpenTelemetry community, talking about where telemetry pipelines are headed and what the AI wave is doing to them. Two topics dominated everything: AI and cost reduction. Not as separate conversations, either. The more the community talked about AI telemetry, the more the cost question followed right behind it. I joined Diana Todea from VictoriaMetrics and Antonio Jimenez Martinez from Cisco ThousandEyes on the Telemetry That Matters panel.

May 2026 Early Warning Signals

In May 2026, StatusGator detected 854 Early Warning Signals across SaaS, cloud, developer, and infrastructure services. Of those incidents, 695 were never acknowledged by providers, while 159 were eventually confirmed on official status pages. Throughout the month, StatusGator’s Early Warning Signals continued to surface emerging outages before many providers published updates, giving teams valuable time to investigate and respond.

Microsoft DNS management in OpUtils: One console for complete control

For network administrators, managing DNS has traditionally meant juggling zones and records across separate server interfaces, manually tracking changes, and responding to resolution failures after they’ve already caused disruption. We’re excited to introduce Microsoft DNS management in ManageEngine OpUtils, bringing DNS zone and record administration directly into the same console you already use for IP address management (IPAM).

What a Forrester TEI study on Edwin AI actually tells IT leaders - and how to use it

This blog helps IT leaders use the Forrester Consulting TEI study as a practical framework for evaluating Edwin AI in their own environments. A Total Economic Impact study is useful for one critical reason: it takes a broad technology claim and turns it into a financial and operational framework. That matters in AI for IT operations because the market is crowded with claims. Every platform says it reduces noise. Every platform says it improves efficiency. Every platform says it helps teams move faster.

15 DevOps Metrics Every Engineering Team Should Track in 2026

Software moves from code to production more quickly today, but it is still difficult to tell whether delivery is actually improving or just becoming more active. Most teams rely on dashboards filled with metrics like deployments, uptime, failures, and tickets. The numbers are available, but the meaning behind them is often unclear. DevOps metrics become useful only when grouped into clear categories: DORA metrics cover only delivery speed and stability, which is just part of the picture.

Migrate to Azure Managed Redis with Datadog and Eden

Azure Managed Redis is a Microsoft first-party, fully managed in-memory data store, replacing Azure Cache for Redis tiers. It includes Redis Enterprise features such as RediSearch for vector search and full-text search, in addition to RedisJSON, RedisTimeSeries, and Active Geo-Replication. As Azure Cache for Redis reaches end of life, more teams are planning migrations to Azure Managed Redis in search of better performance, lower cost, and modern capabilities for AI and real-time workloads.

How we cut Spark compute costs by 44% with agentic AI and Datadog Jobs Monitoring

Spark jobs only get more expensive and harder to debug as they scale. It’s a problem we’ve run into ourselves. Our Referential Data Platform team builds and maintains the knowledge graph that maps relationships between customers’ observability entities. ServiceQueryEdge is at the center of that graph, mapping service entities to their associated metric and log queries.

Shifting Streams and AI Surges: What Our Data Reveals About the OTT Landscape

OTT data from early 2026 shows streaming hierarchies holding steady while AI platforms reshuffled rapidly. Claude has substantially increased traffic since January, overtaking Gemini, and is on pace to challenge ChatGPT by fall. Doug Madory digs into the data in this new analysis.

Inside the Grafana AI Team Weekly: AI Observability for the OTel demo and LLMSpec (May 12, 2026)

This is an excerpt from a real AI team weekly meeting where we talk about the stuff we build and occasionally also demo it! In this one, Principal Software Engineer Sven Großmann demos how he integrated AI Observability into the OTel demo, complete with the guards feature he introduced last week, and Principal Software Engineer Yas Ekinci gives a rare glimpse of LLMSpec, the internal counterpart of the o11ybench benchmark that we use to evaluate Assistant.

What's New in Tempo 3.0

Tempo 3.0 introduces a major architectural shift that decouples the read and write paths, with Kafka handling durability on the write side and a new live store serving recent traces on the read side. Blocks are now written at a replication factor of one instead of three, significantly reducing storage overhead. This release also brings TraceQL metrics to general availability, adds comparison operators for filtering metric results at query time, and introduces a new Tempo CLI redact command for removing sensitive trace data on demand without waiting for retention to expire.

A deep dive into AWS data perimeter misconfigurations

In AWS environments, a data perimeter is a set of preventative controls that help ensure that your trusted cloud identities (principals or AWS services acting on your behalf) are accessing trusted resources from authorized networks. You can apply these controls at various levels of your infrastructure, such as per resource or across all resources in your AWS account.

Tempo 3.0 release: a new architecture for scale and lower TCO, TraceQL metrics GA, and more

Tempo started with a simple goal: make distributed tracing easier to run at scale. As tracing adoption has grown, however, so have the challenges, including higher data volumes, more complex architectures, and increasing demand for real-time insights directly from traces. Over the last year, we’ve been evolving Tempo’s architecture to meet that moment. And today, we’re sharing the results of those efforts with the release of Tempo 3.0.