
The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Sponsored Post

How to improve your Crash Free Users score in minutes

If you're reading this blog, you likely already know the importance of quality software. But with so many metrics available to monitor and improve, development teams struggle to decide which ones to prioritize for the greatest impact. The Crash Free Users score in Raygun is a good place for development teams who care about software quality to focus their efforts. It tells you what percentage of users didn't encounter a crash or error while using your software, which makes it a useful north star for gauging overall quality.
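As a rough illustration of the percentage described above, here is a minimal sketch in Python. The function name, its parameters, and the sample numbers are hypothetical and are not taken from Raygun's product or API; the sketch only assumes that the score is the share of users who saw no crash or error in a given period.

```python
def crash_free_users_pct(total_users: int, users_with_crash: int) -> float:
    """Return the percentage of users who did not hit a crash or error.

    Hypothetical helper for illustration only; not part of any Raygun API.
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    if not 0 <= users_with_crash <= total_users:
        raise ValueError("users_with_crash must be between 0 and total_users")
    # Users without a crash, as a share of all users, expressed in percent.
    return 100.0 * (total_users - users_with_crash) / total_users


# Made-up example: 2,000 active users, 50 of whom saw at least one crash.
score = crash_free_users_pct(total_users=2000, users_with_crash=50)
print(f"Crash Free Users: {score:.1f}%")
```

Counting affected users rather than individual crash events means one user who hits the same crash repeatedly lowers the score only once.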

Detecting incidents without components

StatusGator monitors services and their individual components, so you can stay informed about the systems you rely on – and filter down to only the components you care about. Most status pages do a good job of tagging incidents to the affected components. But sometimes providers publish incident updates without marking any components as impacted, even when the incident clearly affects something real.

January 2026: IsDown Users Saved 9.2 Hours with Early Outage Detection

In January 2026, IsDown's early detection system gave users a cumulative advantage of 9.2 hours across 34 incidents — that's over half a business day of advance warning before vendors officially acknowledged their outages. The largest single detection advantage? A massive 2.2 hours for a SendGrid email delivery issue that left customers in the dark while their emails failed to reach Microsoft inboxes.
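To put the headline figures in perspective, the short Python sketch below works out the average head start per incident and the share contributed by the largest detection. It uses only the three numbers quoted above; the variable names are illustrative, and the per-incident breakdown is not given in the teaser.

```python
# Figures quoted in the teaser above.
total_advantage_hours = 9.2   # cumulative early-detection advantage
incident_count = 34           # incidents detected early
largest_advantage_hours = 2.2 # the SendGrid incident

# Average head start per incident, converted from hours to minutes.
average_advantage_minutes = total_advantage_hours * 60 / incident_count

# Share of the cumulative advantage that came from the single largest detection.
largest_share_pct = 100 * largest_advantage_hours / total_advantage_hours

print(f"Average advantage per incident: {average_advantage_minutes:.1f} minutes")
print(f"Largest incident's share of the total: {largest_share_pct:.1f}%")
```

The average hides a wide spread: a single incident accounts for roughly a quarter of the total, so most detections were ahead by considerably less than 2.2 hours.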

How an AI assistant and MCP server deliver real-time cloud cost insights

Cloud costs don’t grow quietly. They spike, drift, and surprise teams at the worst possible moments, usually when someone finally opens a dashboard. While cloud cost management tools are powerful, getting quick answers often still means navigating multiple views, applying filters, exporting reports, and looping in the right people. But what if cloud cost analysis worked more like a conversation?

What is agentic AI? (explained in 60 seconds)

Agentic AI is the next evolution of artificial intelligence. Unlike traditional AI, it can act autonomously and make decisions on its own. Here’s what that actually means, without the hype. About Elastic: Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform, the development platform used by thousands of companies, including more than 50% of the Fortune 500.

AI NetOps: How AI and Machine Learning Transform Network Operations

AI is changing network operations (NetOps) from static automation into adaptive, data-driven systems that can summarize incidents, retrieve knowledge, and guide remediation with human oversight. In this talk, Phil Gervasi breaks down what “AI for NetOps” really means in practice, including the difference between classical ML and large language models (LLMs), why data pipelines matter more than model tuning, and how patterns like RAG (retrieval augmented generation), text-to-SQL, and agentic workflows turn raw telemetry into decisions.

Heartbeat behind the metrics | Muraleedharan on support, scale, and seeing the product in the wild

What does observability look like when you’re responsible for customers at scale? In this episode of Heartbeat Behind the Metrics, Muraleedharan Sadhasivam, Head of Customer Success, talks about his 15-year journey at ManageEngine and the perspective you only get from being close to customers every day. He shares why custom dashboards matter so much, and why AppLogs is a feature he wishes more users explored to complete the MELT story. From querying logs to turning them into alerts and dashboards, he explains how real insights start when data is brought together.

Track cyber security with Reports in Digital Risk Analyzer

Discover how Site24x7’s Digital Risk Analyzer Reports help you instantly uncover vulnerabilities and assess multi-domain risks. In this quick walkthrough, learn how to view domain health, generate detailed or consolidated reports, schedule automated delivery, and share PDF insights with your team. Perfect for IT admins, DevOps, MSPs, and business leaders who want fast, actionable visibility into their cybersecurity posture.

Continuous profiling in production: A real-world example to measure benefits and costs

Continuous profiling offers deep visibility into production environments, revealing exactly how applications consume CPU and memory. It’s the go-to observability practice for directly connecting system behavior and performance to specific lines of code. But when teams consider deploying continuous profiling more broadly, a common question comes up: what’s the overhead? Is it safe to run continuous profiling on my production services 24/7, or does the cost outweigh the benefits?