Telemetry pipeline management at any scale: Fleet Management in Grafana Cloud is generally available

We announced Fleet Management in Grafana Cloud last year to solve the pain points that come with managing dozens, hundreds, or even thousands of telemetry collectors across departments and environments. And today we’re excited to announce that Fleet Management is generally available for all Grafana Cloud users who need help managing telemetry collector deployments at scale.

How a major retailer tested critical serverless systems with Failure Flags

Not too long ago, a customer came to us with a high-value use case. The customer, a major apparel company with retail and e-commerce applications, needed to prove that a critical service in their payment applications could fail over correctly between regions in case of an outage. But there was one snag: the service was built using AWS Lambda. Because of Lambda’s serverless model, infrastructure-focused tests would have trouble replicating the failure conditions needed to test the failover.

Easy debugging with Laravel breadcrumbs and Honeybadger

If you're building web applications and care about your users, Laravel breadcrumbs can help you debug why you're seeing an error, giving you greater insight into what users are experiencing. It's easy to take advantage of this feature and add breadcrumbs without much extra configuration, especially if you're already using Honeybadger. Here's a quick walkthrough.

What is agentic AI? The role of AI agents in DevOps automation

Agentic AI represents the next evolution in artificial intelligence systems, characterized by autonomous software entities that can independently pursue goals, make decisions, and take actions with minimal human supervision. Unlike traditional AI models that respond only to specific prompts, AI agents actively observe their environment, learn from feedback, and execute complex sequences of tasks to achieve defined objectives.

Why the Common Vulnerability Scoring System (CVSS) Is Necessary - But Also Insufficient

Measuring the risks posed by vulnerabilities as accurately as possible is no simple task. Organizations commonly use the Common Vulnerability Scoring System (CVSS) by default to come to terms with the size and scope of vulnerabilities. But while CVSS is a useful tool, it’s not immune to its own vulnerabilities.

Prometheus Port Configuration: A Detailed Guide

Setting up Prometheus should be straightforward, but when metrics stop flowing, it’s usually something simple—like a port issue. Misconfigure it, and suddenly, your whole monitoring setup feels like a guessing game. This guide breaks down how to configure Prometheus ports properly, whether you're sticking to defaults or need a custom setup.

Syslog Monitoring: A Guide to Log Management and Analysis

Relying on syslogs to debug issues at odd hours? It happens to the best of us. A solid syslog setup isn’t just about collecting logs—it’s about making them useful. This guide walks through setting up syslog, configuring it for better visibility, and using monitoring techniques that actually help when things go wrong. No fluff, just practical steps you can use right away.

Performance Impact of High Cardinality in Time-Series DBs

Time-series databases have become the backbone of modern observability, financial analytics, and IoT systems. But there's a common challenge that can bring even the most robust systems to their knees: high cardinality. When your database starts tracking millions of unique values across various dimensions, performance doesn't just dip—it can collapse entirely. Let's understand the technical details of what happens when cardinality spikes and how you can architect your systems to handle it.