The latest news and information on monitoring for websites, applications, APIs, infrastructure, and other technologies.

Designing Alerts for Action

In the first two posts of this series, we explored how alert noise emerges from design decisions, and why notification lists fail to create accountability when responsibility is unclear. There’s a deeper issue underneath both of those problems: many alerting systems are designed without a clear definition of the outcome they’re meant to produce. When teams don’t explicitly decide what they want to happen as a result of a signal, they default to the loudest option available.

Unlimited Team Sizes for All

Starting from today, Healthchecks.io users on all plans (Hobbyist, Supporter, Business, Business Plus) can invite an unlimited number of users into their projects. Previously, the limits were: 3 team members for Hobbyist and Supporter, 10 team members for Business, and unlimited team members for Business Plus. From now on, it is unlimited for all.

Top 6 Cloud Monitoring Challenges in Hybrid & Multi-Cloud Environments

Hybrid and multi-cloud monitoring breaks down when teams can’t connect signals to customer impact fast enough to act. Hybrid and multi-cloud sound simple: run some workloads in public cloud, keep some on-premises, and connect it all. But in practice, you’re managing dependencies across teams and systems, tools that don’t share context, and incidents that refuse to stay in one place.

Turn Raw Data into Reliability by Changing Performance Perspectives

In a global microservices architecture, technical performance initially presents as a chaotic stream of disconnected telemetry. For a Technical Program Manager (TPM), success depends on the ability to move past these individual data points and identify stable patterns. When services are entering critical states, inspecting individual logs or traces one at a time is inefficient. Protecting system reliability requires an engine that can automate pattern recognition at scale.

Productivity in the Age of AI - DEXOps 1:1 with Scott Pope

In the first of a new rotating expert series, Scott Pope (Nexthink's Director of Value Advisory) joins to explore DEXOps, productivity, and why DEX has firmly entered the boardroom conversation. We talk about how the market has evolved, what AI is really changing, how to communicate value to senior leaders, and the story behind the DEX Productivity Report. Also: Arsenal. Briefly. And yes, Tom still needs to update the show music. Hang in there.

Unlocking business resilience with full-stack observability in hybrid IT environments

For CIOs and technology leaders across the Gulf Cooperation Council (GCC), full-stack observability is a strategic lever for achieving faster ROI, operational resilience, and digital maturity. By integrating AI-powered insights and automation, IT leaders can streamline operations and align technology outcomes with business goals. Demonstrating ROI within tight timelines is critical, as is leveraging observability to maintain competitive advantage in a rapidly evolving market.

OpenTelemetry support for .NET 10: A behind-the-scenes look

At Grafana Labs, we are fully committed to the open source OpenTelemetry project and are actively engaged with the OTel community. Many Grafanistas spend a large proportion of their time contributing directly to OpenTelemetry upstream projects, helping make observability more powerful, reliable, and accessible for everyone as part of our big tent philosophy.

Teaching AI How to Refinery

At the beginning of February, we released v3.1 of Refinery, our advanced, tail-based sampling solution. The new version comes with further performance enhancements, bug fixes, and a few new pieces of telemetry. In tandem with the 3.1 release, we also released a new tool for our MCP server that helps your AIs understand Refinery and how Honeycomb handles sampling.