Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Convert OpenTelemetry Traces to Metrics using SpanMetrics Connector

What if you have already implemented tracing but lack robust metrics capabilities? Enter the SpanMetrics Connector: a tool that bridges this gap by converting trace data into actionable metrics. This post details the workings of the SpanMetrics Connector, providing a guide to its configuration and implementation.

Cribl's Blueprint for Secure Software Development

What does it take to build software for the most security-demanding customers worldwide? At Cribl, building secure products is integral to our engineering identity. We have established a secure software development lifecycle that is both culturally and policy-driven, integrating product security tooling and processes into every architecture review, pull request, and release, whether major or minor.

Introduction to Ingesting Logs into Loki with Fluentd and Fluent Bit | Zero to Hero: Loki | Grafana

Have you just discovered Grafana Loki and plan to use Fluentd or Fluent Bit as your telemetry collector? Or are you trying to decide which agent is right for you? In this "Zero to Hero" episode, we cover the basics of Fluentd and Fluent Bit, highlighting their differences and helping you determine when to use one over the other. We also guide you through configuring both agents' Loki plugins to write logs directly into Loki.

Learning Moment: Effective Customer Communication During Incidents - Enhance Visibility & Response with Uptime.com

The recent global outage caused by an operating system update reminded me how vulnerable we are today and, most importantly, how close we always are to global-scale incidents, given millions of interconnected dependencies. When the base of the house collapses, everything built on top is impacted. Those of us in IT Operations, Monitoring, Observability (insert the current acronym), etc., know this risk firsthand; we face it every day.

Chaos Testing Explained

Chaos testing is a part of site reliability engineering (SRE). In chaos testing, we intentionally break things in and around a given application. The purpose of chaos testing is to assess how software systems respond to scenarios like network outages, hardware failures, database failures, and server or cluster node failures in the infrastructure.

How OTel Empowers You to Handle Unified Data

Discover the power of OpenTelemetry to consolidate your telemetry data. Our expert-led workshop demonstrates standardization techniques for metrics, logs, and traces. Delve into real-world applications, including capturing Prometheus metrics, managing logs with Fluentd and Fluent Bit, and collecting traces with Jaeger.

Global Microsoft Outage and Preventing Future Vulnerabilities

In a recent unexpected turn of events, a faulty component in the latest CrowdStrike Falcon update led to widespread outages, crashing Windows systems globally. The repercussions were felt across various sectors, including airports, TV stations, hospitals, and even emergency services in the U.S. and Canada. The glitch affected both Windows workstations and servers, bringing entire companies to a standstill and crashing fleets of hundreds of thousands of computers.