The latest news and information on Service Reliability Engineering and related technologies.
DevOps teams and site reliability engineers (SREs) contend with a never-ending flood of notifications and alerts about outages, potential threats, and other incidents. Companies rely on their DevOps teams to not only keep abreast of all the notifications but also to identify and prioritize the critical alerts and resolve problems in a timely manner. Yet in 2021, International Data Corporation (IDC) reported that companies with 500-1,499 employees ignored or failed to investigate 27% of all alerts.
High-cardinality woes are frequent and widespread in today's cloud-native environments. What does high cardinality mean, and why is it such a pressing problem?
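As a rough illustration of what cardinality means here, the sketch below computes the worst-case number of distinct time series for a single metric as the product of the number of values each label can take. The label names and values are hypothetical and chosen only for the example.

```python
# Hypothetical label sets for one metric; names and values are illustrative only.
labels = {
    "region": ["us-east", "us-west", "eu-central"],
    "status_code": ["200", "404", "500"],
    "pod": [f"pod-{i}" for i in range(100)],
}


def worst_case_series(label_values):
    """Upper bound on distinct time series: product of per-label value counts."""
    total = 1
    for values in label_values.values():
        total *= len(values)
    return total


series_count = worst_case_series(labels)
print(series_count)
```

Under these assumptions, three small labels already allow up to 900 series for one metric. Adding an unbounded label such as a user ID multiplies that figure again, which is why cardinality tends to grow faster than teams expect.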
How to filter metrics by labels using OpenTelemetry Collector.
Whoever owns reliability should define its parameters. But who owns the reliability of a product: Engineering, Product Management, or the Customer Success team?
From Robocars to Reliability: SRE through the lens of self-driving cars, mapping where the observability space stands in relation to them.
The reliability industry needs a managed answer, free of vendor lock-in, to spiraling costs, high cardinality, and the toil of managing a time series database (TSDB).
Most SRE teams eventually reach a point where they appear unable to meet all the demands placed upon them. This is when a team may need to scale. However, increasing a team's capacity is not the same as increasing the number of people on it. Let's unpack what scaling a team is all about: what the indicators are, what steps you can take, and how to know when you're done.