Embracing failure and chaos to improve system reliability and SRE team performance
In this interview with Alex Hidalgo, Field CTO at Nobl9 and author of Implementing Service Level Objectives (O’Reilly Media), we explore how traditional “mean time” metrics such as MTTR, and the wider MTTx family, can give a false sense of reliability.
Alex shares how SRE teams can embrace failure, build psychological safety, and design systems that reflect the human factor behind uptime, outages, and real-world reliability.
0:00 Alex Hidalgo on redefining reliability
1:00 How Alex stumbled into SRE and the Google experience
3:00 The problem with “mean time” metrics
5:10 Why counting incidents doesn’t equal reliability
7:30 MTTR and MTTx: why averages mislead teams
10:40 The human factor in outages and recovery
13:10 Incentives gone wrong – the cobra effect of metrics
14:50 Bringing reliability early into product design
16:40 Blameless culture and psychological safety in SRE
19:20 Personal story: deleting production data
21:00 Guardrails, process fixes, and learning loops
23:10 Why embracing failure improves system health
Additional Resources:
Learn more about Elastic Observability: https://www.elastic.co/observability
Start the 14-day trial for free! No credit card required: https://cloud.elastic.co/registration