Embracing failure and chaos to improve system reliability and SRE team performance

Nov 3, 2025

In this interview with Alex Hidalgo, Field CTO at Nobl9 and author of Implementing Service Level Objectives (O’Reilly Media), we explore how traditional “mean time” metrics such as MTTR, and the wider MTTx family, can give a false sense of reliability.

Alex shares how SRE teams can embrace failure, build psychological safety, and design systems that reflect the human factor behind uptime, outages, and real-world reliability.

0:00 Alex Hidalgo on redefining reliability

1:00 How Alex stumbled into SRE and the Google experience

3:00 The problem with “mean time” metrics

5:10 Why counting incidents doesn’t equal reliability

7:30 MTTR and MTTx: why averages mislead teams

10:40 The human factor in outages and recovery

13:10 Incentives gone wrong: the cobra effect of metrics

14:50 Bringing reliability early into product design

16:40 Blameless culture and psychological safety in SRE

19:20 Personal story: deleting production data

21:00 Guardrails, process fixes, and learning loops

23:10 Why embracing failure improves system health

Additional Resources:
Learn more about Elastic Observability: https://www.elastic.co/observability

Start a free 14-day trial, no credit card required: https://cloud.elastic.co/registration
Subscribe to Elastic’s Community YT channel: https://www.youtube.com/c/OfficialElasticCommunity

Connect with us on social media:
LinkedIn: https://www.linkedin.com/company/elastic-co
X: https://twitter.com/elastic
Facebook: https://www.facebook.com/elastic.co

About Elastic
Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. Elastic’s solutions for search, observability, and security are built on the Elastic Search AI Platform — the development platform used by thousands of companies, including more than 50% of the Fortune 500.

#ElasticSearch #ElasticObservability #ElasticSecurity