Learning from Failures: Better Crash Reporting for Better Incident Response
Crashes are among the more serious problems that can occur when operating a service. A crashing component often triggers cascading failures and service outages. Visibility into crash events is therefore critical, both for assessing the extent of the damage and for preventing recurrences. Unfortunately, crashes are also among the harder failures to debug: the state of a crashed process is often compromised, so the process can’t be trusted to collect debugging information on its own.