Infrastructure, especially at scale, can become extremely complex to visualize and observe. If something goes wrong, it's difficult to fully understand the problem without a solid monitoring strategy. Information about CPU, RAM, and the behavior of your SSH or HTTP servers is critical to understanding the performance of your web application.
You’re ingesting 20,000 data points a second across 400,000 metrics from thousands of AWS instances – and your monitoring can’t handle the load. You need a scalable, highly available monitoring and dashboarding solution (and you need it yesterday). Should you build it yourself with an in-house Graphite or Prometheus monitoring system? Or will you skip the headache and choose a hosted service like MetricFire?
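To make the in-house option concrete, here's a minimal sketch of what feeding Graphite looks like: its Carbon listener accepts one metric per line over plain TCP, in the form `path value timestamp`. The host, port (2003 is Carbon's default plaintext port), and metric name below are illustrative assumptions, not part of any particular setup.

```python
import socket
import time

def send_metric(path, value, host="localhost", port=2003):
    """Send one data point to a Carbon listener using Graphite's
    plaintext protocol: "<metric.path> <value> <unix_timestamp>\n".
    Assumes a Carbon instance is reachable at host:port."""
    line = f"{path} {value} {int(time.time())}\n"
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("utf-8"))

# Hypothetical example: report CPU load for one web server.
send_metric("servers.web01.cpu.load", 0.75)
```

The protocol itself is trivial – the hard part of the in-house route is everything around it: running Carbon and the storage backend, scaling ingestion to hundreds of thousands of metrics, and keeping it all highly available.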