Service Level Objectives or SLOs are one of the fundamental principles of site reliability engineering. We use them to precisely quantify the reliability target we want to achieve in our service. We also use their inverse, error budgets, to make informed decisions about how much risk we can take on at any given time. This lets us determine, for example, whether we can go ahead with a push to production or infrastructure upgrade.
Our Stackdriver Monitoring tool works on Google Cloud Platform (GCP), Amazon Web Services (AWS) and even on-prem apps and services with partner tools like Blue Medora’s BindPlane. Monitoring keeps metrics for six weeks, because the operational value in monitoring metrics is often most important within a recent time window. For example, knowing the 99th percentile latency for your app may be useful for your DevOps team in the short term as they monitor applications on a day-to-day basis.