Google Operations

How to use metrics scopes in Cloud Monitoring

You've got Cloud Monitoring all set up in your project, but what do you do if you need to manage multiple projects and unify monitoring across them? In this episode of Engineering for Reliability, we look at Cloud Monitoring metrics scopes and show you how to use them to monitor multiple Cloud projects. Watch to learn how to use the Cloud Console to manage metrics scopes, view metrics from resources in multiple projects, and automate configurations using the API!

Google Cloud Monitoring 101: Understanding metric types

Whether you are moving your applications to the cloud or modernizing them using Kubernetes, observing cloud-based workloads is more challenging than observing traditional deployments. When monitoring on-prem monoliths, operations teams had full visibility over the entire stack and full control over what telemetry data was collected and how (from infrastructure to platform to application data).

10 years of cloud infrastructure with Eric Brewer

In this video, Google Cloud Developer Advocate Stephanie Wong speaks with Google Fellow Eric Brewer about his experience building infrastructure, including Kubernetes, over the last decade at Google. You’ll get a window into what it was like to help propel Kubernetes into one of the largest open source projects today.

Observing container environments with Cloud Operations

Did you know GKE isn’t the only place you can run containers in Google Cloud? In this episode of Engineering for Reliability, we show three options for running containers, as well as how to instrument each one for observability with Cloud Operations. Watch to learn how Cloud Operations can help visualize metrics and analyze logs emitted by container workloads running on GKE, on Cloud Run, and on an Anthos cluster!

Better Kubernetes application monitoring with GKE workload metrics

The newly released 2021 Accelerate State of DevOps Report found that teams who excel at modern operational practices are 1.4 times more likely to report greater software delivery and operational performance and 1.8 times more likely to report better business outcomes. A foundational element of modern operational practices is having monitoring tooling in place to track, analyze, and alert on important metrics.

Monitoring compute infrastructure with the Cloud Ops Agent

How can you improve observability for workloads that use compute infrastructure directly and run on Google Compute Engine instances? In this episode of Engineering for Reliability, we show how you can use the Cloud Operations Agent to do just that. Watch to learn about the Cloud Operations Agent, how to install it manually and automatically, and how to use the data it collects to improve the reliability of your services and keep your users happy!

Maintaining reliable services with advanced Cloud Logging features

We’ve covered ingesting, routing, storing, and viewing logs from your services in Cloud Logging already, but what else can you do with all that data? In this episode of Engineering for Reliability, we show how you can use advanced features like alerting on logs, logs-based metrics, and capturing application exceptions in Error Reporting. Watch to learn how you can find issues faster, make your services more reliable, and keep your users happy.

How Lowe's SRE reduced its mean time to recovery (MTTR) by over 80 percent

The stakes of managing Lowes.com have never been higher, and that means spotting, troubleshooting and recovering from incidents as quickly as possible, so that customers can continue to do business on our site. To do that, it’s crucial to have solid incident engineering practices in place. Resolving an incident means mitigating the impact and/or restoring the service to its previous condition.

Understanding Apigee API Monitoring

Want to make sure the APIs you’ve launched on Apigee are performing as expected? In this video, we show how API Monitoring provides real-time insights into API traffic and performance, so you can solve problems as they happen. Watch to learn how you can stay informed and understand unusual events or patterns.