
The Perfect Storm: How We Talk About Disasters

Failures are inevitable. Once in a long while, a failure becomes a major outage large enough to cause irreversible damage to your company's brand and reputation. During these rare events, how you communicate with customers can make or break the relationships you've built with them over the years. But when the blast radius of a technical outage is big enough to require involvement from other parts of your company (such as legal, marketing, and sales), many companies inadvertently make problems worse.

Topping top! New Real-Time Process Monitoring

What are the essential things to monitor in your infrastructure? Sure, CPU utilization, memory usage, and IO throughput. However, once you notice a significant load somewhere in your infrastructure, you want to know what is causing it. That typically boils down to finding the process that’s using too much CPU or memory, or that’s doing disk or network IO like there’s no tomorrow.

Kubernetes issues and solutions

Hi all! I am part of the architecture team at Avito.ru, one of the world’s top classifieds sites (read more about Avito here). In this post I want to share our experience implementing Kubernetes at scale. Kubernetes is a powerful orchestration tool that helps us manage dozens of microservices and supports robust, fast deployments. It’s really cool that we don’t have to manage resources manually, think about service discovery, and so on.

LaborDuty: Incident Response For Baby's Arrival

Real-time operations is a term PagerDuty uses to describe the process in which people can acknowledge, communicate, resolve, and learn from impactful events, all in real time. What could be more impactful and real-time than the miracle of childbirth? Whether it’s your first or fifth child, things don’t always go as planned, but the experience generally comes with a good story filled with hindsight.