Hiteshwar is an SRE based out of Mumbai, India. His area of specialization is in distributed systems. He works on Kubernetes, running his own custom clusters, maintaining them and creating tools to manage and monitor them. He likes to share his learnings by writing articles and blogs on Medium and Linkedin. He is an active speaker in meetups and developer groups and also teaches DevOps and SRE practices at learning centers.
Arild Jensen, SRE Manager at Upwork, talks about his journey into SRE and some best practices he picked up along the way including implementing a blameless culture, code review and making decisions based on hard data.
When something goes down, the first thing a customer does is check if there is something wrong with their systems or if it is an issue with one of their service providers. So it’s important to make sure that your status page has all the information that is needed where they don’t feel the need to raise an issue or create a ticket, adding to your support costs.
A status page is a communication tool that allows you to display the current working status of your various services - whether fully functional, partially degraded, severely affected, etc. The nomenclature of the service status can be defined by you. On the status page, you can also access & update the uptime and incident history data for all your internal facing or customer impacting components.
Alert noise is a very common on-call complaint leading to fatigue and on-call burnout. This article is an attempt at helping folks address this problem.
Many organisations already possess a vast amount of existing data about production systems. As customer expectations evolve, organisations are often challenged to find more proactive ways of dealing with traditionally reactive incident response activity. In this talk, we discuss approaches to unlock value from this data by making it truly actionable.