Operations | Monitoring | ITSM | DevOps | Cloud

May 2020

Kubernetes Operators for Automated SRE

It can be quite challenging for an SRE team to maintain the well-being of a large-scale Kubernetes based system with hundreds or thousands of services. In this blog post, Gigi Sayfan, author of “Mastering Kubernetes”, outlines the SRE challenge and how we can achieve the ultimate goal of automated SRE with Kubernetes operators.

On-call On-boarding Checklist

And it starts with the company culture. Irrespective of how small or large your team is, it’s wise to invest some time in creating a good on-call onboarding plan. A humane on-call is the mark of a good engineering culture. Being on-call means that you’re expected to be reachable for any issues that may occur during your shift. It’s easy to lose any and all motivation by just anxiously anticipating that mid-dinner ping.

Best Practices in Incident Management

In an always-on world, companies look to systems and processes to keep their services up and running at all times. The most important part of maintaining this uptime is having an Incident Management process in place to restore your services in the event of an interruption or unplanned downtime. Incident Management processes are typically used by SRE, DevOps, NOC and other IT teams to respond to incidents that affect services and work on restoring their uptime.