Operations | Monitoring | ITSM | DevOps | Cloud

November 2019

On-call doesn't have to be stressfull

“Being on-call is a critical duty that many operations and engineering teams must undertake to keep their services reliable and available. However, there are several pitfalls in the organization of on-call rotations and responsibilities that can lead to serious consequences for the services and the teams if not avoided.

The importance of GameDays

GameDays were first coined by Amazon’s “Master of Disaster” Jesse Robbins when he created them intending to increase reliability by purposefully creating major failures on pre-planned dates. Game Days help facilitate the values of chaos engineering. Chaos engineering is the disciplined practice of injecting failure into healthy systems. With modern IT services becoming increasingly sophisticated continuously changing systems, outages are inevitable.

Site Reliability Engineering-Why you should adopt SRE

Site reliability engineering was a term coined by Google engineer Benjamin Treynor in 2003 when he was tasked with making sure that Google services were reliable, secure and functional. He and his team eventually wrote the book on SRE which is available online for free for anyone interested in research and implementation of SRE best practices.