The latest News and Information on Service Reliability Engineering and related technologies.
Learn about the key roles within an incident response team, as well as optional incident roles you may not have thought about.
The role and responsibilities of a site reliability engineer (SRE) may vary depending on the size of the organization. For the most part, a site reliability engineer is focused on multiple tasks and projects at one time, so for most SREs, the various tools they use reflect their eve-evolving responsibilities. A typical SRE is busy automating, cleaning up code, upgrading servers, and continually monitoring dashboards for performance, etc., so they are going to see more tools in that toolbelt.
A comparison of EKS, AKS, GKE, Rancher and OpenShift from an SRE’s perspective.
Catchpoint is proud to present the top SRE tools as voted on by SREs. In our fourth annual SRE Survey, compiled in partnership with VMware Tanzu Observability and DevOps Institute, we simply asked, “What are a few tools that every SRE should have available in their toolbelt?” Today, we are excited to share the findings with you. While some of the answers were not strictly tools, the analysis gives us valuable insight into the mindset of an SRE.
Facebook’s October 2021 outage was the type of event that gives SREs nightmares: A series of critical business apps crashed in minutes and remained unavailable for hours, disrupting more than 3.5 billion users around the world and costing about 60 million dollars. As incidents go, this was a pretty big one.
xMatters is part technology, part service reliability, and a little bit of magic. If you’ve spent time on the xMatters website, you’ll likely have seen a number of valuable use cases for the platform—it can alert SREs when there’s a website outage, it can accelerate product development for DevOps teams, it can manage on-call schedules and alerts for support teams.