Scaling Site Reliability Engineering Teams the Right Way
Most SRE teams eventually reach a point where they appear unable to meet all the demands placed upon them. This is when these teams may need to scale. However, it’s important to understand that increasing team capacity is not the same as increasing the number of people on the team. Let’s unpack what scaling a team is all about: what the indicators are, what steps you can take, and how you know when you’re done.
Scaling Triggers
Sometimes it is very easy to tell whether you need to scale your team or not. For example:
- The team has been assigned more services to manage,
- Traffic or users have significantly increased, or
- Service Level Objectives (SLOs) have become more demanding
In the above situations it is usually obvious that the team needs to scale.
In other situations, the signs that you need to scale are more subtle and often ambiguous. Here are a few things that may be indicators that your team needs to scale:
An increase in toil: Toil is repetitive work that creates no long-term value, and it needs to be actively controlled. Automation, runbooks, and retrospectives all reduce toil. However, a team under pressure has no slack to think about quality-of-life improvements like toil reduction. It is constantly scrambling to maintain reliability and fulfil business objectives.
A decrease in reliability or performance: Like toil, reliability and performance need to be actively managed. Overstretched teams often react to SLO breaches rather than proactively initiating performance or reliability projects.
Improvement projects are delayed or canceled: An increase in toil and a decline in performance or reliability can be symptoms of a more general problem: neglecting long-term planning in favour of reacting to short-term issues. Another symptom is improvement projects of any kind being de-prioritized in favour of feature development.
Decline in the team’s morale: People in teams that need to scale are usually overloaded, stressed, and close to burnout. This is, in fact, the number one reason to scale your team, since losing people is among the most difficult problems to recover from.
None of these indicators is conclusive, and each can have other causes. You need to be sure that you are solving the correct problem. It can be very tempting to see extra headcount as a blanket solution, but adding people can worsen the underlying problem and leave you with the trickier problem of scaling down.
Adding people to your team should be the last thing you do after exhausting all other options. This is not only more prudent financially, but it also ensures that you are not ignoring problems that could become more difficult to address over time.
When thinking about any technical initiative, it is useful to break it down using the People-Process-Tools model. This assumes that the most important factors that impact an initiative are, in order of importance, People, Processes, and Tools. Let’s look at them in the order in which you should act on them: processes first, then tools, and people last.
Process
Before starting a scaling effort, you should know what metrics you are trying to improve and how you should be measuring them. It is an engineering axiom that you can’t optimize what you’re not measuring. The exact metrics to look at will vary from team to team and from situation to situation, but here are a few to start with:
- Actual performance against SLOs
- Project metrics:
  - 80th percentile wait time
  - 80th percentile cycle time
  - Average daily queue size
- Mean time to acknowledge (MTTA)
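As a rough illustration, the metrics above could be computed from a ticket and incident export along the following lines. This is a minimal sketch, not a prescribed method: the sample data is invented, wait time is assumed to run from ticket creation to work starting, cycle time from work starting to completion, and the percentile uses the nearest-rank method. Your tracker may define these terms differently.

```python
from datetime import datetime, timedelta
from math import ceil
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def hours(delta):
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


# Hypothetical ticket export: (created, work started, work finished).
t0 = datetime(2024, 1, 1, 9, 0)
tickets = [
    (t0, t0 + timedelta(hours=2), t0 + timedelta(hours=10)),
    (t0, t0 + timedelta(hours=4), t0 + timedelta(hours=20)),
    (t0, t0 + timedelta(hours=8), t0 + timedelta(hours=30)),
    (t0, t0 + timedelta(hours=24), t0 + timedelta(hours=72)),
    (t0, t0 + timedelta(hours=48), t0 + timedelta(hours=60)),
]

# Assumed definitions: wait = created -> started, cycle = started -> finished.
wait_times = [hours(started - created) for created, started, _ in tickets]
cycle_times = [hours(finished - started) for _, started, finished in tickets]
p80_wait_hours = percentile(wait_times, 80)
p80_cycle_hours = percentile(cycle_times, 80)

# Hypothetical daily snapshots of the number of open tickets.
daily_queue_sizes = [12, 15, 9, 14, 10]
average_queue_size = mean(daily_queue_sizes)

# Hypothetical incidents: (triggered, acknowledged).
incidents = [
    (t0, t0 + timedelta(minutes=2)),
    (t0, t0 + timedelta(minutes=5)),
    (t0, t0 + timedelta(minutes=11)),
]
mtta_minutes = mean(
    (acked - triggered).total_seconds() / 60 for triggered, acked in incidents
)

print(f"p80 wait: {p80_wait_hours:.1f} h, p80 cycle: {p80_cycle_hours:.1f} h")
print(f"avg queue: {average_queue_size}, MTTA: {mtta_minutes:.1f} min")
```

Percentiles are used here rather than averages because a handful of very slow tickets can hide behind a healthy-looking mean.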
Once you are measuring your key metrics, institute a process to frequently evaluate your performance on those metrics. It might be as simple as taking a few minutes in every sprint retrospective for this purpose.
Don’t underestimate the value of processes to help you scale. Smaller teams often use simplistic, ad-hoc processes, and engineers often dismiss processes as undesirable overhead. This misses the raison d'être of processes: to reduce error and improve efficiency.
Management Processes
- Toil Limits ensure that toil reduction tasks are prioritized.
- Postmortems identify measures to prevent the repetition of incidents.
- Agile methods like Kanban ensure management processes themselves are efficient.
- Reports like finger charts can help identify bottlenecks.
Engineering Processes
- Alert Noise Reduction quietens noisy alerts and prioritizes them. This reduces the effort needed to manage incidents.
- Alert Routing ensures that only the appropriate people are notified about incidents.
- Automation reduces toil and errors.
- Pairing aids knowledge transfer and reduces errors.
- Infrastructure as Code improves repeatability and reduces errors.
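To make alert noise reduction and alert routing concrete, here is a minimal sketch of both ideas. Everything in it is an assumption made for illustration: the `Alert` fields, the five-minute deduplication window, the service-to-team routing table, and the team names. Real incident management tools offer far richer rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str     # "critical" or "warning" (assumed levels)
    timestamp: float  # seconds since some reference point


# Hypothetical routing table: service name -> on-call rotation.
ROUTES = {"payments": "payments-oncall", "search": "search-oncall"}
DEFAULT_ROUTE = "sre-oncall"
DEDUP_WINDOW_SECONDS = 300  # assumed suppression window


def deduplicate(alerts, window=DEDUP_WINDOW_SECONDS):
    """Keep an alert only if the same (service, check) was not kept within `window` seconds."""
    last_kept = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        if previous is None or alert.timestamp - previous >= window:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept


def route(alerts):
    """Group alerts by on-call rotation, critical alerts first, then oldest first."""
    queues = {}
    for alert in alerts:
        target = ROUTES.get(alert.service, DEFAULT_ROUTE)
        queues.setdefault(target, []).append(alert)
    for queue in queues.values():
        queue.sort(key=lambda a: (a.severity != "critical", a.timestamp))
    return queues


raw_alerts = [
    Alert("payments", "latency", "warning", 0),
    Alert("payments", "latency", "warning", 60),   # repeat inside the window
    Alert("payments", "errors", "critical", 120),
    Alert("search", "disk", "warning", 30),
    Alert("batch", "cron", "warning", 10),         # no route: goes to the default
    Alert("payments", "latency", "warning", 400),  # outside the window: kept
]
queues = route(deduplicate(raw_alerts))
for target, queue in queues.items():
    print(target, [(a.check, a.severity) for a in queue])
```

In this sketch the repeated latency warning is suppressed, and the payments rotation sees its critical error alert before the older warning.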
Tools
The subject of SRE tools is pretty vast – too large for this article. So rather than going into a potentially lengthy discussion of specific tools, let's discuss how to think about tools in the context of scaling.
Different kinds of tools have different kinds of scaling impacts. It is important to have hard data that indicates what kind of improvement is necessary. This data may be in your project management or trouble ticket system, but more often than not you will need to get feedback from your team.
In general, there are a few kinds of results that you should expect from the tools that your team is using:
Tools that help you handle more load with the same team
This could be anything from pssh to Ansible that helps you handle large fleets of servers, VMs, or containers. Modern monitoring tools not only perform better at scale, they are often easier to configure too. Incident management tools like Squadcast prioritize and deduplicate incidents, allowing engineers to focus on critical tasks.
Tools that reduce rework by reducing errors
Script libraries, runbooks and runbook automation systems all facilitate task repeatability – allowing tasks to be executed reliably as frequently as needed. Using containers to implement immutable servers ensures that subtle errors caused by config drift are avoided.
Tools that eliminate certain kinds of work
Container orchestration systems like Kubernetes eliminate huge swathes of work: everything from setting up process supervisors like Supervisor to managing load balancers.
Distributed tracing systems like OpenTelemetry reduce the need for complex log aggregation systems to track transactions through distributed systems.
Tools that help delegate work
Tools like Rundeck allow secure, guard-railed, role-based access to scripts. This allows dependent teams like developers or customer support to work independently without adding to the SRE workload.
Similarly, tools like Metabase, Kibana and Grafana can be used to provide self-service access to production data, logs or metrics to Product Management, Customer Support, or Management.
Providing senior management with the ability to answer their own questions is a particularly powerful way to reduce a lot of high priority, low value-add effort.
There are no silver bullets
Avoid the idea that tools are a panacea. Introducing new tools can be financially burdensome and disruptive. If introduced unwisely they can easily make your team worse off. This is why a clear cost-benefit analysis is necessary before investing in new tools.
People
Once you have exhausted all other options to increase your team’s capacity, it is time to start adding people to your team.
Capacity Planning
Capacity planning is more an art than a science, requiring a combination of hard data and judgment calls. There is no surefire method to build the perfect capacity plan, but here are some tips:
- Use data about your existing load to make projections. This can be in ideal man-hours or story points. Relate that to the services under management. You should be able to say something like, “Adding another microservice will add about 50 hours of project work per quarter” or “We currently have 80 story points of demand every sprint versus 60 points of capacity.” You have to be able to approximately quantify and reason about the current and projected loads.
- Factor in the relative productivity and cost of seniors vs juniors. Juniors often take longer on tasks than seniors. Seniors often have other responsibilities like code reviews, mentoring, or interviews. As with load, you should be able to quantify and reason about capacity.
- High utilization is not a sign of efficiency. Utilization is the ratio of task hours to available working hours, so the higher it is, the less slack time there is for innovation and improvement. High utilization is also likely to lead to frustration and burnout. Try to plan for 30% slack.
- While it might be a good idea to plug all these numbers into a spreadsheet to make your projections, do not lose sight of the fact that these are only rough approximations of reality. Ensure that you are conservative in capacity projections and liberal in demand projections. Add buffers liberally. It’s always better to end up with slightly more capacity than you need than slightly less.
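The arithmetic behind these tips can be sketched as follows. It is a rough illustration rather than a planning formula: it assumes that "30% slack" means planned work should consume at most 70% of capacity, and the figure of 10 story points per engineer per sprint is invented for the example. The demand and capacity numbers are the ones quoted in the first tip.

```python
from math import ceil


def utilization(task_load, available_capacity):
    """Ratio of planned work to available capacity (same unit for both)."""
    return task_load / available_capacity


def required_capacity(demand, slack=0.30):
    """Capacity needed so that `demand` uses only (1 - slack) of it."""
    if not 0 <= slack < 1:
        raise ValueError("slack must be in the range [0, 1)")
    return demand / (1 - slack)


def additional_engineers(demand, current_capacity, capacity_per_engineer, slack=0.30):
    """Whole engineers to add to close the gap; never negative."""
    gap = required_capacity(demand, slack) - current_capacity
    return max(0, ceil(gap / capacity_per_engineer))


demand_points = 80       # story points of demand per sprint (from the example above)
capacity_points = 60     # story points of capacity per sprint (from the example above)
points_per_engineer = 10 # hypothetical average output per engineer per sprint

current_utilization = utilization(demand_points, capacity_points)
target_capacity = required_capacity(demand_points)
hires_needed = additional_engineers(demand_points, capacity_points, points_per_engineer)

print(f"utilization today: {current_utilization:.0%}")
print(f"capacity needed for 30% slack: {target_capacity:.1f} points")
print(f"additional engineers: {hires_needed}")
```

Under these assumptions the team is running at about 133% utilization and would need roughly 114 points of capacity, or six more engineers at the assumed output. In practice you would weight `points_per_engineer` by seniority and add the buffers described above.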
Team Composition
There are a few major factors to consider when planning the composition of your team:
Experience: Balancing the experience mix of your team requires a set of trade-offs. In general, we can bucket people into juniors, intermediates, and seniors. The definition of these buckets in terms of years of experience and capability will vary depending on your local labor market, tech stack, and business domain. Somebody with 10 years of experience managing Go microservices might be considered senior, but somebody with similar experience on nuclear power station systems may be considered junior.
Juniors are less expensive and less productive while seniors are the opposite. So why not staff completely with that happy medium – intermediates? This idea ignores the special value that both seniors and juniors add. Seniors’ experience allows them to quickly solve problems without reinventing the wheel and, more importantly, teach others while doing it. Juniors are future intermediates who don’t need to be un-trained on bad habits picked up elsewhere.
The best compromise is to build your team around a core of intermediates, with a small number of juniors and seniors to round it off. A proportion of 20:60:20 of juniors, intermediates, and seniors might be a goal to strive for.
Diversity: Even if you ignore the moral imperative to support groups that have historically been discriminated against, there are good operational reasons to seek diversity in your team. Multiple perspectives contribute to greater creativity and innovation. There’s also some anecdotal evidence that diverse teams are better behaved and more professional than the testosterone-fuelled boys’ clubs that non-diverse teams can occasionally become.
Culture Fit: “Cultural fit” has often been a tool of convenience to exclude those who don’t conform to a preconceived notion of what an engineer should be. In my book there is only one fundamental purpose of a cultural fit check, and that is to exclude jerks. Nothing saps a team’s productivity like a negative individual who constantly creates petty conflicts or belittles teammates. It’s important to filter out jerks during the recruitment process itself and to get rid of them quickly if identified later. Don’t give high-performing jerks a pass: their productivity rarely makes up for the drop in performance they create in the team.
Candidate Sources
Where can you hire from? One good way is to poach people from elsewhere in your company. They’re often a known quantity and usually much cheaper than external hires. Many traditional organizations have System Administration, Build, or DevOps teams with people who would make good SREs. Software developers can bring engineering rigor to the team.
Usually, though, internal hiring just moves the scaling problem to another team, so you will also need external candidates. The most effective candidate sourcing mechanisms vary from place to place, but here are some important ones:
- Employee referrals
- Recruitment consultants
- Job boards
- Advertising
- Careers page on your website
In general, employee referrals are cheaper and have a better hit rate than all other mechanisms because they are pre-filtered by the employee. Ensure that you have rewards and incentives to encourage them.
Increasing capacity via hiring is time consuming and fraught with uncertainty. Ideally, you should start months in advance of the projected growth. Unfortunately most of us don’t have that luxury, so it is critical that you have contingency plans in place to handle hiring delays.
Conclusion
Scaling SRE teams is a challenging exercise that requires extensive analysis and planning. Adding people is slow, expensive, and risky, so consider process or technology improvements to tide you over. When you start hiring, it pays to plan capacity requirements with data rather than gut instinct. Be thoughtful about the composition of your team, as it can be critical to long-term success.