There comes a moment in every company when 24x7 support is needed. Congrats! The next step is to start building an on-call team. In this article, we'll go through some of the aspects you should consider. We'll keep it short and go deeper into each step in future articles.
If you haven't had the chance, check out our previous article in this series, "What is on-call? Why is it important?".
1. What is on-call for your company? Define it!
This is a crucial first step. First, you need to stop and think. What does on-call mean for your company? What do you want to achieve?
Many companies skip this step and hope everyone shares the same definition and the same understanding. That never happens. So you should define it and ensure everyone in the company understands it.
Usually, significant problems arise if this is not set correctly at the beginning. The company's default behavior will be to treat on-call with a "fix everything that happens" attitude: calling someone, or triggering an alarm, because one user is mad that they can't log in. Sure, it might be a big problem, but if only 1 in 1,000,000 users has that problem, is it that important?
So from this step, you should have a set of rules for what on-call is. And, most importantly, what on-call isn't.
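One way to make such a rule concrete is to write it down as a simple check. The sketch below is only an illustration: the function name `should_page` and the 1% default threshold are assumptions made for this example, not recommended values.

```python
def should_page(affected_users: int, total_users: int,
                min_affected_ratio: float = 0.01) -> bool:
    """Return True when the share of affected users reaches the threshold.

    The 1% default is a made-up value; pick one that matches your own
    definition of what on-call is (and isn't).
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    return affected_users / total_users >= min_affected_ratio


# One user out of a million who can't log in does not page anyone.
print(should_page(affected_users=1, total_users=1_000_000))
# Five percent of users affected does.
print(should_page(affected_users=50_000, total_users=1_000_000))
```

A rule like this will never capture every case, but having it written down gives the team something explicit to discuss and adjust.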
2. Talk to your team
On-call is a loaded term in IT. People are afraid of it. It can mean waking up in the middle of the night, not being able to go to the movies, and always having your computer with you, ready to go into action.
These things do happen, but people usually make on-call out to be a bigger problem than it is.
So talk with them. Understand their fears and concerns. Don't try to sell them on-call. Just listen to your team.
3. Set up rotations and scheduling
Considering your service needs and your expected on-call team size, you need to start setting up rotations and shifts.
A shift is the number of hours or days that someone will be on-call. It can be whatever length works best for your people. For example, a shift can be one whole week, the five weekdays, or a single weekend.
A rotation is the algorithm you will use to decide how shifts rotate between people.
Shifts and rotations need to be balanced so that people stay in good shape for their "normal" day of work.
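As an illustration of what a rotation algorithm can look like, here is a minimal round-robin sketch. The names (`build_schedule`, the example people) and the choice of round-robin are assumptions for this example; real scheduling tools also handle overrides, time zones, and holidays.

```python
from datetime import date, timedelta
from itertools import cycle


def build_schedule(people, start, shift_days, num_shifts):
    """Assign consecutive shifts to people in round-robin order.

    Returns a list of (shift_start, shift_end, person) tuples.
    """
    if not people:
        raise ValueError("at least one person is required")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")
    rotation = cycle(people)  # repeats the list of people forever
    schedule = []
    for i in range(num_shifts):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days - 1)
        schedule.append((shift_start, shift_end, next(rotation)))
    return schedule


# Hypothetical team of three taking one-week shifts.
for shift_start, shift_end, person in build_schedule(
    ["Ana", "Ben", "Cho"], date(2024, 1, 1), shift_days=7, num_shifts=4
):
    print(shift_start, shift_end, person)
```

With three people and one-week shifts, each person is on-call one week out of three, which makes the balance of the rotation easy to see and discuss.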
4. How to compensate the on-call team
There are multiple strategies for setting up compensation:
- No extra compensation. It's part of the wage.
- Set compensation for a shift. Every shift that a person takes gets $X.
- Set compensation for each on-call incident. People are paid only when they need to intervene. If nothing happens, there's no compensation.
- A mix of shifts and incidents.
There are no actual rules for setting up compensation. Usually, it's decided based on your expectations and current reality. Every strategy has its set of pros and cons.
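To compare the strategies above, it can help to run the numbers. The following is a rough sketch only: the `Strategy` enum, the `compensation` function, and all rates are hypothetical values chosen for illustration.

```python
from enum import Enum


class Strategy(Enum):
    NONE = "none"                  # part of the wage
    PER_SHIFT = "per_shift"        # fixed amount per shift
    PER_INCIDENT = "per_incident"  # paid only when intervening
    MIXED = "mixed"                # shifts plus incidents


def compensation(strategy, shifts, incidents,
                 shift_rate=0.0, incident_rate=0.0):
    """Return the extra pay for one person over a period."""
    if strategy is Strategy.NONE:
        return 0.0
    if strategy is Strategy.PER_SHIFT:
        return shifts * shift_rate
    if strategy is Strategy.PER_INCIDENT:
        return incidents * incident_rate
    if strategy is Strategy.MIXED:
        return shifts * shift_rate + incidents * incident_rate
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical month: 2 shifts, 3 incidents, made-up rates.
for strategy in Strategy:
    print(strategy.value, compensation(strategy, 2, 3, 100.0, 50.0))
```

Running a few realistic months through each strategy shows quickly how differently a quiet period and a busy one are rewarded.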
5. Incident escalation
Eventually, something will happen that the on-call person cannot fix on their own. They will need help.
Imagine a security incident where there's a breach in the middle of the night. Would you leave it in the hands of a single person, given all the details and processes involved in that situation? Multiple people from multiple teams would need to be brought in to resolve the crisis.
You need to set up a process for when this happens. What should trigger an escalation, and what steps should follow? Create guidelines for people to follow during those events.
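An escalation process can be written down as a small policy: who gets notified, and after how long without an acknowledgment. The sketch below assumes a time-based policy; the roles, the timings, and the names `EscalationLevel` and `who_to_notify` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EscalationLevel:
    after_minutes: int        # minutes without acknowledgment
    notify: Tuple[str, ...]   # roles to notify at this level


# Hypothetical policy; adapt the roles and timings to your company.
POLICY = [
    EscalationLevel(0, ("primary on-call",)),
    EscalationLevel(15, ("secondary on-call",)),
    EscalationLevel(30, ("engineering manager", "security lead")),
]


def who_to_notify(policy: List[EscalationLevel],
                  minutes_unacknowledged: int) -> List[str]:
    """Return everyone who should have been notified by now."""
    notified: List[str] = []
    for level in sorted(policy, key=lambda lvl: lvl.after_minutes):
        if minutes_unacknowledged >= level.after_minutes:
            notified.extend(level.notify)
    return notified


print(who_to_notify(POLICY, 20))
```

Time is only one possible trigger. Severity or incident type (such as a security breach) could escalate straight to a higher level.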
6. What tools do you need?
The process can be complex, and you do not want to manage it by hand. You will need tools for:
- Scheduling and alerting (PagerDuty, incident.io)
- Monitoring and triggering events (Datadog, New Relic, IsDown)
- Documenting the processes (Notion, Confluence)
- Communicating internally (Zoom, Slack)
- Communicating with customers (Statuspage, Better Uptime)
7. Monitor and Iterate
The reality is that after you set up everything, there's still a lot of work needed.
Keep an eye on how people are feeling. Gather feedback. It's very common for people to eventually get tired. Watch for those signs and adapt.
You need to manage the process. Companies evolve, new systems will come alive, and others will die. Reevaluate your approach from time to time, and adapt to reality.
You won't get it perfect on the first try, or even the second. You need to start, do it, and then learn the caveats of on-call in your company. Adjust and proactively make changes that improve the process for the company and the people.
This is a simple introduction to the steps you need to start running on-call in your company. There's a lot more to say, and we will go deeper into each topic in upcoming articles over the next few weeks.