You've joined a company, or worked there a little while, and you've just now realised that you'll have to do on-call. You feel like you don't know much about how everything fits together, so how are you supposed to fix it at 2am when you get paged? So you're a little nervous. Understandable. Here are a few tips to help you become less nervous.
In a recent experiment, I polled my colleagues with the following question: “What would you do if the lights went out as you worked at night?” Some responses were funny, and some showed who you would want around in an emergency, but most of my colleagues said they would first check whether the problem was broader than their own home.
Being on call is never fun, especially when a high-priority customer reports an issue. When you’re on call and get paged, your immediate action is likely to jump into your solution of choice (preferably Lightstep), query whatever real-time data you have on hand to investigate the issue, and narrow down the possible problems.
Background
We recently released the biggest overhaul to one of the core features of Spike.sh: on-call schedules. Software teams use on-call schedules to designate first responders who will handle issues when they occur.
On-call planning is one of the most popular features in Enterprise Alert and is widely used by team members, team managers and administrators. However, in our discussions we keep finding that it takes more than 5 minutes of planning. Scheduling often depends on external systems. These can range from a simple Excel form provided to HR all the way to a comprehensive billing system such as SAP. As a result, it takes quite a bit of time to transfer the planned shifts to third-party systems.
We’re excited to present a feature update to the OnPage platform. The new update will bring more flexibility and resiliency to a team’s on-call management workflow. With the new scheduling capabilities, OnPage system administrators can create exceptions to configured, recurring on-call schedules.
An on-call schedule tells you and everyone in the team who will be the first responder when an issue happens in production. The on-call team member is responsible for investigating the issue, either fixing the issue herself or adding other people who can help fix it. Having an on-call schedule is important for building reliable systems because making someone responsible for production issues makes sure that they're not ignored.
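To make the idea of "who is the first responder right now" concrete, here is a minimal sketch of a round-robin schedule lookup. It is an illustration under stated assumptions, not how any particular product implements schedules: the function name `on_call_for`, the names in the rotation, the start date, and the fixed seven-day shift length are all hypothetical.

```python
from datetime import date


def on_call_for(day: date, rotation: list[str], start: date, shift_days: int = 7) -> str:
    """Return the first responder for `day` in a simple round-robin rotation.

    Assumptions (hypothetical, for illustration only): shifts have a fixed
    length of `shift_days`, the first shift begins on `start`, and people
    take turns in the order they appear in `rotation`.
    """
    if not rotation:
        raise ValueError("rotation must contain at least one person")
    if shift_days <= 0:
        raise ValueError("shift_days must be positive")
    if day < start:
        raise ValueError("day is before the rotation start")

    # Number of complete shifts that have elapsed since the rotation began.
    shifts_elapsed = (day - start).days // shift_days
    # Wrap around so the rotation repeats indefinitely.
    return rotation[shifts_elapsed % len(rotation)]


# Hypothetical team and start date (1 January 2024 is a Monday).
rotation = ["alice", "bob", "carol"]
start = date(2024, 1, 1)

print(on_call_for(date(2024, 1, 10), rotation, start))  # second week -> bob
```

Because the lookup is a pure function of the date, anyone on the team can answer "who is on call?" for any day without asking around, which is the point of publishing a schedule in the first place.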
The always-on, always-available expectations of digital services have increased the pressure on technical teams to be ready to respond around the clock. For teams new to this concept, introducing on-call can be stressful and complex. As part of PagerDuty’s main platform, on-call management is key to our business, but the non-technical aspects are also important for teams to consider.
Incident management is the process used by developer and IT operations teams to respond to system failures (incidents) and restore normal service operation as quickly as possible. "Incident" is a broad term describing any event that causes either a complete disruption or a decrease in the quality of a given service. Incidents usually require an immediate response from the development or operations team, often referred to as the on-call or response team in incident management.