Businesses need to be prepared for both minor and major incidents affecting their technology, whether an integration disconnects or an entire system goes offline. Preparation not only minimizes losses, it also protects the business, and potentially its clients, from the riskier consequences.
Experts don’t always agree on what constitutes a major incident, but you can create a definition for your company. A major incident interrupts your company’s ability to function, sometimes completely shutting you down. Most often, the incident is human-made, be it hackers stealing data, emails infecting your system with ransomware, or employee errors that introduce catastrophic failures. When these major incidents occur, you need to immediately set your team in motion.
Major incident management is at the heart of company effectiveness, so let’s look at how to handle these processes with ease.
Identify Your Major Incidents
It’s important to reach a company-wide consensus on which issues constitute a major incident for your business. These can include situations that recur or have happened in the past, as well as worst-case scenarios that have not happened yet but should be planned for. Be specific so that you can quickly determine whether an incident really needs to be treated as a major incident.
DevOps teams and Site Reliability Engineers often assign budgets for allowable downtime to specific services, so they know how many people to allocate when there is a problem. Getting these downtime budgets right can take many iterations, so be patient if you’re just starting.
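As a rough illustration of a downtime budget, the sketch below converts an availability target into allowable downtime for a period. The function names, the 99.9% target, and the 30-day window are assumptions chosen for the example, not figures from any particular team or tool.

```python
from datetime import timedelta


def downtime_budget(availability_target: float, period: timedelta) -> timedelta:
    """Return the downtime allowed in `period` for a given availability target."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return period * (1.0 - availability_target)


def budget_remaining(budget: timedelta, downtime_so_far: timedelta) -> timedelta:
    """Return the unspent budget; a negative value means it is overspent."""
    return budget - downtime_so_far


# Hypothetical example: a 99.9% target over a 30-day window,
# with 30 minutes of downtime already recorded in that window.
budget = downtime_budget(0.999, timedelta(days=30))
remaining = budget_remaining(budget, timedelta(minutes=30))
print("budget:", budget, "remaining:", remaining)
```

A 99.9% target over 30 days leaves roughly 43 minutes of allowable downtime. An outage that would consume most of that in one go is a reasonable candidate for major incident treatment, though where to draw the line is a decision for each company.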
Create an Incident Team
Instead of relying on the usual company hierarchy, many organizations establish a major incident team made up of key players from across the business. Typically, someone from the IT side of the business leads the team, and some businesses go as far as creating a dedicated Major Incident Director role. Many departments will have a senior-level employee on the incident team to ensure business-wide alignment. Consider departments with oversight in tangential areas, such as Communications, Customer Success, and Social Media, all of which will need to be aware of incidents and the response protocols.
The complexity and urgency of these situations create a need to use a collaboration platform to contact the right on-call resource immediately, and to escalate quickly to the next person if necessary.
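The escalation behavior described above can be sketched as a simple loop over an ordered on-call chain. Everything here is hypothetical: `Responder`, `escalate`, and the `page` callback stand in for whatever your collaboration platform actually provides, and the contact addresses are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Responder:
    name: str
    contact: str  # hypothetical: a phone number, chat handle, or email


def escalate(
    chain: Sequence[Responder],
    page: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Page each responder in order and return the first who acknowledges.

    `page` is a stand-in for the platform call. It should notify the
    responder, wait for the acknowledgement window, and return True only
    if the responder acknowledged in time.
    """
    for responder in chain:
        if page(responder):
            return responder
    return None  # nobody acknowledged; the caller decides what happens next


# Hypothetical chain and a fake pager in which only the backup acknowledges.
chain = [
    Responder("primary on-call", "primary@example.com"),
    Responder("backup on-call", "backup@example.com"),
    Responder("incident director", "director@example.com"),
]
paged: List[str] = []


def fake_page(responder: Responder) -> bool:
    paged.append(responder.name)
    return responder.name == "backup on-call"


owner = escalate(chain, fake_page)
print(owner.name if owner else "unacknowledged", paged)
```

Passing the paging behavior in as a callback keeps the escalation order separate from the notification channel, so the same chain can be exercised in a drill with a fake pager before it is wired to a real system.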
Develop a Process
Experts stress that your company must develop a process to deal with major incidents that is separate from your other business protocols. Doing so will help limit the fallout by allowing staff to quickly identify the problem and then take steps to fix it. The major incident process relies on a set of clearly defined steps for addressing the problem.
An effective collaboration platform should also share data between systems so team members can access and use it, whether they are used to working in a service desk, monitoring system, incident management system, or another tool. If team members must pull information from a system they don’t ordinarily use, the process will slow down. Experts suggest that your major incident workflow include the following steps:
- Identify the problem and determine if it meets your criteria for a major incident.
- Locate and meet with the Incident Response team.
- Have the team diagnose the cause of the problem.
- Notify all stakeholders of the situation and offer regular updates.
- Put temporary and/or permanent fixes in place.
- Resume business processes as soon as possible, starting with the most vital ones.
- Investigate the cause of the incident and implement preventative actions so the situation will not be repeated.
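The steps above can be sketched as an ordered checklist that records when each step was completed and with what note. This is a minimal sketch under stated assumptions: the `Step` and `MajorIncident` names are hypothetical, and enforcing a strict order is a simplification, since in practice stakeholder updates continue throughout the incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum
from typing import List, Optional, Tuple


class Step(IntEnum):
    IDENTIFY = 1
    ASSEMBLE_TEAM = 2
    DIAGNOSE = 3
    NOTIFY_STAKEHOLDERS = 4
    APPLY_FIX = 5
    RESUME_OPERATIONS = 6
    INVESTIGATE_AND_PREVENT = 7


@dataclass
class MajorIncident:
    title: str
    # Each entry is (step, completion time in UTC, free-text note).
    log: List[Tuple[Step, datetime, str]] = field(default_factory=list)

    def next_step(self) -> Optional[Step]:
        """Return the next step to complete, or None when all are done."""
        done = len(self.log)
        return Step(done + 1) if done < len(Step) else None

    def complete(self, step: Step, note: str) -> None:
        """Record a completed step, rejecting steps taken out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.log.append((step, datetime.now(timezone.utc), note))


# Hypothetical usage for an illustrative incident.
incident = MajorIncident("Checkout service offline")
incident.complete(Step.IDENTIFY, "Meets the major incident criteria")
incident.complete(Step.ASSEMBLE_TEAM, "Incident team paged and on a call")
print(incident.next_step())
```

The timestamped log doubles as a record for the post-incident investigation in the final step.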
The main focus should be to limit the time that your employees and clients are impacted by the incident. The longer it continues, the more problems your company will experience as a result, including loss of revenue, reputation damage, and potential loss of clients. An organized response will do much to lessen the impact of the incident.
How to Automate a Major Incident Process
After the problem has been corrected, and you’ve determined the cause and taken steps to prevent a recurrence, you must also deal with the potential fallout among clients and employees.
Depending on the incident’s severity and length, your company brand may also suffer. Customers and suppliers may be initially sympathetic, but if problems linger, you may easily lose both groups of stakeholders. You need a communication plan in place to address the effects of your major incident. You may also consider some small gesture, such as a discount or special perk, to regain trust and goodwill.
Your company experiences small glitches and issues every week if not every day. A major incident is one that severely compromises your company’s ability to function and, in some instances, shuts it down completely. You need a major incident process in place that includes a specially selected team. When a data disaster strikes, you will be able to address it immediately and minimize the damage to your workflow and your customer base. Rather than be mired in disarray, your employees will be able to take logical, meaningful steps to get your business back up and running more quickly.
Preserving the remediation steps and chat conversations in service desk and other systems will help with post-incident analysis, improve future responses, and even help prevent some major incidents. The process has to be matched with an infrastructure that supports it. If people critical to incident management can’t easily access information when they need it, the process will break down, or at least slow down. Integrations between systems enable your people to work in the systems they already use.