“Incidents” have an inherent negative undertone, and with good reason. While they technically refer to any event that happens (even positive or neutral), more often than not the word indicates that something has gone wrong that needs rectifying.
In the world of IT, what is an incident? How do you take care of one? How do you prevent incidents (or replicate them)? Why does it matter?
Today, we’ll explore the ins and outs of incident management systems, discuss why you (probably) need one, and give you all the necessary information to make the right decision for your enterprise when it comes time to choose.
In this context, an incident is defined as “an unplanned interruption to an IT service, or reduction in the quality of an IT service, or a failure of a CI [configuration item] that has not yet impacted an IT service.” Incident management, therefore, is the process that is responsible for managing the life cycle of all incidents.
Incident management activities involve working toward restoring regular operations or resolving a specific type of incident. The goal here is for the IT team to return a service to normal “as quickly as possible after a disruption, in a way that aims to create as little negative impact on the business as possible.”
Essentially, an incident management system (or incident management software — IMS) decides who gets alerted about incidents and when.
There are many terms within this space that we won’t dive into today, so check out this OpsGenie blog for more on the language of these systems.
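To make "who gets alerted and when" a little more concrete, here is a minimal sketch of a routing rule in Python. Everything in it is an assumption made for illustration: the service names, team names, severity scale, and business hours are hypothetical, and real products express these rules through their own configuration rather than code like this.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Incident:
    service: str
    severity: int  # hypothetical scale: 1 = most severe, 4 = least severe
    opened_at: datetime


# Hypothetical routing table: which team owns which service.
SERVICE_OWNERS = {
    "checkout": "payments-team",
    "search": "search-team",
}

# Assumed business hours; weekends are ignored to keep the sketch short.
BUSINESS_HOURS = (time(9, 0), time(17, 0))


def route(incident: Incident) -> tuple[str, str]:
    """Return (who to alert, when to alert them) for an incident."""
    # Services without a known owner fall back to a general service desk.
    owner = SERVICE_OWNERS.get(incident.service, "it-service-desk")
    start, end = BUSINESS_HOURS
    in_hours = start <= incident.opened_at.time() < end

    if incident.severity <= 2:
        # Severe incidents page the owning team immediately, day or night.
        return owner, "page now"
    if in_hours:
        return owner, "notify now"
    # Minor incidents raised out of hours wait until business hours.
    return owner, "notify in business hours"
```

Under these assumptions, `route(Incident("checkout", 1, datetime(2024, 1, 1, 3, 0)))` pages the payments team straight away, while a low-severity incident raised at 10 p.m. waits until the next working window.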
The Role of Incident Management Systems
With an increasing number of services being considered “always on,” it is becoming even more crucial for IT teams to have incident management systems in place to ensure they can stay in control during incidents (and respond effectively).
Having the ability to plan ahead and prepare for the incidents that will inevitably occur is essential for effective operations. When they do occur, it is important that alerts are never missed and that the right people are notified. After the incident, IT teams need the ability to analyze response activities and identify areas for improvement. And, of course, any part of the process that can be automated saves the IT team valuable time.
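As one example of what analyzing response activities can look like, the sketch below computes two commonly used figures, mean time to acknowledge and mean time to resolve, from a small incident log. The log entries and function names are made up for illustration; an actual IMS would produce reports like this from its own records.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: (opened, acknowledged, resolved) timestamps.
INCIDENT_LOG = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 5), datetime(2024, 1, 1, 10, 0)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 15), datetime(2024, 1, 2, 14, 45)),
]


def mean_minutes(durations):
    """Average a series of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mean_time_to_acknowledge(log):
    return mean_minutes(acked - opened for opened, acked, _ in log)


def mean_time_to_resolve(log):
    return mean_minutes(resolved - opened for opened, _, resolved in log)


print(mean_time_to_acknowledge(INCIDENT_LOG))  # 10.0
print(mean_time_to_resolve(INCIDENT_LOG))      # 52.5
```

Tracking how these numbers move from month to month is a simple way to see whether changes to the process are actually helping.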
With the term “incident” covering such a wide range of potential events, incident management systems also cover many different operations. Below are some examples.
Management of on-call rotas — Just like hospitals need doctors on call, IT organizations have on-call employees to fix issues for software services as they arise.
Escalations — This term can mean different things to different people, but it all comes down to reassigning an incident to someone else. This could mean assigning an incident to a more expert team (or third-party supplier), adjusting the priority of the incident (usually upwards), or updating the incident and alerting staff when it looks likely that the resolution will be late.
Internal teams — Responding quickly and effectively to incidents requires a solid IT team that is on top of their game. Incident management systems help enterprises keep their internal teams organised so that there is a clear process laid out to tackle any incident that may arise.
Virtual incident “situation rooms” and communications — A relatively new feature of an IMS is the ability to set up a virtual “situation room” for an incident. The system is responsible for inviting the required attendees, sharing any relevant documentation and collateral, providing an accurate communication history, and tracking tasks and actions.
Communications with 3rd parties — As mentioned above, sometimes activities within incident management (like escalations) require communications between the IT team and 3rd parties (e.g. an SME from a service or application provider). A good incident management system should be able to bridge these gaps and handle all communications to ensure no wires get crossed during those critical moments.
Relationship with status pages — Increasingly, IT teams are utilizing status pages to keep their customers, users and employees in the loop about outages, system metrics and statuses, as well as planned maintenance. In the case of incident management, status pages can serve as a channel for public communications regarding any incidents that may occur.
This list is by no means all-inclusive, but it paints a good picture of some of the areas to consider when selecting an incident management system. As you can see, an IMS can help IT teams coordinate the various streams and activities, stay up to date on what is happening inside their infrastructure, and quickly address any issues using the proper process and channels.
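Two of the items above, on-call rotas and escalations, are easy to picture in code. The following is a rough sketch only, assuming a weekly rotation and a time-based escalation policy; the engineer names, start date, and step timings are all hypothetical, and a real IMS would let you configure them rather than hard-code them.

```python
from datetime import date

# Hypothetical weekly rota: each engineer is on call for one week in turn.
ROTA = ["alice", "bob", "carol"]
ROTA_START = date(2024, 1, 1)  # a Monday; the first week belongs to ROTA[0]


def on_call(day: date) -> str:
    """Return who is on call on the given day."""
    weeks_elapsed = (day - ROTA_START).days // 7
    return ROTA[weeks_elapsed % len(ROTA)]


# Hypothetical escalation policy: (minutes without acknowledgement, who is alerted).
ESCALATION_STEPS = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "engineering manager"),
]


def notified_so_far(minutes_unacknowledged: int) -> list[str]:
    """Return everyone who should have been alerted by now."""
    return [who for after, who in ESCALATION_STEPS if minutes_unacknowledged >= after]


print(on_call(date(2024, 1, 10)))  # bob
print(notified_so_far(20))         # ['primary on-call', 'secondary on-call']
```

The point of the escalation list is that an unacknowledged alert never stalls with one person: the longer it goes unanswered, the wider the circle of people who hear about it.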
Modern IMS & Anomaly Detection
These software systems have always been designed for collecting consistent, time-sensitive, documented incident report data. Luckily for everyone in IT, though, IMS has come a long way, and modern products have become even more advanced.
First, they often provide administrators with the ability to “configure the Incident report forms as needed, create analysis reports, and set access controls on the data.” Incident reports are often customizable to better suit the needs of the specific organizations using the systems, which saves time when producing post-incident reports and documentation. Additionally, some of these products also have the ability to collect images, video, audio and other data.
No modern system would be truly keeping up with technology if it did not utilize machine learning and AI to continuously improve upon itself, and IMS is no exception. For example, let’s take a look at automated anomaly detection and alerting.
It is not difficult to see why finding anomalies and notifying the proper people of their existence is a business essential. However, have you stopped to think about the time and resources that would be required to handle anomaly detection and alerting manually? While this may be possible on a very small scale, it is not a viable option when you consider the amount of data generated by most modern enterprises (especially those that consider themselves “always on”).
In business computing, “anomalous information must be quickly recognized in order to take appropriate action, addressing both risks and rewards quickly and accurately.”
That means leveraging artificial intelligence (AI) and its offspring, machine learning (ML).
With AI and ML, incident management systems are able to scale up to a number of metrics that simply would not be possible with manual anomaly detection (unless you were to employ hundreds of thousands of analysts — no big deal, right?).
Due to technological advancements in recent years, companies have more data to work with than ever before, and ML has become a vital piece of helping them to sift through the noise and find the data that needs their attention (like anomalies). “Business metrics can be constantly analyzed against goals, incidents requiring actions can be flagged in real-time, and the system can find unexpected anomalies in order to rapidly help business adapt to changing conditions.”
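To give a feel for what automated anomaly detection means in practice, here is a deliberately simple statistical stand-in: it flags any reading that sits far from the average of the readings just before it. This is not how any particular product works, and real systems use far more sophisticated ML models; the function name, window size, and threshold are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indices of values that deviate strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to measure against, so skip it.
            continue
        # Distance from the recent average, measured in standard deviations.
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical metric, e.g. response time in milliseconds sampled once a minute.
readings = [100, 102, 98, 101, 99, 100, 250, 101]
print(find_anomalies(readings))  # [6]
```

Checking one metric like this is trivial. Doing it continuously across thousands of metrics, each with its own seasonality and normal range, is where ML-based approaches earn their keep.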
Do you need one?
If you are even considering the question of whether you need an incident management system, the answer is probably “yes.” IT teams are being forced to deal with more data, requests, and incidents than ever before, simply because the world has become so technologically connected and advanced. Especially in the case of services that are “always on,” incident management software can be a life-saver for your enterprise.
If you are looking for some great IMS options to consider, check out:
- VictorOps (recently acquired by Splunk)
- OnPage (specialising in healthcare, HIPAA-compliance)
All of these options can help DevOps teams to plan ahead for service disruptions and stay in control during incidents.
Even though “incidents” may have a negative tone (and cause IT teams to cringe), they do not have to interrupt or shut down your operations. By implementing an incident management system, you empower your enterprise to prepare for the inevitable incidents that will come — and ensure the team can swiftly and effectively remedy any situation.
As always, if you have any questions or comments regarding this piece or the OpsMatters platform, please leave a comment below or reach out to us at firstname.lastname@example.org.