Cyber, incident, downtime: Three words that chill the board, and how to tame them

There are three words that every board member dreads hearing strung together: "Cyber... incident... downtime". They are never the precursor to a good meeting!

Technology incidents can leave the business in the dark and bring the wheels of industry grinding to a halt. A Gartner report found that, when systems are not operational, companies can lose up to half a million dollars per hour to a severe incident, counting both losses and remediation.

As the UK moves into what is forecast to be a long recession, with a new Prime Minister and continuing climate, energy, and geopolitical troubles, firms are looking for what constancy they can get, to better plan for survival, and maybe even a little growth, should it be possible.

Setting the right stance for success means putting the organisation into a proactive digital operations mode — being operationally mature enough to manage the complexities and challenges created by cyber adversaries, 'acts of god', or old-fashioned mistakes and technology failures.

Operational maturity tames the beast

Operational maturity plays a crucial role in how well teams are able to handle incident response and unplanned work – those beasts of downtime. PagerDuty's 2022 State of Digital Operations Report, drawing on real-world use cases from technical teams within organisations, found that 42% of participants worked more hours in 2021 than in 2020. What's more, across all industries, 54% of responders were interrupted outside of normal working hours. A lack of operational maturity is one cause of this reliance on break-fix and manual intervention.

There's a ladder of stages on the way to full operational maturity, and it progresses like this:

  • Stage one: MANUAL. No inbound integrations (incidents are initiated manually).
  • Stage two: REACTIVE. Has some inbound integrations but not other configurations, no defined processes for managing incidents.
  • Stage three: RESPONSIVE. Has defined call-out schedules and multiple escalation levels; teams moving towards full service ownership.
  • Stage four: PROACTIVE. Uses outbound integrations, service dependencies, change events, and response plays to fix issues before customers are aware.
  • Stage five: PREVENTATIVE. Adopts event intelligence features or consumes analytics to allow predictive remediation.
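For teams that want to see where they sit on this ladder, the stages can be read as a simple checklist. The following is a minimal sketch, not PagerDuty's own scoring method: the `OpsProfile` field names are hypothetical, and it assumes the ladder is cumulative (a stage only counts once every lower stage is satisfied) and that all four stage-four capabilities are required together.

```python
from dataclasses import dataclass
from enum import IntEnum


class Stage(IntEnum):
    MANUAL = 1
    REACTIVE = 2
    RESPONSIVE = 3
    PROACTIVE = 4
    PREVENTATIVE = 5


@dataclass
class OpsProfile:
    """Hypothetical capability flags; names are illustrative only."""
    inbound_integrations: bool = False
    on_call_schedules: bool = False
    multi_level_escalation: bool = False
    outbound_integrations: bool = False
    service_dependencies: bool = False
    change_events: bool = False
    response_plays: bool = False
    event_intelligence: bool = False
    uses_analytics: bool = False


def classify(p: OpsProfile) -> Stage:
    """Return the highest stage reached, assuming stages are cumulative."""
    # Stage one: incidents are still raised by hand.
    if not p.inbound_integrations:
        return Stage.MANUAL
    # Stage two: integrations exist, but no schedules or escalation levels.
    if not (p.on_call_schedules and p.multi_level_escalation):
        return Stage.REACTIVE
    # Stage three: defined call-out process, but not yet the stage-four set.
    if not (p.outbound_integrations and p.service_dependencies
            and p.change_events and p.response_plays):
        return Stage.RESPONSIVE
    # Stage four: proactive tooling, but no event intelligence or analytics.
    if not (p.event_intelligence or p.uses_analytics):
        return Stage.PROACTIVE
    return Stage.PREVENTATIVE


team = OpsProfile(
    inbound_integrations=True,
    on_call_schedules=True,
    multi_level_escalation=True,
)
print(classify(team).name)  # RESPONSIVE
```

A real assessment would weigh process and culture as well as tooling, so treat the result as a conversation starter rather than a verdict.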

With each step up this ladder, the organisation becomes better able to manage its unplanned incidents, and better able to set aside the time and resources needed to schedule work that improves the stability of its operational systems. This operational maturity model takes into account two important factors that reduce the risks from cybersecurity events, incidents and downtime.

  1. Responsiveness. The right training, processes, and solutions, such as a real-time operations platform, help organisations prioritise and manage their urgent work. That means responding faster to incidents – as well as incipient threats. In fact, by integrating incident management and SecOps, security teams can use the same platforms that development and operations teams use, which improves cross-team visibility and reduces collaboration friction.

  2. Proactiveness. Operational maturity and investment in modern practices lead to better response times, along with a host of other, allied benefits. For example, consistency and managing workloads within working hours mean a more even distribution of work across technical teammates, more consistent working hours, and lower attrition and burnout rates. And those are the downstream benefits. Upstream, reliability is improved and, with greater control of the digital environment, the costs of remediating unexpected events are reduced, service quality is delivered as planned, and revenue and customer satisfaction are maintained.

What mature digital operations achieve

  • The business can respond promptly to any and all critical issues.

It shouldn't matter if the organisation is looking at incidents arising from COVID-19, a run on the bank, or the collapse of a supplier. The organisation must have access to the right resources and support for events. Impacted staff must also be kept up-to-date with the information they need.

  • Mitigate disruptions to manufacturing or supply chains.

A central team must be able to react to any gaps in the supply chain to maintain delivery and provide updates to stakeholders.

  • Manage critical issues from wherever.

Technical and leadership teams must have remote-first, distributed crisis management, and the tools to be able to orchestrate a response on any platform. What crisis ever unrolls in cooperation with procedures?

  • Use common platforms for better communication.

Engineers across software development and operations must be alert to evolving, real-time security events. The organisation should have awareness across its environment of operational alerts and active incidents.

  • Spend more time innovating!

Operational maturity, DevOps and full-service ownership offer greater accountability and control. Automated capabilities can quickly and accurately mobilise the right teams to manage operations – planned or unplanned. Machine learning techniques filter out noise and pull in the right people when they are needed.

All in all, proactive, preventative, and mature digital operations put an organisation in the best place to, first, not hear those dreaded words, and second, better manage incidents should they arise. Good for the board to know and measure, good for the business to understand and invest in. Three little words that can be replaced with three more: "All systems operational".