Yesterday the most used social media platforms in the world were inaccessible for 6 hours straight. Later, in a press release, Facebook revealed that the outage was due to configuration changes in their routers. There is no doubt that Facebook has an intense incident response plan, yet a small blind spot resulted in a significant business interruption. So how do we avoid this? The truth is, outages and performance issues are bound to happen in any network.
For retailers, uptime is money and issues can cost thousands of dollars per minute. With infrastructure comprising complex services such as payment gateways, inventory, and mobile applications, maturing digital operations is vital for ensuring services are always on and customers get the best experience.
2019 and 2020 were worlds apart. Our entire ways of working, living, socializing, and learning were changed almost overnight. Over the last 18 months, technical teams have had to double down on all their digital efforts to help their customers adapt to the new normal. At the same time, teams were responsible for more unplanned work than ever as incidents steadily rose. For the first time, we’ve created the State of Digital Operations Report which is based on PagerDuty platform data.
Should all incidents be treated the same? Seems like a simple question, but the answer can have big implications. Think about an employee who contacts the service desk, complaining they can’t log onto their email. If the issue is due to a ‘stale’ password, dropped connection or configuration issue after an update for the email server, then the impact on the organization can be quantified to the lost productivity for the impacted employee or employees.