The astonishing hidden and personal costs of IT downtime (and how predictive analytics might help)

You've heard about the big IT failures like the British Airways shutdown at Heathrow this week. But there are hidden factors that can sap productivity and kill innovation. To learn more, and how AI might help, read on.
Written by David Gewirtz, Senior Contributing Editor

You may have seen the commercial. It's all over YouTube. A repair guy walks up to the front desk in an office building. "I'm here to fix the elevator," he says. The guy behind the main desk responds, "Nothing's wrong with the elevator."

Cue the bouncy music and the neon icon for IBM's Watson. In a friendly voice, Watson says, "My analysis of sensor and maintenance data indicates that Elevator 3 will malfunction in two days."

But systems failures don't just hit elevators. They're everywhere. Of concern to us are the failures that impact IT operations. This problem is front-and-center in the news right now, because of the huge systems failure at British Airways over the holiday weekend.

Apparently, a power system in the company's data center failed. This resulted in cancelled flights for more than 75,000 passengers, about $68 million in passenger reimbursement costs (not including hotel costs), and a 2.8 percent drop in the stock price of parent company IAG.

British Airways isn't the only big operator to be hit with catastrophic IT failures. A Financial Times rundown mentions a 2016 failure at Delta Air Lines, which resulted in the cancellation of 2,300 flights, delaying hundreds of thousands of passengers.

In 2015, thousands of banking customers couldn't cash their paychecks, right before a holiday weekend, because the bank HSBC screwed up a software update.

Another 2015 bank failure, this time at the Royal Bank of Scotland, resulted in 600,000 customer payments and withdrawals going "missing." Making matters worse, this same bank had been fined 56 million pounds (about $72 million) after a 2012 failure hit six million customers.

Continuing our 2015 nightmare rundown was the New York Stock Exchange, where trading was suspended for four hours after an upgrade failed. This one cost about $10 million (or at least that's how much NYSE's parent company Intercontinental Exchange set aside to deal with the ensuing SEC investigation).

The cost of IT downtime

That will do for now. The point is, failures occur. According to Gartner, the average cost of IT downtime is $5,600 per minute. Because businesses differ so much in how they operate, Gartner analyst Andrew Lerner states that downtime can cost $140,000 per hour at the low end, $300,000 per hour on average, and as much as $540,000 per hour at the high end.

As we saw from the NYSE example above, the four-hour downtime at the stock exchange cost at least $2.5 million per hour.
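For readers who want to check the arithmetic, here is a minimal Python sketch that converts the per-minute Gartner figure to an hourly rate and spreads the NYSE set-aside over the four-hour outage. The dollar figures come from this article; the function and variable names are my own, made up for illustration.

```python
# Figures quoted in the article.
GARTNER_COST_PER_MINUTE = 5_600   # dollars per minute of downtime
NYSE_SET_ASIDE = 10_000_000       # dollars set aside by Intercontinental Exchange
NYSE_OUTAGE_HOURS = 4             # hours trading was suspended


def per_hour(cost_per_minute):
    """Convert a per-minute cost to a per-hour cost."""
    return cost_per_minute * 60


def cost_per_outage_hour(total_cost, outage_hours):
    """Spread a total cost evenly over the hours of an outage."""
    return total_cost / outage_hours


gartner_hourly = per_hour(GARTNER_COST_PER_MINUTE)
nyse_hourly = cost_per_outage_hour(NYSE_SET_ASIDE, NYSE_OUTAGE_HOURS)

print(f"Gartner average: ${gartner_hourly:,.0f} per hour")
print(f"NYSE outage:     ${nyse_hourly:,.0f} per hour")
```

The per-minute average works out to $336,000 per hour, in the same ballpark as the $300,000 hourly average quoted above.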

The cost of interruptions

But there are other costs that don't often show up in the headlines. That's the cost of interruptions, especially when IT professionals are interrupted from what might be more productive work.

Take, for example, the interruption that occurs when someone pops into your office to tell you that your email server is down. That interruption, of course, takes the time it takes, plus the time to fix the problem. But did you know that, according to a study by UC Irvine, it takes an average of 23 minutes to refocus and get your head back in the game after an interruption?

There's never just one interruption in a day. Later, let's say that another co-worker pops in to tell you that no one from your Northwest office can get into the CRM system. According to a Carnegie Mellon University study, cognitive function can decrease by 20 percent after an interruption. Yep, you can go from an A-student level to a C-student level, merely because you were distracted by yet another problem.


A recent study, reported in the Washington Post, discussed interruptions in the financial industry. Think about these numbers. Interruptions consume, on average, 238 minutes per day. In addition, the time to get started back up after an interruption consumes another 84 minutes a day. The time lost to stress and fatigue steals another 50 minutes a day.

You'll need to sit down for this. All that adds up to about 6.2 hours per day, or 31 hours per week lost to interruptions. Is it any wonder we're spending most of our time treading water, and innovation seems to have come to a complete standstill?
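If you'd like to verify that total yourself, this short Python sketch adds up the three daily figures from the study. The five-day work week is my assumption (it is what makes the 31-hour figure come out), and the dictionary keys are just labels I picked.

```python
# Daily minutes lost, as reported in the study cited above.
minutes_lost_per_day = {
    "interruptions": 238,
    "restarting_after_interruptions": 84,
    "stress_and_fatigue": 50,
}

WORKDAYS_PER_WEEK = 5  # assumption: a standard five-day work week

total_minutes_per_day = sum(minutes_lost_per_day.values())
hours_per_day = total_minutes_per_day / 60
hours_per_week = total_minutes_per_day * WORKDAYS_PER_WEEK / 60

print(f"{total_minutes_per_day} minutes per day")
print(f"{hours_per_day:.1f} hours per day")
print(f"{hours_per_week:.0f} hours per week")
```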

The innovation gap

IT should be driving the corporate mission forward, but that's not normally the case. Forrester did a study where it looked at the amount of IT budget devoted to what it calls Tech MOOSE vs. advancing the company's mission. MOOSE stands for Maintain and Operate the Organization, Systems, and Equipment -- basically ongoing maintenance and operations.

According to Forrester analyst Andrew Bartels, 69 percent of IT budgets go to MOOSE maintenance. Only 31 percent of IT budgets are allocated to new projects. That may not seem all that bad until you realize that only 14 percent of that 31 percent (or about 4.3 percent of the overall IT budget) goes to so-called "sell side" investments: investments in customer-facing opportunities and new business.
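The budget split is easy to check with a few lines of Python. The percentages are the Forrester figures quoted above; the variable names are mine.

```python
# Forrester figures quoted above, as percentages of the IT budget.
MOOSE_PCT = 69             # maintenance and operations
NEW_PROJECTS_PCT = 31      # everything that is not MOOSE
SELL_SIDE_OF_NEW_PCT = 14  # share of new-project spending that is "sell side"

# Sell-side spending as a share of the overall IT budget.
sell_side_overall_pct = NEW_PROJECTS_PCT * SELL_SIDE_OF_NEW_PCT / 100

print(f"MOOSE + new projects = {MOOSE_PCT + NEW_PROJECTS_PCT} percent")
print(f"Sell side = {sell_side_overall_pct:.1f} percent of the overall budget")
```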

There must be a better way

You'll notice that I haven't even mentioned the cost of cybersecurity problems. I'm leaving that entire cluster of nightmares out of this discussion to help crystallize the IT maintenance issue that has nothing to do with attacks.

We've showcased how some IT failures have cost millions of dollars. We've explored how analysts estimate that average failures cost hundreds of thousands of dollars an hour. We've learned how interruptions can cost us most of our productive time, and sap us cognitively. Finally, we've seen how almost none of the typical IT budget goes to increasing sales and improving the customer experience.

That's where we come back to the elevator commercial I mentioned at the beginning of this article. IBM is advertising Watson, its AI-based system for predictive analytics. Watson is far from the only system doing predictive analytics. That's because the idea of predicting what might happen is so powerful.

You're never going to eliminate all interruptions and failures. But many IT disasters can be predicted. Thanks to an ever-growing knowledge base, the ability to add telemetry to systems and virtual machines, the ability to maintain huge data sets of historical performance information, and the growth in the science of machine learning, it is possible to predict many of the problems that would normally knock us back.
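To make the idea concrete, here is a deliberately tiny Python sketch of prediction from historical telemetry. It fits a straight line to past readings and estimates when the trend will cross an alert threshold. This is a toy of my own, not how Watson or any vendor's product works; the disk-usage readings, the 90 percent threshold, and the function names are all made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(history, threshold):
    """Estimate days after the last reading until the fitted trend
    reaches the threshold. `history` holds one reading per day,
    oldest first. Returns None if the trend is flat or falling."""
    if len(history) < 2:
        raise ValueError("need at least two readings to fit a trend")
    xs = list(range(len(history)))
    slope, intercept = fit_line(xs, history)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - xs[-1])


# Hypothetical daily disk-usage readings, in percent.
disk_used_pct = [70, 72, 74, 76, 78]
days_left = days_until_threshold(disk_used_pct, threshold=90)
print(f"Estimated days until 90 percent full: {days_left:.1f}")
```

Real predictive systems draw on far richer data and models, but the principle is the same: past behavior is used to flag a problem before it interrupts anyone.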

If you look at the numbers I've taken you through, you can see that the scale of the losses is huge. Even if just a small number of events are predicted ahead of time, getting ahead of the ball on those few events would have a positive impact.

We'll be looking a lot more at how big data and predictive analytics can help with IT maintenance here at ZDNet. In fact, I'm having a discussion tomorrow with a vendor who has been looking at this problem, particularly when it comes to flash storage. The company, Nimble Storage, is working with the idea that if they can see patterns in storage behavior, those patterns can predict issues elsewhere in the IT stack.

Please join Nimble's David Wang and me in "Deep learning instead of deep trouble with predictive flash," a live and interactive webcast at 2pm tomorrow here on ZDNet about how your company can use powerful analytics together with fast storage to get ahead of problems across your entire infrastructure.

You can follow my day-to-day project updates on social media. Be sure to follow me on Twitter at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
