Disaster Recovery Report - Quorum's view of causes of IT failures

Quorum reviewed the records of its own call center and produced a report on the primary causes of IT workload failure. The results are interesting, but they cannot be considered representative of the market as a whole.

Larry Lang, CEO of Quorum, recently took the time to run through his company's Disaster Recovery Report, Quarter 1 2013. Since I've often commented on surveys (the good, the bad and the really ugly), I thought I'd comment on Quorum's report as well.

The sample

One of the biggest issues I have with most surveys is that the sample doesn't represent the market as a whole. More often, the respondents are simply the attendees of a company's own event.

To compound the problem, the limited sample is analyzed and the results are presented as if they represented the entire worldwide market. As a consequence, the findings can be seen as self-serving and only marginally useful for learning about the industry as a whole.

Quorum is up front about the fact that the report comes from a careful analysis of its own call center's data. So the results can, at best, be seen as representing Quorum's own installed base rather than shining a light on the thinking of the industry's decision-makers.

Here's how Quorum describes the data:

Quorum derived statistics from incoming calls in its IT support center, representing a cross-section of Quorum's hundreds of customers. Quorum's customers are small- to medium-sized businesses that span a wide variety of industries in the United States, EMEA, and Asia/Pacific.

It is clear that the findings must be considered indicative of Quorum's own customers and not necessarily representative of the market as a whole.

Summary of Quorum's findings

Quorum's analysis of its call-center data led the company to present the following information. The top causes of failure are:

  • 55% hardware failure
  • 22% human error
  • 18% software failure
  • 5% natural disasters
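
As an illustration of how a breakdown like this can be produced from call records, here is a minimal Python sketch. The call log is hypothetical: it is not Quorum's data, and its counts were chosen only so that the output matches the percentages listed above. The function name cause_breakdown is also an assumption made for this example.

```python
from collections import Counter

# Hypothetical call log: one failure-cause label per support call.
# The counts are illustrative only and simply mirror the published percentages.
calls = (
    ["hardware failure"] * 55
    + ["human error"] * 22
    + ["software failure"] * 18
    + ["natural disasters"] * 5
)


def cause_breakdown(records):
    """Return each cause's share of the records as a whole-number percentage."""
    counts = Counter(records)
    total = sum(counts.values())
    # most_common() orders causes from most to least frequent.
    return {cause: round(100 * n / total) for cause, n in counts.most_common()}


breakdown = cause_breakdown(calls)
for cause, pct in breakdown.items():
    print(f"{pct}% {cause}")
```

Whatever the tally shows, it describes only the calls that were logged. A breakdown built this way reflects the customers who phoned the support center, not the market as a whole.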

Quorum went on to review the ways most companies prepare for disasters, including the following:

  • Tape and disk backup: The traditional approach to disaster planning. Quorum notes that setting up this type of backup can be complex and that it may be difficult to recover entire distributed, multi-tier, multi-site workloads using this method.
  • Cloud backup: An up-and-coming approach. While this method appears appealing, Quorum says, it may actually increase the time it takes to return to normal operations rather than reduce it.
  • Hybrid cloud backup: The combination of traditional tape/disk backup with cloud backup. Quorum points out that this makes it possible to keep an up-to-date image of what's executing. Furthermore, Quorum states, it would be possible to immediately return to operations in a cloud environment.

Quorum's recommendations

Quorum's conclusion is that organizations are best served by setting up a "continuous back-up process" that relies on moment-by-moment snapshots kept in the cloud.

Snapshot Analysis

I've read quite a number of studies that focused on causes of disasters and suggested approaches to disaster planning. While I was with IDC, I worked with the team that conducted research in this area.

Those studies often showed that human error accounted for a much larger share of IT failures than Quorum's figures suggest, with hardware and software problems responsible for a much smaller segment. That being said, Quorum's customer base might have better administrative tools and processes than the market as a whole, and so its results would be skewed towards system or system software failures.

In my view, Quorum is right to suggest that having a disaster plan and tools in place to constantly monitor workload execution would turn most "disasters" into momentary irritations rather than events that put companies at risk.

If you're interested in reading the report, please visit Quorum's website for more information.