Disaster Recovery Report - Quorum's view of causes of IT failures

Summary: Quorum reviewed the records of its own call center and produced a report stating the primary causes of IT workload failure. While very interesting, the results cannot be considered representative of the market as a whole.

Larry Lang, CEO of Quorum, recently took the time to run through his company's Disaster Recovery Report, Quarter 1 2013. Since I've often commented on surveys, the good, the bad and the really ugly, I thought I'd take the time to comment on Quorum's report.

The sample

One of the biggest issues I have with most surveys is that the sample doesn't represent the market as a whole. More often than not, the respondents are simply the attendees of a company's own event.

To compound the problem, the limited sample is analyzed and the results are presented as if they represent the entire worldwide market. The result is that the survey results can be seen as self-serving and only marginally useful when it comes to learning more about the industry as a whole.

Quorum is up front that the report comes from a careful analysis of its own call center's data. So the results can, at best, be seen as representing Quorum's own installed base rather than shining a light on the thinking of the industry's decision-makers.

Here's how Quorum describes the data:

Quorum derived statistics from incoming calls in its IT support center, representing a cross-section of Quorum's hundreds of customers. Quorum's customers are small- to medium-sized businesses that span a wide variety of industries in the United States, EMEA, and Asia/Pacific.

It is clear that the findings must be considered indicative of Quorum's own customers and not necessarily representative of the market as a whole.

Summary of Quorum's findings

Quorum's analysis of its call-center data led the company to present the following information. The top causes of failure are:

  • 55% hardware failure
  • 22% human error
  • 18% software failure
  • 5% natural disasters
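
As a quick sanity check on these figures, the shares can be tallied in a few lines of Python. This is only a sketch: the percentages are the ones listed above, while the variable names and the "everyday causes" grouping (everything other than natural disasters) are my own assumptions, not Quorum's.

```python
# Shares of IT failure causes as listed in Quorum's report (percent of cases).
causes = {
    "hardware failure": 55,
    "human error": 22,
    "software failure": 18,
    "natural disasters": 5,
}

# The four categories should account for every reported case.
total = sum(causes.values())

# "Everyday" causes: everything except natural disasters (my grouping).
everyday = total - causes["natural disasters"]
ratio = everyday / causes["natural disasters"]

print(f"total: {total}%")
print(f"everyday causes: {everyday}%")
print(f"everyday vs. natural disasters: {ratio:.0f} to 1")
```

By this tally the four categories cover all reported cases, and everyday causes outnumber natural disasters 19 to 1 within Quorum's own case records.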

Quorum went on to review the ways most companies prepare for disasters including the following:

  • Tape and Disk backup — Traditional approach to disaster planning. Quorum cites the fact that setting up this type of backup can be complex and it may be difficult to recover entire distributed, multi-tier, multi-site workloads using this method.
  • Cloud backup — An up-and-coming approach. While this method appears appealing, Quorum says, it may actually increase recovery time rather than reducing the time it takes to return to normal operations.
  • Hybrid cloud backup — The combination of the traditional tape/disk backup with cloud backup. Quorum points out that this makes it possible to keep an up-to-date image of what's executing. Furthermore, Quorum states, it would be possible to immediately return to operations in a cloud environment.

Quorum's recommendations

Quorum's conclusion is that organizations are best served by setting up a "continuous back-up process" that relies on moment-by-moment snapshots kept in the cloud.

Snapshot Analysis

I've read quite a number of studies that focused on causes of disasters and suggested approaches to disaster planning. While I was with IDC, I worked with the team that conducted research in this area.

Those studies often showed that the human element accounted for a much larger percentage of IT failures, while hardware and software problems were responsible for a much smaller share. That being said, Quorum's customer base might have better administrative tools and processes than the market as a whole, which would skew the results towards system or system software failures.

In my view, Quorum is right to suggest that having a disaster plan and tools in place to constantly monitor workload execution would turn most "disasters" into momentary irritations rather than events that put companies at risk.

If you're interested in reading the report, please visit Quorum's website for more information.

About

Daniel Kusnetzky, a reformed software engineer and product manager, founded Kusnetzky Group LLC in 2006. He's literally written the book on virtualization and often comments on cloud computing, mobility and systems software. In his spare time, he's also the managing partner of Lux Sonus LLC, an investment firm.

Talkback

3 comments
  • Statistics

    Thanks for the thoughtful commentary on the Quorum study! As the son of a science professor, I certainly agree that the statistics offered are not exact. That said, we think they're interesting because they come from empirical observations in our support center, rather than a survey of opinions and recollections, subject to cognitive bias.

    The analysis arose from the aftermath of Hurricane Sandy, which prompted much discussion of disaster recovery. Our intuition suggested that, while such natural disasters grab headlines, mundane problems like hard disk failure or firmware errors are much more likely to cause issues. The case records bear that out, by a factor of almost 20.

    I also suspect that human error is under-reported in these statistics. If a customer tells Quorum he needs help recovering from a hardware failure, we don't press the issue to discover if it was actually precipitated by spilling coffee into the server! :-)
    llang629
    • Human error is often underreported

      I agree that human error is often underreported. Managers seldom want to say that someone on their team made a blunder or misstep that cost the company a great deal of money. There are a few organizations that get data center and IT managers to talk about their mistakes and what they've learned from them. One of them is the Site Uptime Network, which is managed by the Uptime Institute, which in turn is associated with my former employer, the 451 Group.

      Dan K
      dkusnetzky
      • DR Testing & Cloud-Based DR

        Human error can be avoided with better planning and full DR testing (as well as staff training). Testing with traditional disaster recovery can be time-consuming and disruptive of production environments, while cloud-based disaster recovery is much faster. Learn more in: http://www.onlinetech.com/resources/white-papers/disaster-recovery
        onlinetech