RAID 5: theory & reality

RAID 5 pain is on message boards and support forums all over the net. Failed rebuilds, lost data, unhappy bosses. Why isn't RAID 5 as reliable as it is supposed to be?
Written by Robin Harris, Contributor

In theory, RAID 5 protects your data. In reality, RAID 5 is often a painful failure. Why? Mean time-to-data-loss (MTTDL) is a fraud: actual rates of double-disk failures are 2 to 1500 times higher than MTTDL predicts.

What's behind MTTDL's failure? In "A Highly Accurate Method for Assessing Reliability of RAID," researchers Jon G. Elerath of NetApp - a major storage vendor - and Prof. Michael Pecht of the University of Maryland compared RAID theory against actual field data. They found that MTTDL calculations are inaccurate for 3 reasons:

  • Errors in statistical theory of repairable systems.
  • Incomplete consideration of failure modes.
  • Inaccurate time-to-failure distributions.

By repairing MTTDL's theoretical basis, adding real-world failure data and using Monte Carlo simulations, they found that today's MTTDL estimates are wildly optimistic. Which means your data is a lot less safe with RAID 5 than you think.

Repairable systems

The typical MTTDL assumption is that once repaired - i.e. a disk is replaced with a new one - the RAID system is as good as new. But this isn't true: at best, the system is only slightly better than it was right before the failure.

One component is new - but the rest are as old and worn as they were before the failure - so the system is not "like new." The system is no less likely to fail after the repair than it was before.

The problem is that in RAID arrays repairs take time: the disk fails; a hot spare or a new disk is added; and the data rebuild starts - a process that can take hours or days - while the other components continue to age.

Net net: MTTDL calculations use the wrong failure distributions and incorrectly correlate component and system failures.
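To see why the classic number looks so comforting, here is a minimal sketch of the textbook RAID 5 approximation, MTTDL = MTTF^2 / (N x (N-1) x MTTR). The formula is the commonly cited one, not something taken from the Elerath and Pecht paper, and the 1,000,000 hour MTTF is a hypothetical vendor-style figure. The 4 drives and 40 hour restore echo examples used elsewhere in this post.

```python
HOURS_PER_YEAR = 24 * 365


def classic_mttdl_hours(n_disks: int, mttf_hours: float, mttr_hours: float) -> float:
    """Textbook RAID 5 approximation: MTTF^2 / (N * (N - 1) * MTTR).

    It assumes constant failure rates, independent disks and an array that is
    "as good as new" after every repair, which are the assumptions the
    researchers dispute.
    """
    if n_disks < 2:
        raise ValueError("RAID 5 needs at least 2 disks for this formula")
    if mttf_hours <= 0 or mttr_hours <= 0:
        raise ValueError("MTTF and MTTR must be positive")
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)


# Hypothetical inputs: 4 disks, 1,000,000 h MTTF, 40 h restore time.
mttdl = classic_mttdl_hours(n_disks=4, mttf_hours=1_000_000, mttr_hours=40)
print(f"Classic MTTDL: about {mttdl / HOURS_PER_YEAR:,.0f} years")
```

With these inputs the formula promises a couple of hundred thousand years between data-loss events, which is exactly the kind of figure the field data contradicts.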

Failure modes

MTTDL typically considers only catastrophic disk failures. But I've noted [see Why RAID 5 stops working in 2009, RAIDfail: Don't use RAID 5 on small arrays and Why disks can't read - or write] disks have latent errors as well. A catastrophic failure + latent error is a dual-disk failure, something RAID 5 can't handle.
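A rough sketch of how a latent error turns one dead disk into a failed rebuild: to rebuild, every bit on every surviving disk has to be read back cleanly. The unrecoverable read error rate of 1 in 10^14 bits used below is an assumption (a figure often seen on SATA spec sheets), and treating bit errors as independent is a simplification, so take the result as illustrative only.

```python
import math


def p_latent_error_during_rebuild(surviving_disks: int,
                                  disk_bytes: float,
                                  ure_per_bit: float) -> float:
    """Chance of at least one unreadable bit while reading all survivors.

    Computes 1 - (1 - p)^bits using log1p/expm1 so that a tiny per-bit
    error rate does not get lost to floating-point rounding.
    """
    bits_read = surviving_disks * disk_bytes * 8
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Hypothetical array: 4 x 2 TB drives, one has failed, 3 must be read in full.
p = p_latent_error_during_rebuild(surviving_disks=3,
                                  disk_bytes=2e12,
                                  ure_per_bit=1e-14)  # assumed spec-sheet rate
print(f"Chance the rebuild hits a latent error: {p:.0%}")
```

Under those assumptions the rebuild trips over an unreadable sector more than a third of the time, before counting any second whole-disk failure.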

Anatomy of a RAID failure

There are 4 transition events in a RAID 5 failure:

  • Time to operational failure. Drive failure distributions are not constant. Sub-populations of drives may have specific failure modes, like infant mortality, that MTTDL models do not account for.
  • Time to restore. Minimum restore times are functions of several variables, including HDD capacity, HDD data rate, data bus bandwidth, number of HDDs on the bus and the on-going I/O load on the array. A 2 TB drive might take 40 hours or more to restore.
  • Time to latent defect. Latent defect rates vary with usage, age and drive technology.
  • Time to scrub. Scrubbing is a background process meant to find and repair latent errors. Busy systems have less time to scrub, which increases the chance of a latent error hosing a RAID 5 rebuild. Scrub strategy has a major impact on latent error rates.
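The restore-time variables listed above can be tied together with some back-of-the-envelope arithmetic. This is a minimal sketch, not the researchers' model: the function name, the way the bus is split evenly across drives, and every rate below are hypothetical numbers picked for illustration.

```python
def min_restore_hours(capacity_bytes: float,
                      disk_rate_bytes_per_s: float,
                      bus_rate_bytes_per_s: float,
                      disks_on_bus: int,
                      rebuild_share: float) -> float:
    """Estimate the minimum hours needed to rebuild one replaced drive.

    rebuild_share is the fraction of throughput left for the rebuild after
    the array's ongoing I/O load takes its cut (0 < rebuild_share <= 1).
    """
    if not 0 < rebuild_share <= 1:
        raise ValueError("rebuild_share must be in (0, 1]")
    if disks_on_bus < 1:
        raise ValueError("need at least one disk on the bus")
    # Assumption: the bus bandwidth is shared evenly by every drive on it.
    per_disk_bus_rate = bus_rate_bytes_per_s / disks_on_bus
    effective_rate = min(disk_rate_bytes_per_s, per_disk_bus_rate) * rebuild_share
    return capacity_bytes / effective_rate / 3600


# Hypothetical: 2 TB drive, 100 MB/s disk, 300 MB/s bus shared by 4 drives.
idle = min_restore_hours(2e12, 100e6, 300e6, 4, rebuild_share=1.0)
busy = min_restore_hours(2e12, 100e6, 300e6, 4, rebuild_share=0.2)
print(f"Idle array: {idle:.1f} hours, busy array: {busy:.1f} hours")
```

On an idle array the rebuild finishes in well under a day. Leave it only a fifth of the throughput and the same drive takes about 37 hours, in the neighborhood of the 40 hour figure above.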

Using field-validated distributions for these 4 transition events and Monte Carlo simulations, the researchers concluded:

"The model results show that including time-dependent failure rates and restoration rates along with latent defects yields estimates of [dual-disk failures] that are as much as 4,000 times greater than the MTTDL-based estimates."
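For a feel of how such a simulation works, here is a toy Monte Carlo sketch of the four transition events. It is not the researchers' model and uses none of their field data: the Weibull shape and scale, the latent defect rate and the weekly scrub are all hypothetical, the restore time is fixed instead of drawn from a distribution, and each scrub is assumed to clear every latent defect.

```python
import math
import random


def group_loses_data(rng, n_disks, mission_h, shape, scale_h,
                     restore_h, defect_rate_per_h, scrub_interval_h):
    """Simulate one RAID 5 group; return True if it loses data in mission_h."""
    # Time to operational failure: Weibull, so the failure rate is not constant.
    fail_at = [rng.weibullvariate(scale_h, shape) for _ in range(n_disks)]
    while True:
        first = min(range(n_disks), key=fail_at.__getitem__)
        t = fail_at[first]
        if t > mission_h:
            return False
        survivors = [fail_at[j] for j in range(n_disks) if j != first]
        # Second operational failure before the restore completes.
        if min(survivors) <= t + restore_h:
            return True
        # Latent defects accumulate since the last scrub (Poisson arrivals).
        exposure_h = t % scrub_interval_h
        p_defect = 1.0 - math.exp(-defect_rate_per_h * exposure_h)
        if any(rng.random() < p_defect for _ in survivors):
            return True
        # Only the replaced disk is new; the survivors keep their age.
        fail_at[first] = t + restore_h + rng.weibullvariate(scale_h, shape)


def estimate_loss_probability(trials, seed=0, **params):
    """Fraction of simulated groups that lose data."""
    rng = random.Random(seed)
    return sum(group_loses_data(rng, **params) for _ in range(trials)) / trials


# Every value here is a hypothetical placeholder, not field data.
PARAMS = dict(n_disks=4, mission_h=10 * 8760, shape=1.2, scale_h=500_000,
              restore_h=40, defect_rate_per_h=1e-4, scrub_interval_h=168)

with_latent = estimate_loss_probability(20_000, **PARAMS)
without_latent = estimate_loss_probability(
    20_000, **{**PARAMS, "defect_rate_per_h": 0.0})
print(f"10-year loss probability with latent defects:    {with_latent:.4f}")
print(f"10-year loss probability, disk failures only:    {without_latent:.4f}")
```

Even in this crude version, turning latent defects on dominates the result, and shortening scrub_interval_h pulls it back down. The exact ratios depend entirely on the made-up inputs.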

Which is why RAID 5 has caused so much trouble to so many people over the last 20 years.

The Storage Bits take

As a practical matter, don't rely on a 4-drive SATA RAID 5 array to protect your data. The chance of a latent error and a hosed rebuild is too great - much greater than the product's engineers probably believe.

If you must use a RAID array - and I don't recommend it for home users unless you have system admin skills - make sure it protects against 2 disk failures (RAID 6 or equivalent). That means a 5 drive array at a minimum.

But there is a larger pattern here. Disk drives have a higher failure rate than vendors spec. DRAM also has a much higher error rate than commonly believed. And file systems are also flakier than they should be.

Weird, huh? Every critical element of data integrity turns out to be much worse than commonly thought.

This isn't a conspiracy so much as a natural vendor reluctance to give out bad news about their products. That's why we need independent observers to check out product claims.

But the bigger issue with storage is that the Universe hates your data. If there's a failure mode hidden somewhere, the Universe will find it and use it.

Long term data integrity on a massive scale will require a re-tooling of vendor development and test. Detroit did that with statistical process control over the last 30 years and massively improved quality as a result.

The current piecemeal approach to mending subsystems needs to give way to a complete end-to-end systems design for data integrity. But the reality of still-rapidly-evolving storage technologies probably puts that effort at least 2 decades away.

In the interim remember that your data needs protection. Let's be careful out there.

Comments welcome, of course. I've done work for NetApp and admire the good works of co-founder Dave Hitz.
