Why RAID 5 still works, usually

In 2007 I predicted that RAID 5 would stop protecting against data loss. And yet, as 2017 dawns, many people are using RAID 5 successfully. No, I wasn't wrong. Here's what changed - and how to save your data when a RAID 5 drive dies.

Data erosion - and death - is natural.

Robin Harris

How does RAID 5 reduce data loss?

RAID 5 takes your data and adds some parity data that makes it possible to reconstruct the original data if there is a drive failure (RAID 6 is similar, except it can reconstruct after two failures). So why would it stop working?
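To make the parity idea concrete, here is a minimal sketch of XOR parity, the scheme RAID 5 uses. The block contents and the fixed parity position are illustrative assumptions; a real array works on much larger stripes and rotates parity across all the drives.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, as if striped across three drives (made-up values).
data = [b"RAID", b"five", b"demo"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

Rebuilding needs every surviving block to be readable, which is why a single read error during reconstruction matters so much.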

The URE problem

RAID 5 works fine when there are no further failures or errors during data reconstruction. Back in 2007 though, almost all SATA drives, and many SCSI drives, were spec'd at one Unrecoverable Read Error (URE) per 10^14 bits read. That's one URE for every 12.5TB read.

One terabyte drives were coming into production then. If you had an 8-drive RAID 5 array, and one drive failed, the RAID controller would have to read 7TB of data from the surviving drives to reconstruct the failed drive.

That meant roughly a 43 percent chance, if read errors occur independently at the spec'd rate, that a URE would scuttle the entire reconstruction. When that happens, it would have been faster to rebuild the data from a backup.
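A rough way to put a number on that risk is sketched below. It assumes every bit read fails independently at exactly the spec'd rate, which is a simplification: spec sheets quote a ceiling, and real errors tend to cluster. The function name and constants are illustrative.

```python
import math

def ure_probability(bytes_read, bits_per_ure):
    """Chance of at least one URE while reading bytes_read bytes,
    assuming independent errors at a rate of 1 per bits_per_ure bits."""
    bits_read = bytes_read * 8
    # Equivalent to 1 - (1 - 1/bits_per_ure) ** bits_read,
    # written with log1p/expm1 to stay accurate for tiny error rates.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_ure))

TB = 10**12  # drive vendors use decimal terabytes

# Reading 7TB from the surviving drives at 1 URE per 10^14 bits.
p = ure_probability(7 * TB, 10**14)
print(f"Chance of a URE during the rebuild: {p:.0%}")
```

Treat the result as a ballpark for comparing drive specs, not as a prediction for any particular array.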

Of course, drives have only gotten bigger. Four terabyte drives are common and we now have 10TB drives.

Why does RAID 5 still work?

Simple: drive vendors upped the spec - for some drives - to one URE per 10^15 bits read, or about one every 125TB. Of course, now that drive capacities have also increased by 10x, the problem of failure due to a URE during reconstruction is coming back.
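The way a 10x better spec is cancelled out by 10x bigger drives can be sketched with a quick comparison. This uses the common exponential approximation for independent errors; the 8-drive array and the drive sizes are example values, not measurements.

```python
import math

def rebuild_risk(drives, drive_tb, bits_per_ure):
    """Approximate chance of hitting a URE while rebuilding one failed
    drive in a RAID 5 array, assuming independent errors at the spec'd rate."""
    # A rebuild reads every surviving drive in full.
    bits_read = (drives - 1) * drive_tb * 1e12 * 8
    return 1 - math.exp(-bits_read / bits_per_ure)

# (drive size in TB, bits read per URE)
scenarios = [(1, 1e14), (1, 1e15), (10, 1e15)]

for drive_tb, bits_per_ure in scenarios:
    risk = rebuild_risk(8, drive_tb, bits_per_ure)
    print(f"8 x {drive_tb}TB at 1 URE per {bits_per_ure:.0e} bits: {risk:.0%}")
```

Under these assumptions the first and last scenarios carry the same risk, because the data read and the bits per URE both grew by the same factor.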

Seagate's Barracuda Pro and WD's Gold datacenter drives are spec'd to less than 1 URE in 10^15 bits read.

However, many other large-capacity drives aren't at the higher spec. If you use a low-spec drive in a RAID array, there's a good chance the rebuild won't work.

It pays to look at spec sheets if you have critical applications or data. Or you can do what I do.

The Storage Bits take

I have a couple of 4-drive RAID 5 arrays. I don't worry about the URE problem because I have all the critical data backed up to the cloud.

If a drive fails, your first action should be to copy all data off the array before replacing the failed drive. If you encounter a URE during copying, at least you've saved all the other data. Not all low-cost RAID controllers report read errors, so you might copy a corrupted file, but that would have happened anyway.
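The copy-first approach can be sketched as a small script that keeps going when a single file fails to read. It assumes the operating system reports the URE as an I/O error on that file. The function name and the mount points in the usage comment are hypothetical.

```python
import os
import shutil

def salvage(src_root, dst_root):
    """Copy every file under src_root into dst_root, keeping the directory
    layout. Files that raise an OS-level error are skipped and returned
    as a list of (path, error) pairs."""
    failed = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            try:
                # copy2 also preserves timestamps where it can
                shutil.copy2(src, os.path.join(target_dir, name))
            except OSError as err:
                # One unreadable file should not stop the rest of the copy.
                failed.append((src, err))
    return failed

# Example usage (hypothetical mount points):
# for path, err in salvage("/mnt/raid", "/mnt/rescue"):
#     print(f"could not copy {path}: {err}")
```

A file that fails partway may leave a truncated copy at the destination, so restore anything on the failed list from backup rather than trusting the partial copy.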

This reiterates the core premise of RAID: it provides data access after drive failures and is NOT a substitute for backup. Fortunately, hard drives are getting more reliable, so your chance of needing this advice is declining.

But as drive capacities continue to rise, vendors need to raise their URE spec. When will they do it?

Courteous comments welcome, of course.
