Sorry about your broken RAID 5

You didn't know? I'm sorry I was the one to tell you that RAID 5 is broken today and will be well and truly broken in 2009 (see Why RAID 5 stops working in 2009), but somebody had to do it. The good news is that the industry is ahead of you developing solutions.

I found the negative response to my last post on the unrecoverable read error (URE) issue fascinating. A number of informed people commented, correcting my math - I took 2 statistics courses in grad school, but that was a long time ago - and taking issue with some of my arguments. All good.

What was interesting to me was that my post didn't say anything that people in the industry haven't known for years. For example, this Intel white paper published last year:

Intelligent RAID 6 Theory Overview And Implementation

RAID 5 systems are commonly deployed for data protection in most business environments. However, RAID 5 systems only tolerate a single drive failure, and the probability of encountering latent defects [i.e. UREs, among other problems] of drives approaches 100 percent as disk capacity and array width increase.

Every engineer in the RAID business knows this. So a) why don't technically oriented ZDNet readers know it, and b) why the emotional response to a statistical argument grounded in the drive vendors' own specs?

Misplaced faith in RAID

Beyond the issues with my communication skills I saw several themes:

  • My RAID works great (and therefore always will?)
  • Sensationalism, hype and I don't believe you. La-la-la-la-la!
  • Power factors always surprise people.

It reminded me of a comment from a SOHO/SMB RAID designer a few months back:

I was a big proponent of RAID until I found that our customers were placing so much faith in RAID that they were putting all their data on the NAS and then _deleting_ it from ALL other locations. In many cases, they had no off-site storage strategy for their data.

Array vendors take this seriously

Regular readers know I'm not a fan of the array vendors. I'm critical of an architecture where the raw disk capacity comprises only 10% of the cost of a "solution." I believe there are better ways to protect data economically.

Yet industry engineers do take data availability and integrity very seriously. They see most problems well before customers because they are working with the largest population of equipment.

That's why almost every vendor offers some version of RAID 6 to protect against double errors. Even with enterprise disks, whose smaller capacity and error rate of 1 URE per 10^15 bits read (about 1 URE every 125 TB) make data loss from a disk failure + URE much less likely, RAID 6 is often recommended, because in mission-critical environments even a 1% chance of an array read error after a disk failure is too great.
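For readers who want to check the arithmetic, here is a minimal Python sketch of the URE math. It assumes bit errors are independent and occur exactly at the spec'd rate, which is a simplification. The function names, the 12 TB array size and the 10^14 comparison rate are my own illustrative assumptions, not figures from any vendor's data sheet.

```python
import math

TB = 1e12  # decimal terabyte, as drive vendors count it


def bytes_per_ure(bits_per_error: float) -> float:
    """Average number of bytes read per unrecoverable read error."""
    return bits_per_error / 8


def p_ure_during_read(bytes_read: float, bits_per_error: float) -> float:
    """Probability of at least one URE while reading bytes_read bytes.

    Assumes independent bit errors: P = 1 - (1 - 1/bits_per_error) ** bits.
    log1p/expm1 keep the result accurate for very small per-bit rates.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-1.0 / bits_per_error))


# 10^15 bits per error works out to one URE per 125 TB read.
print(bytes_per_ure(1e15) / TB)  # 125.0

# Reading 1.25 TB of surviving disks at the 10^15 rate: roughly a 1% chance.
print(p_ure_during_read(1.25 * TB, 1e15))

# Hypothetical comparison: reading 12 TB at a 10^14 rate: roughly 0.62.
print(p_ure_during_read(12 * TB, 1e14))
```

The point of the sketch is how quickly the probability climbs: it depends only on the product of the data read during the rebuild and the per-bit error rate, so a tenfold change in either moves the result dramatically.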

The industry isn't stopping there

Some other initiatives include:

  • 4K sectors - Drive vendors have been lobbying OS vendors for years to raise the block size from 512 bytes to 4KB, which enables more robust ECC without a big capacity hit. Word is that Microsoft might actually, maybe, do it. Next time you see Ballmer, ask him about it. Why wait for Apple to do it first?
  • Many arrays do background sector scrubbing, looking for sectors with currently recoverable read errors and rewriting or removing them before they cause a problem.
  • NAS boxes that virtualize disks as a pool of blocks can combine their file system knowledge to enable data redundancy on a per-file basis for greater availability. A URE on an unused block isn't a problem since the NAS file system knows what blocks are in use and which aren't.
  • Advanced file systems like ZFS, which combine file system and volume management functionality, can combine their parity data with parent-block checksums to perform ". . . combinational reconstruction of a RAID set." (Thanks, Joerg!)
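To make the last item more concrete, here is a toy Python sketch of the general idea behind combining parity with a checksum stored outside the stripe. This is not ZFS code and makes no claim about how ZFS implements it. The function names, the single XOR parity block and the use of SHA-256 are all assumptions chosen for illustration. The idea is that when a block is silently corrupt, you can assume each block in turn is the bad one, rebuild it from parity, and let the checksum tell you which guess was right.

```python
import hashlib
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))


def checksum(blocks):
    """Checksum over the whole stripe; assumed to be stored elsewhere."""
    return hashlib.sha256(b"".join(blocks)).digest()


def reconstruct(data_blocks, parity, expected):
    """Return a stripe matching the expected checksum, or None.

    Tries the stripe as-is, then assumes each data block in turn is
    corrupt and rebuilds it from the other blocks plus parity.
    """
    if checksum(data_blocks) == expected:
        return list(data_blocks)
    for i in range(len(data_blocks)):
        others = [b for j, b in enumerate(data_blocks) if j != i]
        trial = list(data_blocks)
        trial[i] = xor_blocks(others + [parity])
        if checksum(trial) == expected:
            return trial
    return None  # more damage than single parity can repair


# Usage: corrupt one block without telling anyone which one.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripe)
good_sum = checksum(stripe)
damaged = [b"AAAA", b"BxBB", b"CCCC"]
repaired = reconstruct(damaged, parity, good_sum)
print(repaired == stripe)  # True
```

A plain RAID 5 controller has no such checksum, so it can only rebuild a block when the drive itself reports the error. That difference is what the checksum buys you.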

That list just scratches the surface of all the work the industry is doing to ensure data availability and integrity as disk drives continue their capacity growth. RAID 5 is reaching its end of life, but your data can still be safe despite that.

Comments welcome, as always. Industry folks, what else is happening to manage this issue?