Your storage system is not large enough
storagelunatic 22nd Oct 2008
The reason you do not see this problem is that you do not have enough disk drives to be statistically significant. 1-2TB is a couple of disk drives.

In a typical enterprise data center, there are hundreds to thousands of disk drives. All of the data centers I have worked with that have a decent number of disk drives do, in fact, see this problem. Hence, this is much ado about something.

I do think that the problem is exacerbated by how the RAID controller behaves when it encounters a failure on a disk drive. When a RAID controller gets an uncorrectable read error on a single disk drive in a RAID set, it will often simply shut that drive down and begin a rebuild operation onto the hot spare. Now, enter the probability of a second read failure on one of the remaining drives in the RAID set during that rebuild. That second failure will quite possibly cause the RAID controller to give up.
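
To put a rough number on that second-failure exposure, here is a back-of-the-envelope sketch. It assumes an unrecoverable read error rate of 1 per 10^14 bits read (a figure commonly quoted on consumer SATA spec sheets, not something measured here) and that errors are independent, which real drives do not strictly obey. The function names and the 6 x 1TB example are mine, purely for illustration.

```python
import math

def p_read_error(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Chance of at least one unrecoverable read error while reading
    bytes_read bytes, assuming independent errors at ure_per_bit."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

def p_rebuild_hits_error(surviving_drives: int, drive_tb: float,
                         ure_per_bit: float = 1e-14) -> float:
    """A rebuild has to read every surviving drive end to end."""
    return p_read_error(surviving_drives * drive_tb * 1e12, ure_per_bit)

if __name__ == "__main__":
    # Hypothetical RAID set: 7 x 1TB drives, one failed, 6 left to read.
    print(f"{p_rebuild_hits_error(6, 1.0):.1%}")
```

Under those assumptions the 6-drive rebuild comes out at roughly 38%, while reading a single 1TB drive end to end is under 8%. That is the gap between a desktop with a couple of drives and a rebuild in a real array.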

IMHO this is far too aggressive. Some of the newer, more intelligent RAID controllers will take the first offending drive offline but not disable it entirely. Instead, the drive is examined for the root cause of the problem and is either repaired and put back into service, or used in conjunction with the remaining drives to perform a more robust rebuild operation. This assumes, of course, that the drive is accessible. If the drive is dead, then you are back to the problem of a data error on a second drive causing problems in the rebuild. Even so, I think that the rebuild should run to completion as it normally would and report enough information back to the host through sense data that the data management people can determine the extent of the problem: which files and/or metadata are affected, and so on.
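
A toy sketch of what I mean by completing the rebuild and reporting the damage, rather than giving up. This is not how any particular controller works. It assumes a simple XOR-parity (RAID 5 style) layout, models an unreadable block as None, and all names are hypothetical.

```python
from functools import reduce
from typing import List, Optional, Tuple

Block = Optional[bytes]  # None stands in for a block that returned a read error

def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def rebuild(survivors: List[List[Block]]) -> Tuple[List[Block], List[int]]:
    """Reconstruct the missing drive stripe by stripe.

    survivors[d][s] is the block on surviving drive d at stripe s.
    Returns the rebuilt blocks plus the list of stripes that could not
    be reconstructed, so the host can map them to files or metadata.
    """
    rebuilt: List[Block] = []
    damaged: List[int] = []
    for stripe, blocks in enumerate(zip(*survivors)):
        if any(b is None for b in blocks):
            # Second read error: note the stripe and keep going.
            damaged.append(stripe)
            rebuilt.append(None)
        else:
            rebuilt.append(xor_blocks(list(blocks)))
    return rebuilt, damaged
```

The point is the second return value: one bad stripe costs you that stripe, not the whole RAID set, and the damaged list is the sort of thing that could be surfaced to the host.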

This stuff is not easy and I agree that the higher capacity drives are increasing the exposure to data errors. I think that we need to be engineering data storage systems that assume data errors are a normal event rather than an anomaly and deal with them more appropriately than we have been.