What your disk drive isn't telling you

Written by Robin Harris, Contributor

because it's clueless.

You know you've got a problem when your disk drive goes ka-thunk. A study of 1.53 million disks finds that data errors are much more common than outright failures. You just don't know it. What's worse, neither do the people who design PC file systems.

A different kind of latency

Unreported or latent disk errors are real. Storage array vendors have stopped recommending RAID 5 with SATA drives because of the very good chance you won't get your data back.

But until Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy and Jiri Schindler analyzed the error logs of over 50,000 systems, no one had done a large-scale study of the problem. Lakshmi was at the University of Wisconsin-Madison, while the other three work at the large NAS vendor Network Appliance. They published "An Analysis of Latent Sector Errors in Disk Drives" last year.

Disks have a lot of error conditions, most of them - thankfully - transient. This study focused on Latent Sector Errors (LSE) which are defined as Bad News:

This error occurs when a particular disk sector cannot be read or written, or when there is an uncorrectable ECC error. Any data previously stored in the sector is lost.

[emphasis added]

Results

Since most ZDNet readers are running PATA or SATA drives, I'll focus on the team's results for what they call - in apparent deference to NetApp marketing - nearline drives, as opposed to the costly enterprise drives used in high-end arrays. For you and me, nearline or consumer drives are the online drives that we rely on every day.

8.5% of all the consumer drives developed LSE. That's the good news.

The team found several factors that contribute to LSE.

  • Size matters. As disk size increases, so does the fraction of disks with LSE.
  • Age matters. For some consumer disk models, 20% of the disks had LSE after 24 months. LSE rates climbed with age.
  • Vendor matters. They also found that some vendors had much higher LSE rates than others. Due to the industry omerta they won't rat out the offenders, but you can bet NetApp isn't buying their disks.
  • Errors matter. A drive that develops one error is much more likely to develop a second.

Consumer/SOHO users with large, cheap, old disks will see LSE. Another reason desktop RAID is a bad idea.

Implications for PC file systems

File systems rely on disk-based data structures to keep track of your stuff. One of the key findings of the team is that disk errors tend to congregate near each other, like congressmen and lobbyists.

After the first LSE, a second LSE is also much more likely. LSE isn't random in time or space.

Therefore, file systems that replicate critical data across the disk are much less likely to lose your data than those that, like the Linux ReiserFS, place critical structures in one contiguous area. Perhaps someone with specific knowledge of how NTFS and HFS+ lay out their data structures could comment.
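To see why spreading the copies helps, here is a toy model, not anything from the study. It assumes a single contiguous run of bad sectors lands at a uniformly random spot on the disk, and counts how often that run damages every copy of a critical structure. The disk size, run length, copy size and offsets are made-up numbers chosen only for illustration.

```python
def overlaps(start, length, region_start, region_len):
    """True if the run [start, start+length) touches the region."""
    return start < region_start + region_len and region_start < start + length


def loss_fraction(disk_sectors, cluster_len, copy_starts, copy_len):
    """Fraction of possible bad-run positions that damage EVERY copy.

    Toy assumption: exactly one contiguous run of bad sectors, equally
    likely to start at any sector where it still fits on the disk.
    """
    positions = disk_sectors - cluster_len + 1
    lost = sum(
        1
        for start in range(positions)
        if all(overlaps(start, cluster_len, c, copy_len) for c in copy_starts)
    )
    return lost / positions


# Hypothetical numbers, for illustration only.
DISK = 10_000      # sectors on the toy disk
CLUSTER = 50       # length of the bad run
COPY = 100         # sectors per copy of the critical structure

adjacent = loss_fraction(DISK, CLUSTER, [0, 100], COPY)    # copies back to back
spread = loss_fraction(DISK, CLUSTER, [0, 5_000], COPY)    # copies far apart

print(f"copies side by side: {adjacent:.4%} of bad runs hit both")
print(f"copies far apart:    {spread:.4%} of bad runs hit both")
```

In this model a run of bad sectors can straddle two copies that sit next to each other, but it can never reach two copies that are farther apart than the run is long. Real disks can have more than one bad region, so treat this as a picture of the argument rather than a prediction.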

The Storage Bits take

We all like big cheap drives, but this study shows they come with some trade-offs. This data isn't causing me to give mine up.

What I am doing is backing up every night to a bootable external drive. If you aren't backing up now, I hope you'll start soon.

Update: if you are a home user, is there anything you should do differently? Yes.

  • Back up your data. Disks are amazingly reliable, but they do fail. Be prepared.
  • Replacing disks when they turn 3 looks like a good idea if unplanned downtime would cost you money. I have a backup computer system for that very reason. No computer = no income. So I take this stuff seriously.
  • Don't use desktop RAID 5. If a drive fails and you encounter an LSE on the surviving drives during the rebuild, you have to go to your backup anyway. You don't need the hassle.
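The RAID 5 point can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic, not a result from the paper. It assumes each surviving drive independently carries an unread LSE with probability p, and it borrows the 8.5% figure above as p purely for illustration. That figure is the share of consumer drives that developed LSE over the study, not the chance a given drive is hiding one on rebuild day, so read the output as a feel for how the risk grows with drive count, nothing more.

```python
def rebuild_hits_lse(p_lse, drives):
    """Chance that at least one surviving drive has an LSE during a rebuild.

    Assumes (for illustration) that each of the drives - 1 survivors
    independently has an LSE with probability p_lse.
    """
    survivors = drives - 1
    return 1 - (1 - p_lse) ** survivors


P_LSE = 0.085  # illustrative only: the 8.5% figure quoted in this article

for drives in (3, 4, 5, 8):
    risk = rebuild_hits_lse(P_LSE, drives)
    print(f"{drives}-drive RAID 5: {risk:.1%} chance the rebuild meets an LSE")
```

Every drive you add to the array is one more drive that has to read back perfectly during the rebuild, which is why the risk climbs with array size.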

I beat on my machine hard, using dozens of programs a week and creating thousands of files, so I use an OS X disk repair utility every couple of months to rebuild my directory. I'm amazed at how often that has solved problems that I never thought might be file system related. YMMV. End update.

Comments welcome, of course.
