What your disk drive isn't telling you


TOPICS: Storage, Hardware

because it's clueless.

You know you've got a problem when your disk drive goes ka-thunk. A study of 1.53 million disks finds that data errors are much more common than outright failures. You just don't know it. What's worse, neither do the people who design PC file systems.

A different kind of latency

Unreported or latent disk errors are real. Storage array vendors have stopped recommending RAID 5 with SATA drives because there is a very good chance you won't get your data back.

But until Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy and Jiri Schindler analyzed the error logs of over 50,000 systems, no one had done a large-scale study of the problem. Lakshmi was at the University of Wisconsin-Madison, while the other three work at the large NAS vendor Network Appliance. They published An Analysis of Latent Sector Errors in Disk Drives last year.

Disks have a lot of error conditions, most of them - thankfully - transient. This study focused on Latent Sector Errors (LSE) which are defined as Bad News:

This error occurs when a particular disk sector cannot be read or written, or when there is an uncorrectable ECC error. Any data previously stored in the sector is lost.

[emphasis added]

Results

Since most ZDNet readers are running PATA or SATA drives, I'll focus on the team's results for what they call - in apparent deference to NetApp marketing - nearline drives, as opposed to the costly enterprise drives used in high-end arrays. For you and me, nearline or consumer drives are the online drives we rely on every day.

8.5% of all the consumer drives developed LSE. That's the good news.

The team found several factors that contribute to LSE.

  • Size matters. As disk size increases, so does the fraction of disks with LSE.
  • Age matters. LSE rates climbed with age; some consumer models showed 20% of disks with LSE after 24 months.
  • Vendor matters. They also found that some vendors had much higher LSE than others. Due to the industry omerta they won't rat out the offenders, but you can bet NetApp isn't buying their disks.
  • Errors matter. A drive that develops one error is much more likely to develop a second.

Consumer/SOHO users with large, cheap, old disks will see LSE. Another reason Desktop RAID is a bad idea.

Implications for PC file systems

File systems rely on disk-based data structures to keep track of your stuff. One of the team's key findings is that disk errors tend to congregate near each other, like congressmen and lobbyists.

After the first LSE, a second LSE is also much more likely. LSE isn't random in time or space.

Therefore, file systems that replicate critical data across the disk are much less likely to lose your data than those that, like the Linux ReiserFS, place critical structures in one contiguous area. Perhaps someone with specific knowledge of how NTFS and HFS+ lay out their data structures could comment.
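
The layout point can be illustrated with a toy model - the block counts and cluster size below are made up for illustration, not taken from any real file system. If LSEs arrive in contiguous runs, as the study found, metadata stored in one run is far more exposed than metadata mirrored at a distant spot on the platter:

```python
DISK_BLOCKS = 100_000      # toy disk; sizes are illustrative only
METADATA_BLOCKS = 8        # hypothetical count of critical metadata blocks

def survives(layout, bad_blocks):
    """Metadata survives if every group of copies has at least one intact copy."""
    return all(any(b not in bad_blocks for b in copies) for copies in layout)

# ReiserFS-style: critical structures in one contiguous area, no copies.
contiguous = [(i,) for i in range(METADATA_BLOCKS)]

# Replicated: each critical block also mirrored at the far end of the disk.
replicated = [(i, DISK_BLOCKS - 1 - i) for i in range(METADATA_BLOCKS)]

# One clustered run of LSEs at the front of the disk - the pattern the
# study says is typical - takes out the contiguous layout entirely,
# while the spread-out mirrors ride it out.
cluster = set(range(0, 50))
print(survives(contiguous, cluster))   # False
print(survives(replicated, cluster))   # True
```

A random scattering of single-block errors would treat both layouts about equally; it's the clustering that makes contiguous metadata a single point of failure.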

The Storage Bits take

We all like big cheap drives, but this study shows they come with some trade-offs. This data isn't causing me to give mine up.

What I am doing is backing up every night to a bootable external drive. If you aren't backing up now, I hope you'll start soon.

Update: if you are a home user, is there anything you should do differently? Yes.

  • Back up your data. Disks are amazingly reliable, but they do fail. Be prepared.
  • Replacing disks when they turn 3 looks like a good idea if unplanned downtime would cost you money. I have a backup computer system for that very reason. No computer = no income. So I take this stuff seriously.
  • Don't use desktop RAID 5. If a drive fails and you encounter an LSE on one of the surviving drives during the rebuild, you have to go to your backup anyway. You don't need the hassle.
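
The RAID 5 point is easy to put numbers on. Consumer SATA drives are typically specced at one unrecoverable read error per 10^14 bits, and a rebuild must read every surviving drive end to end. A back-of-the-envelope sketch, assuming independent errors (which, given the clustering this study found, is optimistic):

```python
def rebuild_failure_prob(surviving_drives, drive_tb, ure_per_bit=1e-14):
    """Chance of hitting at least one unrecoverable read error while
    reading every bit of every surviving drive during a RAID 5 rebuild.
    1e-14 is a typical consumer SATA spec-sheet figure."""
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    return 1 - (1 - ure_per_bit) ** bits_read

# A 4-drive RAID 5 of 1 TB disks loses one drive; the rebuild reads 3 TB:
print(f"{rebuild_failure_prob(3, 1.0):.0%} chance the rebuild hits an error")
```

That works out to roughly a one-in-five chance of a failed rebuild, and it only gets worse as drives get bigger.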

I beat on my machine hard, using dozens of programs a week and creating thousands of files, so I use an OS X disk repair utility every couple of months to rebuild my directory. I'm amazed at how often that has solved problems that I never thought might be file system related. YMMV. End update.

Comments welcome, of course.

  • No safe haven

    It seems that our data simply have no safe place to reside. Hard disks are unsafe, DVDs are unsafe (disc rot)...

    Come on, industry, give us a safe, reliable and enduring storage medium. How hard can it be?

    Greetz, Pjotr.
    • Storage is hard

      Unlike the transient nature of network packets and CPU register contents, we need
      our data for years. Stored data, no matter what the medium, is always at risk.

      Paper burns, stone can decay, clay breaks and electronic media suffer many ills,
      including obsolescence.

      The good news: things are getting better. The bad news: not fast enough.

      R Harris
    • It can be very hard

      The Second Law of Thermodynamics says that everything has a natural tendency to decay from an orderly state to a chaotic one. There's no way around it. As Mr. Scott put it, "I canna change the laws of physics." The best that can be done is to slow the process by recognizing and anticipating likely sources of decay and preventing or compensating for them.
  • Broken main page link, and thanks

    Robin - the ZD News page has your article linked as:
    http://news.zdnet.com/PC%20file%20systems. (*with the trailing . )

    so, I had to do a Search in the blogs section to find your interesting and somewhat troubling article
  • SCSI vs. SATA...

    Are there any numbers comparing these two types of drives?

    I noticed RAID 5 isn't recommended for SATA. What about SCSI?

    I've always been of the opinion that SCSI drives are built better (per Seagate's web site).
    • SAS in place of SCSI

      Serial Attached SCSI is the new SCSI.

      Still, SAS and SCSI drives seem to have a better track record, in my opinion. They are smaller; the downside is that they spin much faster, which adds mechanical risk.

      Still, I think overall, SCSI-based drives seem to be built better. I think SSDs will have a shot at displacing hard drives when capacity gets up to par. The advantage of SSDs is that they have no moving parts to fail. There are several downsides though.
      • To compare SCSI vs SATA check this article

        <a href="http://storagemojo.com/2008/02/18/latent-sector-errors-in-disk-drives/" target="_blank">here</a>.

        I'm not sure that "build" quality is better - maybe it's the ECC - but SCSI/SAS
        drives do show some superior stats in the study.
        R Harris
        • There is a reason SCSI are Enterprise class drives

          The extra cost is reflected in the superior stats.
          Many people (gamers and the like excluded) will not spend hundreds of dollars on 10K or 15K RPM drives when slower and [u]cheaper[/u] drives will do, and they can get more capacity for less money.
        • Seagate goes in depth...

          about the superior quality and construction of SCSI vs. SATA.
      • Wrong on the SSD drive failures

        High density flash parts degrade similarly to HDDs over time. All SSDs have spare sectors for the same reason HDDs have them. You still need to scan the drive and make sure it can read all blocks, so data gets moved to spare sectors before an ECC error becomes uncorrectable.
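
        That scan can be as simple as reading the device from end to end. A minimal sketch of the idea - the path and block size are placeholders, and on real hardware you'd point it at the raw device and run with appropriate privileges:

```python
def scrub(path, block_size=1 << 20):
    """Read a device (or file) end to end. Each read forces the drive's own
    ECC to check every sector it covers, giving the firmware a chance to
    remap marginal ones. Returns byte offsets where a read failed outright."""
    bad_offsets = []
    with open(path, "rb", buffering=0) as f:
        offset = 0
        while True:
            try:
                chunk = f.read(block_size)
            except OSError:
                bad_offsets.append(offset)   # unreadable region
                offset += block_size
                f.seek(offset)               # skip past it and keep going
                continue
            if not chunk:
                break
            offset += len(chunk)
    return bad_offsets
```

        For example, scrub("/dev/sdb") on a Linux box: an empty list means every block read back, which is exactly the periodic check the comment above describes.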
  • RE: What your disk drive isn't telling you

    So is there any way for the average home user to detect "Latent Sector Errors" short of a crash?
    • The drive handles failed writes -

      by automatically remapping the write to a spare block. You won't see it and the
      drive won't report it. Failed reads will look like "file not found" and the like - unless
      the failure is in a directory block.

      Then you'll be glad you back up. I updated the post in response to your good question.

      R Harris
  • Doesn't pass the smell test

    If all these errors are occurring where are the tangible
    results (I can't open this document anymore, my computer
    won't boot anymore), etc.

    The fact is, error correcting algorithms and checksums are
    designed to handle this kind of stuff. It's not like the
    physicists who developed these algorithms don't know
    about error and how to correct for it.

    What these folks need to do is read/write random bits
    across a storage system for a month using the standard
    OS read/write routines and see how many files have
    actually become corrupted.

    Of course, when you do that, you'll find the number
    vanishingly small. Why? Because the software performs
    error checking on read/write operations specifically to
    catch any failures in the media.
    • *urrrggghh* I..have...to....agree (*barf*) with frgough

      The simple fact is that these are not translating into real-world examples of data loss. If they were, you would be seeing a revolt against HD manufacturers.
    • You need to look at the statistics again

      It said that 8.5% of drives get these errors. That's not very much. Plus, the errors could show up anywhere, and drives are getting large enough these days that we don't fill them all the way up all that often. Plus, even if we do, we don't tend to <i>use</i> all the data very frequently.

      How much stuff do you have sitting around on your HD that you haven't touched in months or years and don't really plan to ever touch again? I recently cleaned out my hard drive and came away with an extra 50 GB of space, from removing programs and files that I had no use for anymore! If a user has a disc error somewhere amidst all that clutter, they might never realize it. That doesn't mean that it doesn't happen, or that it's not a threat to your data.
      • But we do fill them up.

        10 years ago I ripped a portion of my music collection at 128K; when drives got larger I went to 320K. Now that 500GB is the norm for drives in the $100 range, my whole collection is going lossless (and I hope to not do this again). This is for convenience, and so I can select what I want to listen to through my home entertainment system instead of going to the basement to find a CD. I also keep 2 copies plus a RAID 1, because although I own all the music I put on my drives, it represents a huge investment in time.

        As media servers take hold and electronic distribution of HD video becomes the norm, we are just buying bits instead of disks. It will then become even more important that these bits are secure as they will then represent money, not just time.

        I, for one, am starting to investigate enterprise class storage. I just hope that 'enterprise class' actually means something and is not the smokescreen and marketing garbage that "Premium Diesel" is.

        I can see it now: soon we'll be able to buy a rider on our homeowners insurance to cover electronically downloaded music and movies.
  • Well, duh

    What's the point? It's like saying that the city water pipes going to your home are old and, at any second, could rupture in any of a million places. Of course they could, and every once in a great while a break occurs. Then it gets fixed, and life goes on.

    I've seen data errors from hard drives, but they are pretty few and far between. If, however, the point of the story is to back up your important data, then with that I'd completely agree.
    • The implications are important if you have a lot of drives

      which is not where most consumers are. If you have fewer than a dozen independent
      drives, your experience is mostly a matter of luck. If you have a parity RAID system,
      then you'd like the rebuild algorithm to look at how old your drives are and how
      many errors they've had when figuring out how fast to rebuild.

      For most consumers other than video, photo and music collectors, the message is:
      back up! Unlike water, most of us can't replace our family photos and videos.
      R Harris
  • ZFS?

    From what I understand, ZFS anticipates these types of failures. In any type of redundant configuration (RAIDz, mirroring), it is designed to constantly monitor and relocate sectors with questionable integrity.
    • Yes, ZFS...

      Correct, ZFS checksums each block of storage and uses that to detect these types of data errors. It can detect the data errors regardless of redundancy, and in the worst case will provide the requestor of the data with an error message rather than handing it corrupted data.

      In a redundant setting (Raid-Z, mirrors, etc), ZFS can use the redundant information to correct these errors when they are detected. It is a vastly superior file system for handling data with any degree of importance.
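
      A toy sketch of that mechanism, with dicts standing in for disks - this is the idea only, not ZFS's actual on-disk format or default checksum:

```python
import hashlib

def write_block(store, addr, data):
    """Write data and return its checksum. ZFS keeps the checksum in the
    parent block pointer, away from the data it protects; here the caller
    just holds on to it."""
    store[addr] = data
    return hashlib.sha256(data).digest()

def read_block(store, addr, checksum, mirror=None):
    data = store[addr]
    if hashlib.sha256(data).digest() == checksum:
        return data
    if mirror is not None:                    # redundancy: try the good copy
        copy = mirror[addr]
        if hashlib.sha256(copy).digest() == checksum:
            store[addr] = copy                # self-heal the corrupted copy
            return copy
    raise IOError("checksum mismatch")        # an error, never corrupt data

disk, twin = {}, {}
cs = write_block(disk, 0, b"important bits")
write_block(twin, 0, b"important bits")
disk[0] = b"important bitz"                   # silent corruption
print(read_block(disk, 0, cs, mirror=twin))   # healed from the mirror
```

      Without the mirror, the same read raises an error instead of handing back the corrupted bytes - exactly the worst-case behavior described above.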

      Currently available on Solaris 10 and Mac OS X Leopard (though the Leopard version is read-only unless you update from http://trac.macosforge.org/projects/zfs/wiki/ to get read-write support...)

      To the best of my knowledge, there is currently nothing like this for a M$ platform.

      Yet another way in which using Windoze helps to jeopardize your important data...