Can vendor MTBFs be trusted?

Summary: Of course not, silly rabbit. But it isn't all their fault: for most users MTBFs are meaningless gibberish. Why rain on the parade?

TOPICS: Storage, Hardware

Since I started following storage research I've noticed an interesting fact: independent reviews find that vendor MTBF numbers are almost always too optimistic. Examples:

  • 100,000 drive study. "While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type, are by up to a factor of 15 higher than datasheet AFRs."
  • Google's disk failure experience. Disk MTBF numbers significantly understate failure rates. If you plan on AFRs that are 50% higher than MTBFs suggest, you’ll be better prepared.
  • The RAID 5 problem. RAID vendors blithely assumed that disk failures are independent, maximizing their mean-time-to-data-loss number, when research has found - as any sysadmin could attest - that they aren't.
  • DRAM error rates. Research found that DIMM error rates are hundreds to thousands of times higher than expected.
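
To see what a datasheet MTBF implies as an annual failure rate, here is a minimal Python sketch. It assumes a constant failure rate (the exponential model that usually sits behind datasheet figures) and 24x7 operation, which is 8,760 power-on hours per year. The function names are made up for illustration and don't come from any of the studies above.

```python
import math

HOURS_PER_YEAR = 8760  # assumption: drive is powered on 24x7


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Annual failure rate implied by an MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse conversion: the MTBF that an observed annual failure rate implies."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


# Datasheet-style MTBFs, converted to the AFR they imply.
for mtbf in (1_000_000, 1_500_000):
    print(f"MTBF {mtbf:>9,} h -> AFR {afr_from_mtbf(mtbf):.2%}")

# The worst observed replacement rate quoted above, converted back to hours.
print(f"13.5% per year -> MTBF of about {mtbf_from_afr(0.135):,.0f} h")
```

Under those assumptions a 1,000,000-hour MTBF works out to roughly 0.87% per year, while a 13.5% annual replacement rate corresponds to an MTBF of only about 60,000 hours.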

This isn't as bad as the cigarette industry lying about smoking and cancer for decades, but informed consumers have to wonder why vendors don't come clean. Greed? Ignorance? Sloth? Fear? Or something else?

What's going on? Several issues lead to vendor misinformation:

  • Competitive pressure. If the competitor says X then match them or lose.
  • Optimistic assumptions. RAID vendors assume that hard drive failures are independent events, despite knowing that they aren't, and don't factor in drive read error rates, giving an optimistic gloss on time to data loss.
  • Accelerated life testing. Typically components are put through environmental hell testing - high temps, voltage fluctuations, 7x24 activity - that is supposed to simulate the aging process. But many aging processes, such as lubricant aging and migration in disks, aren't easily simulated.
  • System issues. Drive vendors report that some 50-60% of "failed" drives have no trouble found in testing. Is the problem poor vendor test coverage, flaky system design, bad drivers or buggy firmware? Or all of the above? Component vendors don't control their environment, but that's what standards - like SATA and SAS - are supposed to fix.
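
To make the "optimistic assumptions" point concrete, here is a rough Python sketch of the textbook mean-time-to-data-loss estimate for a single-parity array, next to the chance of hitting an unrecoverable read error during a rebuild. The formula is the classic independent-failures approximation, not any particular vendor's method, and every number below (8 drives, 1,000,000-hour MTBF, 24-hour rebuild, 2 TB drives, one bad bit per 10^14 read) is a hypothetical input chosen for illustration.

```python
import math


def naive_raid5_mttdl_hours(n_drives: int, mtbf_hours: float, mttr_hours: float) -> float:
    """Textbook RAID 5 MTTDL: assumes drive failures are independent events."""
    return mtbf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours)


def rebuild_read_error_probability(surviving_drives: int, drive_bytes: float,
                                   ure_per_bit: float) -> float:
    """Chance of at least one unrecoverable read error while reading every
    surviving drive end to end during a rebuild."""
    bits_read = surviving_drives * drive_bytes * 8
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Hypothetical array: 8 drives, 1,000,000 h MTBF each, 24 h to rebuild.
mttdl = naive_raid5_mttdl_hours(8, 1_000_000, 24)
print(f"Naive MTTDL: {mttdl / 8760:,.0f} years")

# Hypothetical rebuild: 7 surviving 2 TB drives, 1 bad bit per 1e14 bits read.
p = rebuild_read_error_probability(7, 2e12, 1e-14)
print(f"Chance of a read error during rebuild: {p:.0%}")
```

With these made-up inputs the independence assumption yields an MTTDL of roughly 85,000 years, while the read-error term alone gives about a two-in-three chance of trouble on any given rebuild. That gap is the optimistic gloss.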

Why should you care? The problem with all these statistics is that they are almost meaningless to most users. Why? Because you aren't buying hundreds or thousands of units.

You just buy 1 or a handful. If they work, you're happy. If they don't, you aren't.

The fact that 2,000 other people are thrilled means nothing to YOU when your new SSD goes belly up. Your failure rate is 100% and your MTBF is 2 days.
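
The arithmetic behind that is simple enough to sketch in Python. This assumes independent failures and a flat annual failure rate; the 3% AFR is a hypothetical figure, not one taken from any datasheet or study.

```python
def p_at_least_one_failure(afr: float, units: int) -> float:
    """Probability that at least one of `units` devices fails within a year,
    assuming independent failures at a constant annual failure rate."""
    return 1.0 - (1.0 - afr) ** units


def expected_failures(afr: float, units: int) -> float:
    """Average number of failures per year across the whole population."""
    return afr * units


AFR = 0.03  # hypothetical 3% annual failure rate

for units in (1, 5, 1000):
    print(f"{units:>5} units: "
          f"P(at least one failure) = {p_at_least_one_failure(AFR, units):.1%}, "
          f"expected failures = {expected_failures(AFR, units):.2f}")
```

A fleet of 1,000 can plan around about 30 failures a year. A buyer of one unit sees either zero failures or one, and no average describes that experience.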

In mature markets, like disk drives, most vendors are similar because they have to be: OEM buyers know the real numbers and rate vendors accordingly. In new markets, like SSDs, the numbers are all over the map and no one's talking.

A more reliable device improves your chances of a happy long-term relationship - but doesn't guarantee it. Your mileage will vary.

The Storage Bits take: Losing a server to power fry or fan melt isn't the end of the world. Losing your data is a lot worse.

Storage vendor MTBFs and MTTDL (mean time to data loss) numbers are meaningless for small installations. Nor will any storage vendor compensate you for the value of lost data. That's how much they trust their numbers.

When it comes to your data, put your faith elsewhere. As the redoubtable David S. H. Rosenthal - former Sun Distinguished Engineer and employee #4 at Nvidia - puts it, only 3 things will improve your data protection chances:

  1. The more copies, the safer.
  2. The more independent the copies, the safer.
  3. The more frequently the copies are checked for corruption, the safer.
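
The third rule is the easiest to automate. Below is a minimal Python sketch that hashes each copy of a file with SHA-256 and flags any copy that disagrees with the majority. The function names and the majority-vote rule are assumptions made for this example, not Rosenthal's method, and a real setup would also keep the known-good digests somewhere independent of the copies.

```python
import hashlib
from collections import Counter


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspect_copies(copies):
    """Return the paths whose digest differs from the most common digest.

    With only two copies that disagree, there is no real majority and the
    choice of "good" copy is arbitrary, which is one reason to keep more.
    """
    digests = {str(path): sha256_of(path) for path in copies}
    if not digests:
        return []
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    return [path for path, digest in digests.items() if digest != majority_digest]
```

Calling `find_suspect_copies` on a list of paths to the same file stored in different places returns an empty list when all copies agree. Run it on a schedule: corruption you never check for is corruption you discover only when you need the data.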

Remember, the Universe hates your data. Be safe out there.

Comments welcome, of course.

Talkback

5 comments
  • RE: Can vendor MTBFs be trusted?

    No, vendor MTBFs cannot be trusted, and the problem is much worse than it used to be. Working for a fault-tolerant vendor 20 years ago, we tested drives fairly extensively, and at the time, if one vendor's MTBF number was better than another drive's number, one could truly assume that the drive was more reliable. Now we have all kinds of crap from the storage vendors, who use differing techniques and measurements for products, choosing the best number they can find instead of providing comparable numbers across their product line.

    I tell people if you want to be reasonably sure your data is safe, then print it with the least fading ink you can on high quality paper and put it in a vault. No storage vendor provides even close to the reliability that they did 15 years ago.
    oldsysprog
    • RE: Can vendor MTBFs be trusted?

      @oldsysprog
      I'm not sure how the problem is worse: when I started in the business a competitive disk drive MTBF was 25,000 hours - about 3 years - at $20/MB.

      Different is the word I'd use.
      R Harris
      • RE: Can vendor MTBFs be trusted?

        @R Harris

        The thing was that 25 years ago, you might have a low MTBF but it was pretty accurate. Nowadays you have a high MTBF, but in my testing for clients I have found that in many cases the actual MTBF is less than it was 25 years ago. Some of the worst cases are the disk drives with "smart controllers" and the SSDs; both of them have numbers that are just sick.
        oldsysprog
  • Also... use multiple mechanisms at the same time.

    For example, distributing your files and adding PAR files improves the chances of recovering all your data even if a significant chunk of them go.

    Unpowered offline copies combined with online active copies improve survival by reducing the chances of surge failures.

    And so on and so on.
    TheWerewolf
  • Disregard MTBF's completely.

    Buy a perpendicular recording technology drive with FDB (fluid dynamic bearings). Personally, I like to replace 3.5" SATA drives with 2.5" SATA drives in desktops. The connectors are the same and I think the physics is better with lower power consumption. Ever feel the large IC on the bottom of a 3.5" drive when the drive is operating? It's the driver IC for the servos and it gets very hot. Less current = more reliability. Seagate Momentus drives also seem like a good alternative, and upgrading to 7200 RPM does not seem to detract from longevity, especially with FDB.
    Joe.Smetona