Can vendor MTBFs be trusted?

Of course not, silly rabbit. But it isn't all their fault: for most users MTBFs are meaningless gibberish. Why rain on the parade?
Written by Robin Harris, Contributor

Since I started following storage research I've noticed an interesting fact: independent reviews find that vendor MTBF numbers are almost always too optimistic. Examples:

  • 100,000 drive study. "While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type, are by up to a factor of 15 higher than datasheet AFRs."
  • Google's disk failure experience. Disk MTBF numbers significantly understate failure rates. If you plan on AFRs that are 50% higher than MTBFs suggest, you’ll be better prepared.
  • The RAID 5 problem. RAID vendors blithely assumed that disk failures are independent, maximizing their mean-time-to-data-loss number, when research has found - as any sysadmin could attest - that they aren't.
  • DRAM error rates. Research found that DIMM error rates are hundreds to thousands of times higher than expected.
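
For readers who want to translate between the two kinds of numbers, here is a minimal sketch of the usual MTBF-to-AFR conversion. It assumes 24x7 operation and a constant failure rate, and the MTBF values of 1,000,000 and 1,500,000 hours are assumptions chosen because they reproduce the datasheet AFR range quoted above. The function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours, assuming 24x7 operation


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Approximate annualized failure rate implied by an MTBF.

    Uses the linear approximation AFR ~= hours_per_year / MTBF, which is
    reasonable only when the resulting rate is small.
    """
    return HOURS_PER_YEAR / mtbf_hours


def mtbf_from_afr(afr: float) -> float:
    """Inverse of afr_from_mtbf: the MTBF an observed annual rate would imply."""
    return HOURS_PER_YEAR / afr


# Assumed datasheet MTBFs (hypothetical inputs, not taken from any one vendor).
for mtbf in (1_000_000, 1_500_000):
    print(f"{mtbf:>9,} h MTBF -> {afr_from_mtbf(mtbf):.2%} AFR")

# The worst observed replacement rate quoted above, expressed as an MTBF.
print(f"13.5% observed rate -> about {mtbf_from_afr(0.135):,.0f} h MTBF")
```

Under these assumptions a 13.5% annual replacement rate corresponds to an MTBF of tens of thousands of hours, not a million.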

This isn't as bad as the cigarette industry lying about smoking and cancer for decades, but informed consumers have to wonder why vendors don't come clean. Greed? Ignorance? Sloth? Fear? Or something else?

What's going on? Several issues lead to vendor misinformation:

  • Competitive pressure. If the competitor says X, then match them or lose.
  • Optimistic assumptions. RAID vendors assume that hard drive failures are independent events, despite knowing that they aren't, and don't factor in drive read error rates, giving an optimistic gloss on time to data loss.
  • Accelerated life testing. Typically components are put through environmental hell testing - high temps, voltage fluctuations, 7x24 activity - that is supposed to simulate the aging process. But many aging processes, such as lubricant aging and migration in disks, aren't easily simulated.
  • System issues. Drive vendors report that some 50-60% of "failed" drives have no trouble found in testing. Is the problem poor vendor test coverage, flaky system design, bad drivers or buggy firmware? Or all of the above? Component vendors don't control their environment, but that's what standards - like SATA and SAS - are supposed to fix.
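
To see how much the optimistic assumptions matter, here is a rough sketch of the textbook RAID 5 arithmetic. It is not any vendor's actual model. The first function is the classic mean-time-to-data-loss estimate, which holds only if drive failures are independent. The second estimates the chance of hitting an unrecoverable read error during a rebuild. The array size, drive capacity, repair time and the error rate of 1 per 10^14 bits are all assumed values for illustration.

```python
import math


def mttdl_raid5_hours(n_drives: int, mtbf_hours: float, mttr_hours: float) -> float:
    """Textbook RAID 5 MTTDL: data is lost when a second drive fails during repair.

    Valid only under the assumption that drive failures are independent.
    """
    return mtbf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours)


def p_read_error_during_rebuild(n_drives: int, drive_bytes: float, ure_per_bit: float) -> float:
    """Chance of at least one unrecoverable read error while reading every surviving drive."""
    bits_read = (n_drives - 1) * drive_bytes * 8
    # log1p/expm1 keep precision when ure_per_bit is tiny.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Assumed example: 8 drives of 2 TB, 1,000,000 h MTBF, 24 h repair window.
mttdl = mttdl_raid5_hours(8, 1_000_000, 24)
print(f"MTTDL assuming independent failures: {mttdl / 8760:,.0f} years")

p_ure = p_read_error_during_rebuild(8, 2e12, 1e-14)
print(f"Chance of a read error during rebuild: {p_ure:.0%}")
```

With these inputs the independence-based figure runs to tens of thousands of years, while the rebuild is more likely than not to trip over a read error. That gap is the optimistic gloss.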

Why should you care? The problem with all these statistics is that they are almost meaningless to most users. Why? Because you aren't buying hundreds or thousands of units.

You just buy 1 or a handful. If they work, you're happy. If they don't, you aren't.

The fact that 2,000 other people are thrilled means nothing to YOU when your new SSD goes belly up. Your failure rate is 100% and your MTBF is 2 days.
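
The arithmetic behind that is simple enough to sketch. The 3% annual failure rate below is a made-up figure for illustration, and the calculation assumes independent failures, which, as noted earlier, real devices don't always honor.

```python
def p_any_failure(n_devices: int, afr: float) -> float:
    """Chance that at least one of n devices fails within a year (independence assumed)."""
    return 1.0 - (1.0 - afr) ** n_devices


ASSUMED_AFR = 0.03  # hypothetical 3% annual failure rate, not a measured figure

for n in (1, 4, 1000):
    chance = p_any_failure(n, ASSUMED_AFR)
    print(f"{n:>5} device(s): {chance:.1%} chance of at least one failure in a year")

# For single-device buyers the outcome is all or nothing.
buyers = 2000
unlucky = buyers * ASSUMED_AFR
print(f"Of {buyers} single-device buyers, about {unlucky:.0f} see 100% failure; the rest see 0%.")
```

A fleet operator can plan around the rate. A buyer with one device only finds out which group they landed in.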

In mature markets, like disk drives, most vendors are similar because they have to be: OEM buyers know the real numbers and rate vendors accordingly. In new markets, like SSDs, the numbers are all over the map and no one's talking.

A more reliable device improves your chances of a happy long-term relationship - but doesn't guarantee it. Your mileage will vary.

The Storage Bits take

Losing a server to power fry or fan melt isn't the end of the world. Losing your data is a lot worse.

Storage vendor MTBFs and MTTDL (mean time to data loss) numbers are meaningless for small installations. Nor will any storage vendor compensate you for the value of lost data. That's how much they trust their numbers.

When it comes to your data, put your faith elsewhere. As the redoubtable David S. H. Rosenthal - former Sun Distinguished Engineer and employee #4 at Nvidia - puts it, only 3 things will improve your data protection chances:

  1. The more copies, the safer.
  2. The more independent the copies, the safer.
  3. The more frequently the copies are checked for corruption, the safer.
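
Rule 3 is the one most people skip, so here is a minimal sketch of what checking copies could look like. It hashes each copy with SHA-256 and flags any copy that disagrees with the majority. The majority vote and the function names are my own assumptions for illustration, not Rosenthal's method, and the file names in the demo are hypothetical.

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspect_copies(paths):
    """Return the copies whose digest disagrees with the most common digest.

    With only two copies that disagree there is no majority, so the first
    copy wins the tie; three or more copies are needed for a meaningful vote.
    """
    digests = {p: sha256_of(p) for p in paths}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return [p for p, d in digests.items() if d != majority]


# Demo with throwaway files standing in for three independent copies.
with tempfile.TemporaryDirectory() as tmp:
    copies = [Path(tmp) / name for name in ("local.bin", "nas.bin", "offsite.bin")]
    for copy in copies:
        copy.write_bytes(b"family photos")
    copies[1].write_bytes(b"family phot0s")  # simulate silent corruption
    print([p.name for p in find_suspect_copies(copies)])
```

The demo flags nas.bin. Run on a schedule against real copies, a check like this catches silent corruption while a good copy still exists to repair from.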

Remember, the Universe hates your data. Be safe out there.

Comments welcome, of course.
