Can vendor MTBFs be trusted?

By | September 7, 2011, 6:49am PDT

Summary: Of course not, silly rabbit. But it isn’t all their fault: for most users MTBFs are meaningless gibberish. Why rain on the parade?

Since I started following storage research I’ve noticed an interesting fact: independent reviews find that vendor MTBF numbers are almost always too optimistic. Examples:

  • 100,000 drive study. “While the datasheet AFRs are between 0.58% and 0.88%, the observed ARRs range from 0.5% to as high as 13.5%. That is, the observed ARRs by dataset and type, are by up to a factor of 15 higher than datasheet AFRs.”
  • Google’s disk failure experience. Disk MTBF numbers significantly understate failure rates. If you plan on AFRs that are 50% higher than MTBFs suggest, you’ll be better prepared.
  • The RAID 5 problem. RAID vendors blithely assumed that disk failures are independent, maximizing their mean-time-to-data-loss number, when research has found - as any sysadmin could attest - that they aren’t.
  • DRAM error rates. Research found that DIMM error rates are hundreds to thousands of times higher than expected.
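
For readers who want to check how a datasheet MTBF turns into the AFR figures quoted above, here is a minimal sketch. It assumes a constant failure rate and a drive that is powered on all year. The MTBF values of 1,000,000 and 1,500,000 hours are illustrative assumptions, picked because they land on the quoted 0.58% to 0.88% datasheet range; the function name is made up for this example.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 power-on hours, assuming 24x7 operation


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Approximate annualized failure rate implied by a datasheet MTBF.

    Assumes a constant failure rate, so AFR ~= hours per year / MTBF.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    for mtbf in (1_000_000, 1_500_000):  # assumed datasheet values
        print(f"MTBF {mtbf:>9,} h -> AFR {afr_from_mtbf(mtbf):.2%}")
    # Run the same arithmetic backwards for the worst observed rate (13.5%).
    print(f"13.5% per year -> MTBF of about {HOURS_PER_YEAR / 0.135:,.0f} h")
```

Running the arithmetic backwards is the useful part: it shows what MTBF a vendor would have to print to match what was actually observed in the field.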

This isn’t as bad as the cigarette industry lying about smoking and cancer for decades, but informed consumers have to wonder why vendors don’t come clean. Greed? Ignorance? Sloth? Fear? Or something else?

What’s going on?
Several issues lead to vendor misinformation:

  • Competitive pressure. If the competitor says X then match them or lose.
  • Optimistic assumptions. RAID vendors assume that hard drive failures are independent events, despite knowing that they aren’t, and don’t factor in drive read error rates, giving an optimistic gloss on time to data loss.
  • Accelerated life testing. Typically components are put through environmental hell testing - high temps, voltage fluctuations, 7×24 activity - that is supposed to simulate the aging process. But many aging processes, such as lubricant aging and migration in disks, aren’t easily simulated.
  • System issues. Drive vendors report that some 50-60% of “failed” drives have no trouble found in testing. Is the problem poor vendor test coverage, flaky system design, bad drivers or buggy firmware? Or all of the above? Component vendors don’t control their environment, but that’s what standards - like SATA and SAS - are supposed to fix.
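
How much the independence assumption in the RAID bullet above flatters the result can be sketched with the textbook single-parity model, MTTDL = MTBF² / (N × (N − 1) × MTTR). The drive count, MTBF and rebuild time below are illustrative assumptions, not measured values, and `correlation_factor` is a hypothetical knob standing in for "a second failure is more likely once one drive has died". The sketch also ignores read errors during rebuild, which would make the numbers worse still.

```python
HOURS_PER_YEAR = 24 * 365


def raid5_mttdl_hours(mtbf_hours: float, drives: int, rebuild_hours: float,
                      correlation_factor: float = 1.0) -> float:
    """Mean time to data loss for one single-parity group.

    Textbook model: data is lost when a second drive fails during the
    rebuild window. correlation_factor (hypothetical) multiplies the
    second-failure rate; 1.0 means fully independent failures.
    """
    if drives < 2:
        raise ValueError("a parity group needs at least two drives")
    first_failure_rate = drives / mtbf_hours
    second_failure_rate = correlation_factor * (drives - 1) / mtbf_hours
    p_loss_during_rebuild = second_failure_rate * rebuild_hours
    return 1.0 / (first_failure_rate * p_loss_during_rebuild)


if __name__ == "__main__":
    # Assumed: 8 drives, 1,000,000 h MTBF, 24 h rebuild.
    for factor in (1, 10, 100):
        years = raid5_mttdl_hours(1_000_000, 8, 24, factor) / HOURS_PER_YEAR
        print(f"correlation x{factor:<3} -> MTTDL about {years:,.0f} years")
```

In this model MTTDL is inversely proportional to the correlation factor, so every bit of optimism about independence goes straight into the headline number.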

Why should you care?
The problem with all these statistics is that they are almost meaningless to most users. Why? Because you aren’t buying hundreds or thousands of units.

You just buy 1 or a handful. If they work, you’re happy. If they don’t, you aren’t.

The fact that 2,000 other people are thrilled means nothing to YOU when your new SSD goes belly up. Your failure rate is 100% and your MTBF is 2 days.
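
The small-numbers point can be made concrete with a rough sketch. It assumes every unit fails independently at the same annual rate, which is itself optimistic, and the 3% rate is an illustrative assumption rather than a figure from any study.

```python
def p_at_least_one_failure(afr: float, units: int) -> float:
    """Chance that one or more of `units` devices fails within a year.

    Assumes independent failures with the same annual rate.
    """
    if not 0.0 <= afr <= 1.0:
        raise ValueError("afr must be between 0 and 1")
    if units < 0:
        raise ValueError("units must not be negative")
    return 1.0 - (1.0 - afr) ** units


if __name__ == "__main__":
    assumed_afr = 0.03  # illustrative 3% annual failure rate
    for units in (1, 4, 100, 2000):
        chance = p_at_least_one_failure(assumed_afr, units)
        print(f"{units:>5} units -> {chance:.1%} chance of at least one failure")
```

A buyer with thousands of units is all but certain to see failures and can plan around the average. A buyer with one unit only ever sees one of two outcomes.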

In mature markets, like disk drives, most vendors are similar because they have to be: OEM buyers know the real numbers and rate vendors accordingly. In new markets, like SSDs, the numbers are all over the map and no one’s talking.

A more reliable device improves your chances of a happy long-term relationship - but doesn’t guarantee it. Your mileage will vary.

The Storage Bits take
Losing a server to power fry or fan melt isn’t the end of the world. Losing your data is a lot worse.

Storage vendor MTBFs and MTTDL (mean time to data loss) numbers are meaningless for small installations. Nor will any storage vendor compensate you for the value of lost data. That’s how much they trust their numbers.

When it comes to your data, put your faith elsewhere. As the redoubtable David S. H. Rosenthal - former Sun Distinguished Engineer and employee #4 at Nvidia - puts it, only 3 things will improve your data protection chances:

  1. The more copies, the safer.
  2. The more independent the copies, the safer.
  3. The more frequently the copies are checked for corruption, the safer.
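
The third rule is the one most people skip, so here is a minimal sketch of what "checking the copies" might look like. It assumes you recorded a SHA-256 hash when the file was first written; the function names are made up for this example, and a real setup would run something like this on a schedule against copies on separate devices.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupt_copies(copies: list[Path], expected_sha256: str) -> list[Path]:
    """Return the copies that are missing or whose hash no longer matches."""
    bad = []
    for copy in copies:
        if not copy.is_file() or sha256_of(copy) != expected_sha256:
            bad.append(copy)
    return bad
```

Any path that comes back from `find_corrupt_copies` should be replaced from a copy that still matches, which only works if rules 1 and 2 were followed first.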

Remember, the Universe hates your data. Be safe out there.

Comments welcome, of course.

Robin Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small.

Disclosure

Robin Harris

Robin Harris is president of TechnoQWAN, a consulting and analyst firm in northern Arizona. He also writes StorageMojo.com, a blog which accepts advertising from companies in the storage industry, and has a 25-year history with IT vendors. He has many industry contacts, many of whom are friends and all of whom he has opinions about. Robin has relationships with many companies in the technology industry. Every company he writes about may have sought to influence his opinion through carefully-crafted marketing messages and self-serving white papers, gifts ranging from desk calendars, t-shirts, lunches and trips as well as analyst or consulting assignments. He also invests in some technology companies. He may accept payment for services in stock as well. Robin discloses financial investments in or client relationships with companies named in Storage Bits. To help readers sort out the gold from the dross in his writings, Robin tries to communicate his reasons as clearly as he can. If you agree, you are intelligent and discerning. If you disagree, well, you disagree. In all cases, Robin encourages readers to subject everything they read, see or hear on the internet or from politicians to some simple questions:

  • What assumptions are implicit in the world view and judgments of the author?
  • What, if any, is the factual basis for the opinions the author expresses?
  • Is it reasonable, logical and clear?

Your critical faculties: use ‘em or lose ‘em!

Biography

Robin Harris

Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small. He introduced a couple of multi-billion dollar storage products (DLT, the first Fibre Channel array) to market, as well as many smaller ones. Earlier he spent 10 years marketing servers and networks. After leaving corporate life he founded TechnoQWAN, a consulting and analyst firm. He also developed StorageMojo into one of the top storage industry blogs.

Robin writes, consults, coaches and lives among the mountains of northern Arizona.

5 Comments

Disregard MTBF's completely.
Joe.Smetona 19th Nov
Buy a perpendicular recording technology drive with FDB (fluid dynamic bearings). Personally, I like to replace 3.5" SATA drives with 2.5" SATA drives in desktops. The connectors are the same and I think the physics is better with lower power consumption. Ever feel the large IC on the bottom of a 3.5" drive when the drive is operating? It's the driver IC for the servos and it gets very hot. Less current = more reliability. Seagate Momentus also seems like a good alternative, and upgrading to 7200 RPM does not seem to detract from longevity, especially with FDB.

No, vendor MTBFs cannot be trusted, and the problem is much worse than it used to be. Working for a fault tolerant vendor 20 years ago, we tested drives fairly extensively, and at the time if a vendor MTBF number was better than another drive's number, one could truly assume that the drive was more reliable. Now we have all kinds of crap from the storage vendors, who use differing techniques and measurements for products, choosing the best number they can find instead of providing comparable numbers in their product line.

I tell people if you want to be reasonably sure your data is safe, then print it with the least fading ink you can on high quality paper and put it in a vault. No storage vendor provides even close to the reliability that they did 15 years ago.

Contributor
@oldsysprog
I'm not sure how the problem is worse: when I started in the business a competitive disk drive MTBF was 25,000 hours - about 3 years - at $20/MB.

Different is the word I'd use.

RE: Can vendor MTBFs be trusted?
oldsysprog 18th Nov
@R Harris

The thing was that 25 years ago, you might have a low MTBF but it was pretty accurate. Nowadays you have a high MTBF, but in my testing for clients I have found that in many cases the actual MTBF is less than it was 25 years ago. Some of the worst cases are the disk drives with "smart controllers" and the SSDs; both of them have numbers that are just sick.

For example, distributing your files and adding PAR files improves the chances of recovering all your data even if a significant chunk of them go.

Unpowered offline copies combined with online active copies improve survival by reducing the chances of surge failures.

And so on and so on.