The enterprising folks at Backblaze continue to surprise. In a blog post released this morning, Backblaze describes its experience with enterprise drives.
Annual Failure Rate, or AFR, is the preferred metric for measuring drive life. Unlike MTBF, it is easy to measure and translates readily into expected behavior. AFR is the number of drive failures divided by the number of drive years.
One drive running for 12 months is one drive year. So are 12 drives running for one month each.
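The definition above can be sketched in a few lines of Python. The function names `drive_years` and `annual_failure_rate` are my own, chosen for illustration, and the sketch assumes every drive in a group has been in service for the same number of months.

```python
def drive_years(drive_count: int, months_in_service: float) -> float:
    """Drive years accumulated by a group of drives with equal service time."""
    return drive_count * months_in_service / 12.0


def annual_failure_rate(failures: int, total_drive_years: float) -> float:
    """AFR as defined above: failures divided by drive years."""
    if total_drive_years <= 0:
        raise ValueError("total_drive_years must be positive")
    return failures / total_drive_years


# One drive for 12 months and 12 drives for one month are both one drive year.
one_for_a_year = drive_years(1, 12)
twelve_for_a_month = drive_years(12, 1)
print(one_for_a_year, twelve_for_a_month)  # prints: 1.0 1.0
```

A real fleet adds and retires drives all the time, so in practice you would sum the service time of each drive individually rather than multiply a count by a single duration.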
Backblaze currently runs about 25,000 consumer SATA drives. Those drives have accumulated almost 15,000 drive years, and Backblaze has replaced 613 of them.
They have a smaller number of enterprise drives: six shelves of Dell PowerVault storage and one EMC storage system with 124 enterprise drives. They also have one Backblaze storage pod (45 drives) running enterprise drives for experimental purposes.
The enterprise drives are a mix: SAS 15k RPM drives, SATA 7200 RPM, SAS 10k RPM, and a few SSDs thrown in for good measure. With the exception of those in the Backblaze storage pod, all of the drives are installed in enterprise-class enclosures with excellent cooling, vibration damping and high-quality power supplies.
Enterprise RAID drives are designed to limit retries so a single failing drive doesn't drag an entire LUN down. But this doesn't seem to make a difference in observed drive failure modes.
As Gleb Budman, Backblaze CEO and co-founder, put it in an email to me:
"We have limited visibility into the drive stats in the commercial storage systems as the vendors have chosen to not expose the SMART stats. We do get errors in our logs from the enterprise drives just as with the consumer drives and these are typically read/write timeout errors that do not appear to be qualitatively different between the drives."
Backblaze's enterprise drives have accumulated 368 drive years and had 17 failures. That is an AFR of 4.6%, vs. the 4.2% AFR observed on consumer drives.
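To see how much weight a 4.6% vs. 4.2% gap can bear, here is a rough sketch that recomputes both rates and puts an approximate 95% interval around each. It is my own back-of-the-envelope check, not Backblaze's method: it assumes failures behave like Poisson counts and uses a simple normal approximation. It also takes "almost 15,000" as exactly 15,000 consumer drive years, which yields about 4.1% rather than the quoted 4.2%; the exact consumer figure is not given here.

```python
import math


def afr_with_interval(failures: int, drive_years: float, z: float = 1.96):
    """Return (rate, low, high): AFR plus a rough normal-approximation interval.

    Assumption: failures are Poisson-distributed, so the standard
    deviation of the failure count is about sqrt(failures).
    """
    rate = failures / drive_years
    half_width = z * math.sqrt(failures) / drive_years
    return rate, max(0.0, rate - half_width), rate + half_width


# Figures from the article; 15_000 is a rounded stand-in for "almost 15,000".
enterprise = afr_with_interval(17, 368)
consumer = afr_with_interval(613, 15_000)

for label, (rate, low, high) in (("enterprise", enterprise), ("consumer", consumer)):
    print(f"{label}: AFR {rate:.1%} (roughly {low:.1%} to {high:.1%})")
```

With only 17 failures, the enterprise interval spans roughly 2.4% to 6.8%, which comfortably contains the consumer rate. On these numbers the two populations cannot be told apart.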
The Storage Bits take
There are probably a couple of thousand engineers in the storage industry who know the facts about the actual failure rates of different drives and different manufacturers. That's because both vendors and OEM buyers track it.
But until Backblaze came along, no one was willing to talk about it. Because Backblaze buys its drives on the open market and likes publicity, it can talk freely.
Backblaze's conclusions are not surprising. The only other major study (see "Everything you know about disks is wrong") also found no significant difference in drive life.
So what does this mean to you? Simply this: focus on the cost and performance of drives, not their putative MTBF or warranty periods.
Disk drives are incredibly complex precision devices that make the finest Swiss watch look like a dump truck in comparison. And they cost a lot less.
Yes, drives fail. But you should always have at least two copies of your data on different drives. Disk storage is cheap: buy plenty!
Comments welcome, as always. Read the Backblaze post here. What has your experience been with enterprise drive reliability?