Thanks to a collaboration between long-time reliability researcher Prof. Bianca Schroeder and enterprise storage leader NetApp, we now have the results of an in-depth study of enterprise SSD reliability, presented in a paper at this week's USENIX File and Storage Technologies (FAST '20) conference.
Most large-scale studies have focused on hard drives, or on SSDs in cloud data centers. The former still account for the vast majority of all stored data, and have become amazingly reliable as well as cost-effective.
But SSDs have made substantial inroads in the enterprise as well, with all-flash arrays delivering massive performance in compact, energy-efficient packages. That's what this study examines.
While there have been studies of SSD reliability in cloud data centers, their results aren't applicable elsewhere. Cloud vendors don't generally use standard SSDs (when you buy 100,000 at a pop, vendors build what you want), nor do they use the RAID protocols common in enterprise gear. And the SSDs they do employ sit in system architectures with optimizations not available to array vendors.
The study covers almost 1.4 million drives from three manufacturers, spanning 18 models, 12 capacities, and four flash technologies: SLC, consumer-class MLC (cMLC), enterprise-class MLC (eMLC), and 3D-TLC. Rich data from NetApp's automated telemetry includes drive replacements (with the reasons for them), bad blocks, usage, drive age, firmware versions, drive role (e.g., data, parity, or spare), and more.
These drives are installed in NetApp enterprise storage systems running the ONTAP OS, typically with data center mounting, cooling, and power conditioning. The SSDs operate in near-ideal environments rather than enduring the vagaries of, say, notebooks or business PCs, so reliability differences can be attributed to the SSDs themselves, not to customer infrastructure.
One of the most surprising findings is that drives with LESS usage experience higher replacement rates. Another surprise: contrary to the classic bathtub curve, in which failures peak at the very start of a drive's life, failure rates actually rise over the first year of field use before starting to decline.
Another finding is that SLC (single-level cell) drives, the most costly, are NOT more reliable than MLC drives. And while the newest high-density 3D-TLC (triple-level cell) drives have the highest overall replacement rate, the difference is likely caused not by the 3D-TLC technology itself, but by the capacity or cell density of the drive: higher-density cells exhibit more failures.
The best news concerns what was the biggest worry when SSDs arrived a decade ago: flash wear-out is a non-issue. Even after two to three years in service, less than two percent of rated life is consumed on average, and drives at the 99.9th percentile have consumed only 33 percent of theirs.
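For a sense of why wear-out barely registers, here's a back-of-the-envelope sketch. The endurance rating and daily write volume below are assumptions I've picked for illustration, not figures from the study:

```python
# Back-of-the-envelope wear estimate. The TBW rating and daily write
# volume are illustrative assumptions, not figures from the study.

RATED_TBW = 1_750          # assumed endurance rating: 1,750 TB written (TBW)
DAILY_WRITES_TB = 0.025    # assumed workload: 25 GB written per day
YEARS_IN_SERVICE = 3

total_writes_tb = DAILY_WRITES_TB * 365 * YEARS_IN_SERVICE
life_consumed = total_writes_tb / RATED_TBW

print(f"Writes after {YEARS_IN_SERVICE} years: {total_writes_tb:.1f} TB")
print(f"Rated life consumed: {life_consumed:.1%}")   # ~1.6%, in line with the study's average
```

Under these assumed numbers, exhausting the rated writes would take well over a century, which is why wear-out ranks so low among real-world replacement causes.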
Being enterprise arrays, NetApp's systems are zealous about detecting anomalous drive behavior and taking corrective action. A few major fault types, which the paper ranks from most serious to least, account for most replacements.
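To make that concrete, here is a minimal sketch of how threshold-based proactive replacement works in general. Everything in it, the fault names, the limits, and the should_replace helper, is hypothetical; ONTAP's actual criteria are proprietary and not detailed here:

```python
# Hypothetical sketch of threshold-based proactive drive replacement.
# Fault categories and limits are invented for illustration; they are
# not NetApp's actual criteria or the paper's ranking.

FAULT_LIMITS = {
    "command_timeouts": 5,
    "predicted_failures": 1,
    "uncorrectable_errors": 10,
    "bad_blocks": 500,
}

def should_replace(drive_counters: dict) -> bool:
    """Flag the drive once any monitored counter crosses its limit."""
    return any(
        drive_counters.get(fault, 0) >= limit
        for fault, limit in FAULT_LIMITS.items()
    )

print(should_replace({"bad_blocks": 120}))   # False: under every limit
print(should_replace({"bad_blocks": 650}))   # True: bad-block limit crossed
```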
There's much more to the study than this summary. But for people who aren't building storage arrays, it offers data that should be reassuring about drive life, while also suggesting some ways to get more economical storage.
Important and useful as this study is, for many of us hard drives will remain primary storage for decades to come. They are a little less reliable and use more power, but their cost per bit can't be beat.
Comments welcome, as always. What surprised you the most?