Thanks to a collaboration between long-time reliability researcher Prof. Bianca Schroeder and enterprise storage leader NetApp, we now have the results of an in-depth study, presented in a paper at this week's USENIX File and Storage Technologies (FAST '20) conference.
Most large-scale studies have focused on hard drives, or on SSDs in cloud data centers. Hard drives still account for the vast majority of all stored data, and have become amazingly reliable as well as cost-effective.
But SSDs have made substantial inroads in the enterprise as well, with all-flash arrays providing massive performance in compact and energy-efficient packages. That's what this study examines.
While there have been studies of SSD reliability in cloud data centers, their results aren't applicable elsewhere. Cloud providers don't generally use standard SSDs - when you buy 100,000 at a pop, SSD vendors give you what you want - nor do they use the RAID protocols common in enterprise gear. And the SSDs they do employ are in system architectures that include optimizations not available to array vendors.
The study covers almost 1.4 million drives from three different manufacturers, 18 different models, 12 different capacities, and four flash technologies: SLC, cMLC (consumer-class), eMLC (enterprise-class), and 3D-TLC. Rich data, provided by NetApp's automated telemetry, includes drive replacements (including reasons for replacements), bad blocks, usage, drive age, firmware versions, drive role (e.g., data, parity or spare), and a number of other things.
These drives are installed in NetApp enterprise storage systems running their ONTAP OS, typically with data center mounting, cooling, and power conditioning. The SSDs are operating in near-ideal environments, rather than facing the vagaries of, say, notebooks or business PCs, so reliability differences can be attributed to the SSDs, not customer infrastructures.
One of the most surprising findings is that drives with LESS usage experience higher replacement rates. Another surprise is that infant mortality actually rises over the first year of field use before starting to decline.
Another finding is that SLC (single-level cell) drives, the most costly, are NOT more reliable than MLC drives. And while the newest high-density 3D-TLC (triple-level cell) drives have the highest overall replacement rate, the difference is likely caused not by the 3D-TLC technology, but by the capacity level or cell size employed in the drive. Higher-density cells exhibit more failures.
- Not all SSDs offer the same reliability. Annual Replacement Rates (ARR) range from as little as 0.07 percent to nearly 1.2 percent, with an average ARR across the entire population of 0.22 percent.
- Even for drive models from the same manufacturer, with the same flash technology, age, and similar capacity, ARR can vary dramatically, from 0.53 percent for one 15TB drive to 1.13 percent for its 15.3TB stablemate.
- Bad block provisioning is quite generous. Even after several years, the average share of spare blocks consumed is less than 15 percent. Even the drives at the 99.9th percentile of consumed spare blocks have consumed only 33 percent of their spare blocks.
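For readers who want to compare replacement rates like those above with their own fleet, here is a minimal sketch of how an annual replacement rate is commonly computed: replacements divided by drive-years in service. The function name and the sample fleet are illustrative assumptions, not data or methods taken from the study.

```python
def annual_replacement_rate(replacements: int, drive_days: float) -> float:
    """Return ARR as a percentage: replacements per drive-year of service."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return 100.0 * replacements / drive_years


# Hypothetical fleet: 10,000 drives, each in service for 365 days, 22 replaced.
arr = annual_replacement_rate(replacements=22, drive_days=10_000 * 365)
print(f"ARR: {arr:.2f}%")
```

Counting drive-days rather than drives keeps the rate honest when drives enter or leave service partway through the year.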
The best news concerns what was the biggest worry when SSDs came out a decade ago: flash wear-out is a non-issue. Even after 2-3 years, less than two percent of the rated life is consumed on average. Even the drives in the 99.9th percentile have consumed only 33 percent of their rated life.
Being enterprise arrays, NetApp's systems are zealous in detecting anomalous drive behavior and taking corrective action. A few major fault types lead to most replacements; they are listed below from most to least serious.
- In almost 33 percent of the failures, the SCSI layer saw a hardware error reported by the SSD, which triggered immediate SSD replacement and data reconstruction. Such errors might, for example, be caused by ECC errors originating from the drive's DRAM. Total drive failure - no response - is rare, at less than 1 percent of all failures.
- Almost 14 percent of drive replacements were caused by lost writes, where the error detection code (EDC) on the 4k block doesn't match the data. Once a drive has several of these errors, it is marked failed.
- Aborted commands led to another 14 percent of drives being replaced. If the host sends a write, but the data isn't written, that's marked as an aborted command.
- The least serious failures, about 25 percent of the cases, are those where drives are replaced out of an abundance of caution, based on predictive and threshold data. The drive hasn't yet failed, but the metrics are concerning, so the system initiates a disk copy so the user never sees a problem.
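To make the lost-write case above concrete, here is a rough sketch of the general idea: keep an error detection code for each 4k block, compare it on read, and mark the drive failed after repeated mismatches. This is not NetApp's implementation. CRC-32 is a stand-in for whatever EDC ONTAP uses, and the threshold of three is an assumption, since the article only says "several."

```python
import zlib

BLOCK_SIZE = 4096        # 4k block, as described in the article
MISMATCH_THRESHOLD = 3   # hypothetical value; the article only says "several"


def edc(block: bytes) -> int:
    """Stand-in error detection code (CRC-32); an assumption, not ONTAP's code."""
    return zlib.crc32(block)


class DriveMonitor:
    """Counts EDC mismatches on reads and flags the drive once they pile up."""

    def __init__(self) -> None:
        self.mismatches = 0
        self.failed = False

    def verify_read(self, data: bytes, stored_edc: int) -> bool:
        """Return True if the data matches the EDC recorded at write time."""
        ok = edc(data) == stored_edc
        if not ok:
            self.mismatches += 1
            if self.mismatches >= MISMATCH_THRESHOLD:
                self.failed = True
        return ok


# The host wrote new data, but the drive silently kept the old contents.
new_data = b"\x01" + bytes(BLOCK_SIZE - 1)
stale_data = bytes(BLOCK_SIZE)
stored = edc(new_data)   # EDC recorded for the data that should be on disk

monitor = DriveMonitor()
for _ in range(MISMATCH_THRESHOLD):
    monitor.verify_read(stale_data, stored)
print("drive marked failed:", monitor.failed)
```

The point of the counter is that a single mismatch is handled as a recoverable event, while a pattern of them is treated as a failing drive.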
The Storage Bits take
There's much more to the study than the summary here. But for people who aren't building storage arrays, the study offers data that should reassure on drive life, while also suggesting some ways to get more economical storage.
While this is an important and useful study, for many of us hard drives will remain primary storage for decades to come. While they are a little less reliable, and use more power, their cost per bit can't be beat.
Comments welcome, as always. What surprised you the most?