X
Tech

SSD reliability in the real world: Google's experience

Using data from millions of drive days in Google datacenters, a new paper offers production lifecycle data on SSD reliability. Surprise! SSDs fail differently than disks - and in a dangerous way. Here's what you need to know.
Written by Robin Harris, Contributor

SSDs are a new phenomenon in the datacenter. We have theories about how they should perform, but until now, little data. That's just changed.

The FAST 2016 paper Flash Reliability in Production: The Expected and the Unexpected, (the paper is not available online until Friday) by Professor Bianca Schroeder of the University of Toronto, and Raghav Lagisetty and Arif Merchant of Google, covers:

  • Millions of drive days over 6 years
  • 10 different drive models
  • 3 different flash types: MLC, eMLC and SLC
  • Enterprise and consumer drives

Key conclusions

  • Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number.
  • Good news: Raw Bit Error Rate (RBER) increases slower than expected from wearout and is not correlated with UBER or other failures.
  • High-end SLC drives are no more reliable that MLC drives.
  • Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher (see below for what this means).
  • SSD age, not usage, affects reliability.
  • Bad blocks in new SSDs are common, and drives with a large number of bad blocks are much more likely to lose hundreds of other blocks, most likely due to die or chip failure.
  • 30-80 percent of SSDs develop at least one bad block and 2-7 percent develop at least one bad chip in the first four years of deployment.

The Storage Bits take

Two standout conclusions from the study. First, that MLC drives are as reliable as the more costly SLC "enteprise" drives. This mirrors hard drive experience, where consumer SATA drives have been found to be as reliable as expensive SAS and Fibre Channel drives.

One of the major reasons that "enterprise" SSDs are more expensive is due to greater over-provisioning. SSDs are over-provisioned for two main reasons: to allow for ample bad block replacement caused by flash wearout; and, to ensure that garbage collection does not cause write slowdowns.

The paper's second major conclusion, that age, not use, correlates with increasing error rates, means that over-provisioning for fear of flash wearout is not needed. None of the drives in the study came anywhere near their write limits, even the 3,000 writes specified for the MLC drives.

But it isn't all good news. SSD UBER rates are higher than disk rates, which means that backing up SSDs is even more important than it is with disks. The SSD is less likely to fail during its normal life, but more likely to lose data.

I'll be digging deeper into the data this weekend. Stay tuned!

Comments welcome, as always.

Editorial standards