How SSD power faults scramble your data

Flash SSDs are non-volatile, so what could go wrong when power fails? A great deal, even on high-end 'enterprise' SSDs.

We've got over 50 years of experience with spinning disks in all kinds of conditions, ranging from notebooks to massive big iron arrays. SSDs, not so much. And boy, do we have a lot to learn.

Despite billions of dollars spent on backup batteries and generators, power failures at major datacenters are not uncommon (just ask Netflix), so this is a real issue. And because each vendor's Flash Translation Layer (FTL) is proprietary, there's no easy way to understand SSD behavior without testing.

In "Understanding the Robustness of SSDs under Power Fault" (PDF), researchers Mai Zheng and Feng Qin of Ohio State and Mark Lillibridge and Joseph Tucek of HP Labs look at how power faults affect flash-based SSDs. Short answer: It's not pretty.

The research

The team developed hardware to inject power faults and software to stress devices and check post-fault consistency. These were used to check 15 different SSDs and two hard drives.

The authors looked for several types of errors, including bit corruption, shorn writes, metadata corruption, and dead (bricked) devices. The data written to each drive was structured so that these and other errors could be detected after a fault.
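To make the idea of detectable write data concrete, here is a minimal Python sketch. It is an illustration under my own assumptions, not the researchers' actual record format: the 512-byte sector and 4 KB block sizes, the header layout, and the names make_block and classify_block are all hypothetical. Each sector carries its block address, a write sequence number, its position in the block, and a CRC32, so a checker can tell a flipped bit from a block that was only partly updated.

```python
import struct
import zlib

# Assumed geometry for illustration only: 8 sectors of 512 bytes per 4 KB block.
SECTOR_SIZE = 512
SECTORS_PER_BLOCK = 8
# Hypothetical header: block address, sequence number, sector index, CRC32.
_HEADER = struct.Struct("<QQII")
_FIELDS = struct.Struct("<QQI")  # the header fields covered by the CRC


def make_block(block_addr: int, seq: int) -> bytes:
    """Build one self-describing block for write number `seq`."""
    sectors = []
    for idx in range(SECTORS_PER_BLOCK):
        fields = _FIELDS.pack(block_addr, seq, idx)
        # Deterministic filler so the payload differs between writes.
        payload = bytes((seq + idx + i) % 256
                        for i in range(SECTOR_SIZE - _HEADER.size))
        crc = zlib.crc32(fields + payload)
        sectors.append(_HEADER.pack(block_addr, seq, idx, crc) + payload)
    return b"".join(sectors)


def classify_block(data: bytes, block_addr: int) -> str:
    """Classify a block read back after a power fault."""
    seqs = set()
    for idx in range(SECTORS_PER_BLOCK):
        sector = data[idx * SECTOR_SIZE:(idx + 1) * SECTOR_SIZE]
        addr, seq, sidx, crc = _HEADER.unpack_from(sector)
        payload = sector[_HEADER.size:]
        if zlib.crc32(_FIELDS.pack(addr, seq, sidx) + payload) != crc:
            return "bit corruption"   # sector contents do not match their checksum
        if addr != block_addr or sidx != idx:
            return "misplaced data"   # intact sector found at the wrong location
        seqs.add(seq)
    # Intact sectors from two different writes mean the block was torn.
    return "ok" if len(seqs) == 1 else "shorn write"
```

For example, splicing the first three sectors of a newer block onto the remaining sectors of an older one yields "shorn write", while flipping a single payload bit yields "bit corruption". This sketch does not treat a never-written (all-zero) block as a special case.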

Three workloads (concurrent random writes, concurrent sequential writes, and single-threaded sequential writes) were used to keep the SSDs' internal machinery as busy as possible. SSDs run several background tasks, such as garbage collection, constantly to keep the drive organized and ready for new writes.

Tested SSDs

Fifteen SSDs, covering 10 models from five vendors, were tested. Prices ranged from 63¢/GB to $6.50/GB, and the drives used both MLC and SLC flash. Two hard drives, one low end and one high end, were also tested.

Vendor names were not revealed.

Results

The good news: Of six expected failure types, only five were observed, and two of the SSDs behaved as expected. The bad news: The other 13 SSDs had poor failure behavior.

Every failed device lost some amount of data or became massively corrupted under power faults.

Bit corruption hit three devices; three had shorn writes; eight had serializability errors; one device lost one third of its data; and one SSD bricked. The low-end hard drive had some unserializable writes, while the high-end drive had no power fault failures.
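Roughly speaking, a serializability error means the drive's contents after the fault do not match any point in the order in which writes were acknowledged: a later write survived while an earlier one did not. The Python sketch below shows one simple way to check for that. It is my own illustration, not the authors' checker, and it assumes strictly increasing sequence numbers, a log of acknowledged writes, and the hypothetical function name unserializable_blocks. It checks ordering only; it does not flag acknowledged writes lost cleanly from the tail of the log.

```python
def unserializable_blocks(acked, observed):
    """Return block addresses that hold stale data relative to a newer survivor.

    acked:    list of (seq, block_addr) pairs in acknowledgement order,
              with seq strictly increasing (an assumption of this sketch).
    observed: dict mapping block_addr -> seq found on the drive after the
              fault; a missing block counts as seq 0 (never written).
    """
    # The newest write that demonstrably reached the media.
    newest = max(observed.values(), default=0)

    # If the drive preserved write order, every block must hold its latest
    # acknowledged write at or before that newest survivor.
    expected = {}
    for seq, addr in acked:
        if seq <= newest:
            expected[addr] = seq

    return sorted(addr for addr, seq in expected.items()
                  if observed.get(addr, 0) < seq)
```

With writes acknowledged in the order [(1, 10), (2, 11), (3, 10), (4, 12)], a drive that shows write 4 on block 12 but only write 1 on block 10 is flagged for block 10, because write 3 should have landed before write 4.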

The two SSDs that had no failures? Both were 2012-model-year MLC drives with a mid-range price of $1.17/GB.

The Storage Bits take

Because it is persistent, storage is the hardest part of IT infrastructure. There are myriad ways data gets scrambled.

This paper reminds us that SSDs are very new technology, with idiosyncrasies still being engineered around. We're still five years away from the average enterprise SSD being as reliable as the average enterprise hard drive is today.

Home and small office SSD users would be wise to have a battery backup on critical servers and desktops. Notebooks, of course, already have a battery backup.

Comments welcome, as always. The paper was presented at FAST '13. Have you seen any power-related SSD problems?
