How SSD power faults scramble your data

Summary:Flash SSDs are non-volatile, so what could go wrong when power fails? A great deal, even on high-end 'enterprise' SSDs.

We've got over 50 years of experience with spinning disks in all kinds of conditions, ranging from notebooks to massive big iron arrays. SSDs, not so much. And boy, do we have a lot to learn.

Despite billions of dollars spent on backup power batteries and generators, power failures at major datacenters are not uncommon — just ask Netflix — so this is a real issue. Given proprietary Flash Translation Layers (FTL), there's no easy way to understand SSD behavior without testing.

In Understanding the Robustness of SSDs under Power Fault (PDF), researchers Mai Zheng and Feng Qin of Ohio State and Mark Lillibridge and Joseph Tucek of HP Labs look at how power faults affect flash-based SSDs. Short answer: It's not pretty.

The research

The team developed hardware to inject power faults and software to stress devices and check post-fault consistency. These were used to check 15 different SSDs and two hard drives.

The authors looked for several types of errors, including bit corruption, shorn writes, metadata corruption, and dead (bricked) devices. Write data was configured to enable detection of these and other errors.

Three workloads — concurrent random writes, concurrent sequential writes, and single-threaded sequential writes — maximized the SSD's internal workloads. SSDs have several background tasks, such as garbage collection, running constantly to keep the SSD ready and organized.

Tested SSDs

15 different SSDs — 10 different models from five vendors — were tested. Prices ranged from 63¢/GB to $6.50/GB using both MLC and SLC flash. Two hard drives, one low end and one high end, were also tested.

Vendor names were not revealed.

Results

The good news: Of six expected failures, only five were observed; and two of the devices behaved as expected. The bad news: 13 of the devices had poor failure behavior.

Every failed device lost some amount of data or became massively corrupted under power faults.

Bit corruption hit three devices; three had shorn writes; eight had serializability errors; one device lost one third of its data; and one SSD bricked. The low-end hard drive had some unserializable writes, while the high-end drive had no power fault failures.

The two SSDs that had no failures? Both were MLC 2012 model years with a mid-range — $1.17/GB — price.

The Storage Bits take

Because it is persistent, storage is the hardest part of IT infrastructure. There are myriad ways data gets scrambled.

This paper reminds us that SSDs are very new technology, with idiosyncrasies still being engineered around. We're still five years away from the average enterprise SSD being as reliable as the average enterprise hard drive is today.

Home and small office SSD users would be wise to have a battery backup on critical servers and desktops. Notebooks, of course, already have a battery backup.

Comments welcome, as always. The paper was presented at FAST 13. Have you seen any power-related SSD problems?

Topics: Storage, Data Centers, Disaster Recovery, Emerging Tech, Hardware, Outage

About

Harris has been working with computers for over 35 years and selling and marketing data storage for over 30 in companies large and small. He introduced a couple of multi-billion dollar storage products (DLT, the first Fibre Channel array) to market, as well as a many smaller ones. Earlier he spent 10 years marketing servers and networks.... Full Bio

zdnet_core.socialButton.googleLabel Contact Disclosure

Kick off your day with ZDNet's daily email newsletter. It's the freshest tech news and opinion, served hot. Get it.

Related Stories

The best of ZDNet, delivered

You have been successfully signed up. To sign up for more newsletters or to manage your account, visit the Newsletter Subscription Center.
Subscription failed.