How SSD power faults scramble your data
Summary: Flash SSDs are non-volatile, so what could go wrong when power fails? A great deal, even on high-end 'enterprise' SSDs.
We've got over 50 years of experience with spinning disks in all kinds of conditions, ranging from notebooks to massive big iron arrays. SSDs, not so much. And boy, do we have a lot to learn.
Despite billions of dollars spent on backup power batteries and generators, power failures at major datacenters are not uncommon — just ask Netflix — so this is a real issue. Given proprietary Flash Translation Layers (FTL), there's no easy way to understand SSD behavior without testing.
In Understanding the Robustness of SSDs under Power Fault (PDF), researchers Mai Zheng and Feng Qin of Ohio State and Mark Lillibridge and Joseph Tucek of HP Labs look at how power faults affect flash-based SSDs. Short answer: It's not pretty.
The research
The team developed hardware to inject power faults and software to stress devices and check post-fault consistency. These were used to check 15 different SSDs and two hard drives.
The authors looked for several types of errors, including bit corruption, shorn writes, metadata corruption, and dead (bricked) devices. Write data was configured to enable detection of these and other errors.
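To make "write data was configured to enable detection" concrete, here is a minimal sketch of a self-describing record. It is not the paper's actual record format: the 512-byte sector, the 4,096-byte record, the header layout, and the names `make_record`, `checksum_ok`, and `classify` are all assumptions chosen for illustration. The idea is that each record carries a checksum and a sequence number, so a checker that knows the previous and the intended contents can tell a torn record from a corrupted one.

```python
import struct
import zlib

SECTOR = 512    # assumed sector size in bytes
RECORD = 4096   # assumed record size (eight sectors)
HEADER = struct.Struct("<IQQ")  # CRC32, sequence number, block address


def make_record(seq: int, block: int) -> bytes:
    """Build one record whose every sector depends on the sequence number."""
    fill_len = RECORD - HEADER.size
    # Repeat the sequence number so each sector is recognizably old or new.
    fill = (seq.to_bytes(8, "little") * (fill_len // 8 + 1))[:fill_len]
    body = struct.pack("<QQ", seq, block) + fill
    return struct.pack("<I", zlib.crc32(body)) + body


def checksum_ok(record: bytes) -> bool:
    """Return True if the stored CRC32 matches the rest of the record."""
    (stored,) = struct.unpack_from("<I", record)
    return stored == zlib.crc32(record[4:])


def classify(read: bytes, old: bytes, new: bytes) -> str:
    """Compare what was read back against the old and the intended record."""
    if read == new:
        return "new"
    if read == old:
        return "old"
    for off in range(0, RECORD, SECTOR):
        sector = read[off:off + SECTOR]
        if sector != old[off:off + SECTOR] and sector != new[off:off + SECTOR]:
            # This sector matches neither version: the bits themselves changed.
            return "bit corruption"
    # Every sector is intact, but some are old and some are new.
    return "shorn write"
```

In this sketch a record that mixes whole sectors from the old and new versions is reported as a shorn write, while a sector matching neither version is reported as bit corruption. A real checker would also have to cope with metadata corruption and devices that never come back.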
Three workloads — concurrent random writes, concurrent sequential writes, and single-threaded sequential writes — maximized the SSD's internal workloads. SSDs have several background tasks, such as garbage collection, running constantly to keep the SSD ready and organized.
Tested SSDs
15 different SSDs — 10 different models from five vendors — were tested. Prices ranged from 63¢/GB to $6.50/GB using both MLC and SLC flash. Two hard drives, one low end and one high end, were also tested.
Vendor names were not revealed.
Results
The good news: Only five of the six expected failure types were observed, and two of the 15 SSDs behaved as expected. The bad news: The other 13 SSDs had poor failure behavior.
Every failed device lost some amount of data or became massively corrupted under power faults.
Bit corruption hit three devices; three had shorn writes; eight had serializability errors; one device lost one third of its data; and one SSD bricked. The low-end hard drive had some unserializable writes, while the high-end drive had no power fault failures.
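One rough way to picture a serializability error, assuming a single writer that issues numbered writes strictly in order: after the fault, the drive's contents should look like some prefix of that sequence. If a later write survived but an earlier one did not, no serial order explains the result. The sketch below checks exactly that. The function name, the log format, and the single-writer assumption are hypothetical simplifications, not the paper's checker.

```python
def unserializable_blocks(write_log, found):
    """Return blocks whose contents are older than a prefix order allows.

    write_log: list of (seq, block) pairs in issue order, seq increasing.
    found: dict mapping block -> seq read back after the fault
           (a missing block means it was never seen written).
    """
    # The newest surviving write tells us how long the prefix must be.
    newest = max(found.values(), default=0)

    # Replay the log up to that write to get the expected seq per block.
    expected = {}
    for seq, block in write_log:
        if seq <= newest:
            expected[block] = seq

    # A block holding something older than expected breaks the prefix.
    return sorted(b for b, seq in expected.items() if found.get(b, 0) < seq)
```

For example, with writes 1 to 4 hitting blocks 10, 11, 10, and 12, finding write 4 on block 12 but only write 1 on block 10 flags block 10, because write 3 should have landed before write 4.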
The two SSDs that had no failures? Both were 2012-model-year MLC drives with a mid-range price of $1.17/GB.
The Storage Bits take
Because it is persistent, storage is the hardest part of IT infrastructure. There are myriad ways data gets scrambled.
This paper reminds us that SSDs are very new technology, with idiosyncrasies still being engineered around. We're still five years away from the average enterprise SSD being as reliable as the average enterprise hard drive is today.
Home and small office SSD users would be wise to have a battery backup on critical servers and desktops. Notebooks, of course, already have a battery backup.
Comments welcome, as always. The paper was presented at FAST 13. Have you seen any power-related SSD problems?
Talkback
I have not seen any problems, ...
Thanks for the heads up. I will put off any SSD upgrades a bit longer. I still run notebooks on mains power with no battery fairly often.
Had that happen several times with USB flash drives
I downloaded Microsoft SyncToy and now I regularly back up my data to two different hard drives on my network in addition to my thumb drive. Now I have three points of failure instead of just the one. Saved my bacon more than one time!
I also back up to two hard drives...
plug/play
write cache
Which OS was used for these tests?
OS
Hardly the same issue
Typically, flash drives are configured for "Quick Removal", which should write the cached data as soon as possible; if you unplug during this process, nobody but yourself is to blame. Waiting until the activity LED on the flash drive has been off for a few seconds, then unplugging it, should not cause harm. If there is no activity light on the drive, then I would always use the task bar removal option, just to be safe.
A little bit more complicated are portable USB hard drives; they are not optimized for quick removal, so it may take a while until the cached data is written. In that case, the "Safely remove hardware" function should always be used.
Regardless, none of those issues have anything to do with the design of "Plug and Play", which is only supposed to make hardware detection/installation easier; it is not meant for hot-unplug operations at all!
Exactly
Occasionally it won't allow you to unmount because something is tying up the drive even though it's not showing as active. Make sure you don't have any open docs or apps associated with the drive.
Except that's not the failure he's talking about...
Also, he's not talking about simple data corruption, which NTFS filesystems can actually handle rather well since they journal the changes and can reconstruct the correct state from the secondary copies of the MFT and other tables (same with extended HFS on Macs). He's talking about the actual SSD being damaged. Unless you're doing something pretty spectacular, powering off a hard drive really can't harm the drive; it just quickly pulls the head back into park. So as long as you're not shaking the drive or banging it against the desk, power fails should be harmless to a real hard drive.
he is talking about the same issue
Nothing to do with file system data corruption, which is something else.
USB connectors .......
This will at least reduce the chances of damage.
Re: This will at least reduce the chances of damage.
Having written NVM device drivers for safe applications...
I have found you do not want to have any pending operations (clocking in data or a page write in progress) when power finally does leave. Once in a while (not often, but enough to cause pain), all sorts of nasty stuff will happen.
Just had an experience...
Yep
That old saying......
Act accordingly.
With any storage medium
backup ^ 2
power faults??