How SSD power faults scramble your data

Summary: Flash SSDs are non-volatile, so what could go wrong when power fails? A great deal, even on high-end 'enterprise' SSDs.

We've got over 50 years of experience with spinning disks in all kinds of conditions, ranging from notebooks to massive big iron arrays. SSDs, not so much. And boy, do we have a lot to learn.

Despite billions of dollars spent on backup power batteries and generators, power failures at major datacenters are not uncommon — just ask Netflix — so this is a real issue. Given proprietary Flash Translation Layers (FTL), there's no easy way to understand SSD behavior without testing.

In Understanding the Robustness of SSDs under Power Fault (PDF), researchers Mai Zheng and Feng Qin of Ohio State and Mark Lillibridge and Joseph Tucek of HP Labs look at how power faults affect flash-based SSDs. Short answer: It's not pretty.

The research

The team developed hardware to inject power faults and software to stress devices and check post-fault consistency. These were used to check 15 different SSDs and two hard drives.

The authors looked for several types of errors, including bit corruption, shorn writes, metadata corruption, and dead (bricked) devices. Write data was configured to enable detection of these and other errors.
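
To get a feel for how write data can be made self-checking, here is a minimal sketch in Python. It is not the researchers' tooling. It assumes 4096-byte blocks made of 512-byte sectors, and that the checker knows both the previous and the new version of each block. The names make_block and classify, and the 16-byte stamp layout, are made up for illustration.

```python
import struct

SECTOR = 512          # assumed sector size in bytes
BLOCK_SIZE = 4096     # assumed block size in bytes (8 sectors)

def make_block(address: int, seq: int) -> bytes:
    """Fill a block with a repeating 16-byte (address, sequence) stamp."""
    stamp = struct.pack("<QQ", address, seq)
    return stamp * (BLOCK_SIZE // len(stamp))

def classify(raw: bytes, address: int, old_seq: int, new_seq: int) -> str:
    """Compare what was read back against the old and new versions."""
    if len(raw) != BLOCK_SIZE:
        raise ValueError("expected exactly one block")
    old = make_block(address, old_seq)
    new = make_block(address, new_seq)
    if raw == new:
        return "new"
    if raw == old:
        return "old"
    # Every sector intact but from mixed versions: the write was cut short.
    mixed = all(
        raw[i:i + SECTOR] in (old[i:i + SECTOR], new[i:i + SECTOR])
        for i in range(0, BLOCK_SIZE, SECTOR)
    )
    return "shorn" if mixed else "corrupt"
```

A block that reads back as "old" is only a failure if the drive had already acknowledged the new write. "shorn" means part of the block was updated and the rest was not; "corrupt" means the contents match neither version.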

Three workloads — concurrent random writes, concurrent sequential writes, and single-threaded sequential writes — maximized the SSD's internal workloads. SSDs have several background tasks, such as garbage collection, running constantly to keep the SSD ready and organized.

Tested SSDs

15 different SSDs — 10 different models from five vendors — were tested. Prices ranged from 63¢/GB to $6.50/GB using both MLC and SLC flash. Two hard drives, one low end and one high end, were also tested.

Vendor names were not revealed.

Results

The good news: Of six expected failure types, only five were observed, and two of the SSDs behaved as expected. The bad news: 13 of the 15 SSDs had poor failure behavior.

Every failed device lost some amount of data or became massively corrupted under power faults.

Bit corruption hit three devices; three had shorn writes; eight had serializability errors; one device lost one third of its data; and one SSD bricked. The low-end hard drive had some unserializable writes, while the high-end drive had no power fault failures.
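
One simple way to think about a serializability error: if writes are issued one at a time and each is acknowledged before the next starts, then after a power fault the surviving writes should form an unbroken prefix of that sequence. The Python sketch below checks that property. It is a simplification under that one-at-a-time assumption, not the paper's checker, and the function name is hypothetical.

```python
def find_serialization_violations(persisted: list[bool]) -> list[int]:
    """Return indices of lost writes that precede a surviving later write.

    persisted[i] is True if write i (in issue order) is found on the
    device after the power fault.
    """
    # Index of the last write that survived, or -1 if none did.
    last_survivor = max(
        (i for i, ok in enumerate(persisted) if ok), default=-1
    )
    # Any missing write before that point breaks the serial order.
    return [i for i in range(last_survivor) if not persisted[i]]
```

Writes lost at the very end of the sequence are not flagged here. They may still be data loss if the drive acknowledged them, but they do not break the ordering.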

The two SSDs that had no failures? Both were MLC 2012 model years with a mid-range — $1.17/GB — price.

The Storage Bits take

Because it is persistent, storage is the hardest part of IT infrastructure. There are myriad ways data gets scrambled.

This paper reminds us that SSDs are very new technology, with idiosyncrasies still being engineered around. We're still five years away from the average enterprise SSD being as reliable as the average enterprise hard drive is today.

Home and small office SSD users would be wise to have a battery backup on critical servers and desktops. Notebooks, of course, already have a battery backup.

Comments welcome, as always. The paper was presented at FAST 13. Have you seen any power-related SSD problems?

Topics: Storage, Data Centers, Emerging Tech, Hardware, Outage, Disaster Recovery

About

Robin Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small.


Talkback

35 comments
  • I have not seen any problems, ...

    but then I am still using spinning disks. ;-)

    Thanks for the heads up. I will put off any SSD upgrades a bit longer. I still run notebooks on mains power with no battery fairly often.
    D.T.Long
  • Had that happen several times with USB flash drives

    Power surges, and not unmounting the drive from the task bar (yeah, plug and play my ass)...the data will corrupt at some point and you'll lose something.

    I downloaded Microsoft SyncToy and now I regularly back up my data on two different hard drives on my network in addition to my thumb drive. Now I have three points of failure instead of just the one... saved my bacon more than one time!!
    toddbottom7
    • I also back up to two hard drives...

      But I got to worrying about a fire or something wiping them (and the computer) all out at once, so last year I started putting my critical data onto an encrypted virtual drive using TrueCrypt and backing that up to the cloud (Google does it automatically). I'm no expert, so I'm just having to hope that there's no downside to doing that.
      nfordzdn
    • plug/play

      I've never had a problem as long as I set the device type to use 'immediate writes'. It's a tad slower than cached writes on huge data, but not much slower than what the devices themselves are limited by.
      Nancy Smith
      • write cache

        The MS write cache has always been a problem to us. When we install our proprietary application we turn off cached writes to increase data integrity.
        mswift@...
        • Which OS was used for these tests?

          Didn't see any mention of the OS this testing was done under.
          ldillon5
          • OS

            The paper says "Debian Linux 6.0 with Kernel 2.6.32."
            ac3_z
    • Hardly the same issue

      It's "Plug and Play", not "Unplug and Play"; two completely different issues at hand here. So please spare me the "my ass" comments.

      Typically, flash drives are configured for "Quick Removal", which should write the cached data as soon as possible; if you unplug during this process, nobody but yourself is to blame. Waiting until the activity LED on the flash drive has been off for a few seconds, then unplugging it, should not cause harm. If there is no activity light on the drive, then I would always use the task bar removal option, just to be safe.
      A little bit more complicated are portable USB hard drives; they are not optimized for quick removal, so it may take a while until the cached data is written. In that case, the "Safely remove hardware" function should always be used.

      Regardless, none of those issues have anything to do with the design of "Plug and Play", which is only supposed to make hardware detection/installation easier; it is not meant for hot-unplug operations at all!
      JP-1973
      • Exactly

        Even if the light isn't on the USB drive unmount it using the Task Bar....I haven't had a corruption since starting to do that all the time.

        The problem will occur occasionally that it won't allow you to unmount because something is tying up the drive even though it's not showing as active....make sure you don't have any open docs or apps associated with the drive.
        toddbottom7
      • Except that's not the failure he's talking about...

        He's talking about *power failure*, not unexpectedly disconnecting the drive (although that's a good simulation of the conditions in question).

        Also, he's not talking about simple data corruption (which NTFS filesystems can actually handle rather well since they journal the changes and can reconstruct the correct state from the secondary copies of the MFT and other tables - same with extended HFS on Macs) he's talking about the actual SSD being damaged - and unless you're doing something pretty spectacular, powering off a hard drive really can't harm the drive - it just fast pulls the head back into park. So as long as you're not shaking the drive or banging it against the desk, power fails should be harmless to a real hard drive.
        TheWerewolf
        • he is talking about the same issue

          When you disconnect the USB drive, you remove power. If it does any operation while this happens, chances are it will die. Some drives handle this better than others.

          Nothing to do with file system data corruption, which is something else.
          danbi
          • USB connectors .......

            are designed to disconnect the signal first, then the power when unplugging (reverse when connecting).

            This will at least reduce the chances of damage.
            D.T.Long
          • Re: This will at least reduce the chances of damage.

            That only reduces the chance of hardware damage, nothing more.
            ldo17
  • Having written NVM device drivers for safe applications...

    Low power and loss of power while writing to the NVM chip are among the biggest headaches I face. The worst is when you still have 200ms of data to write out (typically due to poor app level design/resource management) and you have only 50ms to write out critical level data and shut everything down in an orderly way.

    I have found you do not want to have any pending operations (clocking in data or a page write in progress) when power finally does leave. Once in a while (not often but enough to cause pain), you will get all sorts of nasty stuff happening.
    Bruizer
  • Just had an experience...

    I sent my not-even-one-year-old Lenovo U300s to have the SSD replaced after it failed. Luckily I use Carbonite to back it up, or lots of files would have been obliterated in less than a second...
    Soapy Buoy
    • Yep

      I use Backblaze. No matter the tool, I tell everyone to backup automatically. It's not "if" a storage device will fail, it's "when" a storage device will fail.
      Regulator1956
      • That old saying......

        There are two types of drives. Those that HAVE failed and those that WILL fail.

        Act accordingly.
        D.T.Long
  • With any storage medium

    There is a risk of data loss. Agreed, we do have much to learn about the reliability of SSDs, but to me, the benefits they offer in portable devices outweigh the risks shown here. I've replaced my spinners with SSDs on my netbook and laptop, and am delighted with the increase in battery life and speed. It will be a long time before I do that to my desktop machines, though. As always, no matter what type of drive you are using, there are three words to remember: Backup, backup, backup!
    mike.motes@...
    • backup ^ 2

      yep, nothing feels better than knowing your hard info effort is stashed safely away. anticipating 'the big crash' or the 'flaming inferno' that consumes an entire business is bad enough to plan for, insurance and such, but the void left by having no link to business contact/history/or any sort of continuity is plainly like committing suicide.
      Nancy Smith
  • power faults??

    Power faults should not be an issue either in a battery run PC or in a server environment with proper power backup. So are the people who lost data not following best practices for power integrity?
    mswift@...