How SSD power faults scramble your data

Summary: Flash SSDs are non-volatile, so what could go wrong when power fails? A great deal, even on high-end 'enterprise' SSDs.

We've got over 50 years of experience with spinning disks in all kinds of conditions, ranging from notebooks to massive big iron arrays. SSDs, not so much. And boy, do we have a lot to learn.

Despite billions of dollars spent on backup power batteries and generators, power failures at major datacenters are not uncommon — just ask Netflix — so this is a real issue. Given proprietary Flash Translation Layers (FTL), there's no easy way to understand SSD behavior without testing.

In Understanding the Robustness of SSDs under Power Fault (PDF), researchers Mai Zheng and Feng Qin of Ohio State and Mark Lillibridge and Joseph Tucek of HP Labs look at how power faults affect flash-based SSDs. Short answer: It's not pretty.

The research

The team developed hardware to inject power faults and software to stress devices and check post-fault consistency. These were used to check 15 different SSDs and two hard drives.

The authors looked for several types of errors, including bit corruption, shorn writes, metadata corruption, and dead (bricked) devices. Write data was configured to enable detection of these and other errors.
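
To make the idea of self-checking write data concrete, here is a minimal sketch in Python. It is not the researchers' actual record format: the 4 KB record size, the 512-byte sector size, the header layout and the function names are all assumptions made for illustration. Each record carries a checksum, a sequence number and its target block address, so a checker can tell after a fault whether a block holds the old data, the new data, a sector-aligned mix of the two (a shorn write) or something else entirely (bit corruption).

```python
import struct
import zlib

SECTOR = 512        # assumed sector size in bytes
RECORD_SIZE = 4096  # assumed record size: eight sectors
HEADER = struct.Struct("<IQQ")  # CRC32, sequence number, block address


def make_record(seq: int, block: int) -> bytes:
    """Build a record whose payload can be regenerated from (seq, block)."""
    seed = struct.pack("<QQ", seq, block)
    body_len = RECORD_SIZE - HEADER.size
    payload = (seed * (body_len // len(seed) + 1))[:body_len]
    crc = zlib.crc32(seed + payload)
    return HEADER.pack(crc, seq, block) + payload


def verify(record: bytes) -> bool:
    """Return True if the stored checksum matches the record's contents."""
    crc, seq, block = HEADER.unpack_from(record)
    seed = struct.pack("<QQ", seq, block)
    return zlib.crc32(seed + record[HEADER.size:]) == crc


def classify(on_disk: bytes, old: bytes, new: bytes) -> str:
    """Compare what was read back with the previous and the intended record."""
    if on_disk == new:
        return "new"
    if on_disk == old:
        return "old"
    # A shorn write leaves every sector intact, but some sectors come from
    # the old record and some from the new one.
    offsets = range(0, RECORD_SIZE, SECTOR)
    if all(on_disk[i:i + SECTOR] in (old[i:i + SECTOR], new[i:i + SECTOR])
           for i in offsets):
        return "shorn"
    return "corrupt"
```

In this sketch a checker would regenerate `old` and `new` with `make_record` from its own log of issued writes, read the block back after power is restored, and pass all three to `classify`. `verify` covers the case where the log is unavailable and only the record itself can be inspected.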

Three workloads — concurrent random writes, concurrent sequential writes, and single-threaded sequential writes — maximized the SSD's internal workloads. SSDs have several background tasks, such as garbage collection, running constantly to keep the SSD ready and organized.

Tested SSDs

15 different SSDs — 10 different models from five vendors — were tested. Prices ranged from 63¢/GB to $6.50/GB using both MLC and SLC flash. Two hard drives, one low end and one high end, were also tested.

Vendor names were not revealed.

Results

The good news: Of six expected failures, only five were observed; and two of the devices behaved as expected. The bad news: 13 of the devices had poor failure behavior.

Every failed device lost some amount of data or became massively corrupted under power faults.

Bit corruption hit three devices; three had shorn writes; eight had serializability errors; one device lost one third of its data; and one SSD bricked. The low-end hard drive had some unserializable writes, while the high-end drive had no power fault failures.
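
Serializability errors are the least intuitive item on that list, so here is a rough Python sketch of what such a check could look like. It is an illustration under stated assumptions, not the paper's checker: it assumes every write carries a strictly increasing positive sequence number, that the tester keeps a log of writes in the order they completed, and that after the fault it can read back which sequence number each block holds. The function name and data shapes are hypothetical. The rule applied is simple: if a later write survived, every write that completed before it should have survived too.

```python
def find_unserializable(write_log, on_disk):
    """Return the blocks whose contents are older than a surviving later write.

    write_log: list of (seq, block) tuples in completion order.
    on_disk:   dict mapping block -> sequence number found after the fault.
    """
    # The newest write that made it to the media.
    newest = max(on_disk.values(), default=0)

    # Replay every write that completed at or before that point.
    expected = {}
    for seq, block in write_log:
        if seq <= newest:
            expected[block] = seq

    # Any block holding something older than the replay is a violation.
    return sorted(block for block, seq in expected.items()
                  if on_disk.get(block, 0) < seq)


log = [(1, 10), (2, 11), (3, 10), (4, 12)]
print(find_unserializable(log, {10: 3, 11: 2}))          # last write lost: still consistent
print(find_unserializable(log, {10: 1, 11: 2, 12: 4}))   # write 4 survived, write 3 did not
```

The first call describes a drive that simply lost the final write, which leaves a consistent earlier state. The second describes a drive that kept a later write while dropping an earlier one, the kind of reordering a journaling filesystem or database cannot tolerate.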

The two SSDs that had no failures? Both were MLC 2012 model years with a mid-range — $1.17/GB — price.

The Storage Bits take

Because it is persistent, storage is the hardest part of IT infrastructure. There are myriad ways data gets scrambled.

This paper reminds us that SSDs are very new technology, with idiosyncrasies still being engineered around. We're still five years away from the average enterprise SSD being as reliable as the average enterprise hard drive is today.

Home and small office SSD users would be wise to have a battery backup on critical servers and desktops. Notebooks, of course, already have a battery backup.

Comments welcome, as always. The paper was presented at FAST 13. Have you seen any power-related SSD problems?

Talkback

23 comments
  • I have not seen any problems, ...

    but then I am still using spinning disks. ;-)

    Thanks for the heads up. I will put off any SSD upgrades a bit longer. I still run notebooks on mains power with no battery fairly often.
    D.T.Long
  • Having written NVM device drivers for safe applications...

    Low power and loss of power while writing to the NVM chip are among the biggest headaches I face. The worst is when you still have 200ms of data to write out (typically due to poor app level design/resource management) and you have only 50ms to write out critical level data and shut everything down in an orderly way.

    I have found you do not want to have any pending operations (clocking in data or a page write in progress) when power finally does leave. Once in a while (not often but enough to cause pain), you will get all sorts of nasty stuff happen.
    Bruizer
  • Just had an experience...

    I sent my not-even-one-year-old Lenovo U300s to have the SSD replaced after it failed. Luckily I use Carbonite to back it up, or lots of files would have been obliterated in less than a second...
    Soapy Buoy
    • Yep

      I use Backblaze. No matter the tool, I tell everyone to backup automatically. It's not "if" a storage device will fail, it's "when" a storage device will fail.
      Regulator1956
      • That old saying......

        There are two types of drives. Those that HAVE failed and those that WILL fail.

        Act accordingly.
        D.T.Long
  • With any storage medium

    There is a risk of data loss. Agreed, we do have much to learn about the reliability of SSD's, but to me, the benefits they offer in portable devices outweigh the risks shown here. I've replaced my spinners with SSD on my netbook and laptop, and am delighted with the increase of battery life and speed. It will be a long time before I do that to my desktop machines, though. As always, no matter what type drive you are using, there are three words to remember: Backup, backup, backup!
    mike.motes@...
    • backup ^ 2

      yep, nothing feels better than knowing your hard info effort is stashed safely away. anticipating 'the big crash' or the 'flaming inferno' that consumes an entire business is bad enough to plan for, insurance and such, but the void left by having no link to business contact/history/or any sort of continuity is plainly like committing suicide.
      Nancy Smith
  • power faults??

    Power faults should not be an issue either in a battery run PC or in a server environment with proper power backup. So are the people who lost data not following best practices for power integrity?
    mswift@...
  • lots more backup, that's all

    Early HD days saw its share of issues too. I come to expect that of any new technology. Not going to skip it because I have a fear something might happen. If I did I would still be using the old B&W cellphones.

    Since partition cloning is so dang fast and reliable (I'm using EaseUS Todo), I'll just do more regular scheduled clones.
    rengek
  • Yes, this problem is in its infancy

    I totally agree that this problem will get worse. I think that the SSDs of today are a bit better, but there are a lot of cheaper ones whose longevity I seriously question. Even the MacBook Air I have had for a little over 3 years has been showing signs of memory corruption, and if nothing else its response time has seriously slowed. Of course with a MacBook Air you get what you get and that's it. My current PC has a 7200rpm spinner and I am quite happy with it. SSDs can wait for a while before I will try them again.
    jscott418-22447200638980614791982928182376
  • Not real world at all

    IMHO, their testing methodology is somewhat flawed because of how they cut the power to the devices that they were testing. Most computers use switching power supplies, and by their nature they don't drop power like falling off of a cliff. Instead, the power loss is much more gradual and of a longer duration. This may give the devices more time to complete their queued activities before power is fully lost.

    Test a standard PC switching power supply to see how long it maintains 5v and 12v rails after input power is cut and you will see what I am talking about.

    While I agree that there is some risk, I believe that it is far less than what their testing may lead one to conclude.
    corton
  • FTLs Just Cause More Problems Than They Solve

    As I understand it, the whole point of the FTL is to make the flash storage look more like a disk, so that it can be formatted with filesystems designed for disks.

    This is a bad idea. There are already log-structured filesystems (think conventional filesystem+journal, then get rid of the filesystem and just keep the journal), that are tailored to cope with the physical peculiarities of flash storage: using them directly, rather than through an FTL, would be so much more efficient and more reliable.

    Why aren't people doing this?
    ldo17
    • I thought the whole point of FTL...

      Was to get to other stars in a reasonable time span.
      TheWerewolf
    • FTL

      I can't claim to know all the technical details, but perhaps it's due to SATA and x86 standards of addressing via LBAs (Logical Block Addresses)? Perhaps there's no standardized/exposed method of dealing w/the flash blocks and pages directly? Or perhaps, nobody wants to expose it or force that extra complexity onto OS writers?

      Read http://www.anandtech.com/show/2738/5 and a bunch of pages after that. I read thru this long ago.
      ac3_z
      • Re: or force that extra complexity onto OS writers?

        The Linux kernel already includes several filesystems that are specifically designed to cope with the special characteristics of flash storage. So no, this is not some exotic new situation to OS writers.
        ldo17
        • Re: the Linux kernel

          These are intended for embedded devices, where you know exactly the layout and specification of the flash storage. None of this is workable for mainstream drives, that might change these things with each firmware change.

          Most importantly, these drives are intended as drop-in replacements for spinning disks, so they have to emulate LBA addressing.

          Further, these direct flash filesystems are designed for mostly read-only environments in embedded systems. None of these filesystems has received nearly as much testing and bug fixes as even the most obscure mainstream filesystem.

          On the other hand, it might be that vendors of mass-produced embedded systems like Apple have already tweaked their HFS+ file system to deal directly with the underlying flash storage, without using any kind of LBA-translating disk controller. But they definitely get to choose the flash controller and the storage cell layout.
          danbi
          • Re: intended as drop-in replacements for spinning disks

            Precisely my point: they shouldn't be.
            ldo17
  • Thanks Robin

    This probably explains why my friend's laptop is screwed up.

    He has some weird power fault (if his laptop flexes, it switches off) and his SSD and OS are corrupted.
    It's only 18 months old.
    lehnerus2000
  • easy solution

    Just as with any storage media, use ZFS and have good frequent backups.
    danbi
  • ssd

    I have had 2 ssd failures after power failures. 1 Crucial M4 which lost all data and could not be recognized by the BIOS and a Samsung 128 GB that corrupted the boot sector but did not lose data. Luckily I had backup.
    yagijd