How does flash storage fail?

Summary: The flash failure mode is odd: when most things break you lose their contents. But when flash fails your data is still there. Did you ever wonder why?

The flash storage industry seems to have a code of silence on NAND flash technical issues. Search online for how NAND flash fails and you won't come up with much.

I recently did a video for LSI, the longtime maker of RAID and SCSI controllers and adapters. In it I interviewed LSI system architect and corporate fellow Robert Ober, who holds dozens of patents and is deeply knowledgeable about storage technologies.

How does flash work?
Flash stores an electrical charge in a quantum well in a floating gate transistor. The floating gate name comes from the fact that this gate is electrically isolated: one insulating layer of oxide separates it from the channel between the source and drain, and a second separates it from the control gate above. The gate "floats" between the two insulating layers.

[Figure: A floating gate NAND flash cell]

The quantum well in the floating gate stores an electrical charge. In single level cell (SLC) flash either the absence or presence of charge gives us a single binary digit.

In multi level cell (MLC) flash there are four levels of charge corresponding to two binary digits. And in triple level cell (TLC) flash there are eight levels of charge corresponding to three binary digits.
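
The arithmetic behind those numbers is simple: a cell that stores n bits has to distinguish 2^n charge levels. Here is a minimal Python sketch of that relationship. The names CELL_TYPES and charge_levels are mine, made up for illustration, and are not from any flash vendor's documentation.

```python
# Illustrative only: how many charge levels a cell must distinguish
# in order to store a given number of bits.
CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3}  # bits stored per cell


def charge_levels(bits_per_cell: int) -> int:
    """A cell storing n bits must distinguish 2**n charge levels."""
    return 2 ** bits_per_cell


levels = {name: charge_levels(bits) for name, bits in CELL_TYPES.items()}

for name, count in levels.items():
    print(f"{name}: {CELL_TYPES[name]} bit(s) per cell, {count} charge levels")
```

Each extra bit doubles the number of levels that have to fit into the same cell.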

It takes about 20 V to write a flash cell. This voltage is created by on-chip pumps, which is why flash chips do not require a 20 V input.

With each write the high voltage places more charge into the insulating layers that protect the floating gate. As the charge in the insulating layers grows it takes longer and longer to write the cell.

Eventually, a write is no longer possible. When that happens the existing data cannot be overwritten and is therefore preserved.

That's what flash "failure" looks like.
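
To make the wear-out behavior described above concrete, here is a toy Python model of a single cell. It is only a sketch under loud assumptions: the class ToyFlashCell, the linear slowdown, and every number in it (including the cutoff after which a write is abandoned) are hypothetical and chosen for readability, not taken from the video or from any real flash part.

```python
from dataclasses import dataclass


@dataclass
class ToyFlashCell:
    """Toy model of one flash cell. All units and numbers are invented."""
    value: int = 0
    trapped_charge: int = 0      # charge stuck in the insulating layers
    base_write_time: int = 100   # hypothetical write time of a fresh cell
    slowdown_per_unit: int = 1   # hypothetical extra time per unit of trapped charge
    max_write_time: int = 200    # hypothetical point at which a write is abandoned

    def write_time(self) -> int:
        # More trapped charge means a slower write.
        return self.base_write_time + self.trapped_charge * self.slowdown_per_unit

    def write(self, new_value: int) -> bool:
        if self.write_time() > self.max_write_time:
            return False             # worn out: the write fails, old value stays
        self.value = new_value
        self.trapped_charge += 1     # every successful write traps a bit more charge
        return True

    def read(self) -> int:
        return self.value            # reads keep working in this model


cell = ToyFlashCell()
writes = 0
while cell.write(writes):            # keep overwriting until the cell wears out
    writes += 1

last_good = cell.read()
overwrite_ok = cell.write(999)       # try one more write on the worn-out cell

print("successful writes:", writes)
print("overwrite accepted:", overwrite_ok)
print("data still readable and unchanged:", cell.read() == last_good)
```

The point of the model is the last line: once writes stop working, the cell still returns the last value that was successfully written.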

Other issues with flash
The video talks about more than how flash fails. For example, how should we think about the fact that flash is a wearing medium? How does write amplification work? What is the impact of data compression on write amplification?

These are among the other major issues addressed in the LSI video.

The Storage Bits take
Given the deep impact that NAND flash has had on the storage industry in the last five years, it is surprising how little is generally known about the technology. Some of this is due to the desire to maintain trade secrets, but much of it has to do with a fear that knowledge would make users less comfortable about flash.

And users are uncomfortable about flash, though the discomfort is slowly dissipating with greater experience.

For example, early on the industry claimed that flash-based SSDs were much more reliable than disk drives. That wasn't completely true.

Yes, the most reliable SSDs are somewhat more reliable than HDDs, but products that vendors threw together from spot-market components often turned out to be much less reliable than disk drives.

Today we don't have any good alternatives to NAND flash, so the questions about how it works and how well it works are somewhat academic. But as new persistent storage technologies – such as resistance RAM technologies – come to market these issues will become more important to technical decision-makers.

Comments welcome, as always. LSI paid to create the video, but not for this post. What questions do you have about how flash works?

Talkback

  • Flash...

    I've got a Dell Mini 9 netbook with a 3rd party SSD. Worked fine for a while but then OS corruption started occurring on a very frequent basis. I suspect the SSD has gone bad.

    OTOH, I have a 30 GB iPod that I bought Jan 3, 2006 (I'm looking at the receipt right now). I've dropped it a few times, left it plugged into the iPod adapter in my car on many a hot day, and still it keeps chugging along. 7 long years it has served me without fault. Never even had to do a system restore. I keep expecting to turn it on one day and see the sad mac face (or whatever an iPod does when the HD fails) but instead it just obediently comes to life and does its thing.

    Would I trade the flash memory in my various Android tablets and phone for a mechanical hard drive? Not on your life. I love being able to just throw the thing on my bed from across the room when I get home. But I do worry about flash memory life cycles.
    dsf3g
  • Write failures not so bad?

    Thanks for the interesting article.

    In thinking about this, it occurs to me that this type of failure may not be quite so bad. Maybe you could comment on it.

    I say that because I'm sure that if a write fails, the drive will mark the sector as bad and write the data someplace else on the drive. So the data eventually gets written, and the capacity of the drive shrinks by one sector. Certainly no disaster.

    The BAD kind of failure is if the drive cannot READ the data in a sector. That's the kind of failure that can be so frustrating on spinning hard disks. From your writeup it seems that the data can always be read from a sector, even if writes are no longer possible.

    So the main downside I see here is a slowly-shrinking amount of storage space. Am I missing something?
    Speednet
    • I was hoping Robin could respond...

      This was a serious question.
      Speednet
    • You are correct.

      Speednet, based on my discussions with insiders, the more serious problem than flash wear is entire plane failures. Each flash die (chip) has 2 planes and if one fails half the capacity is gone - as well as some bandwidth. That's why overprovisioning is important and some kind of RAID-like scheme is used to recover the lost data.

      As for shrinkage, the issue is that a few bad pages mean the entire block - 256KB or more - has to be retired. But the plane failure problem is much more serious from a capacity and data integrity standpoint.

      Robin

      PS I did answer this earlier, but somehow the comment never made it into the comments. Don't know why.
      R Harris
      • Thanks!

        Thanks, more interesting stuff! You're right when you say that people have no idea how the SSD storage works. I, for one, appreciate your inside info about it, since there is so much misinformation out there.
        Speednet
  • As far as I know...

    ...Speednet's observation is correct.

    What puzzles me is why there is no way to "drain" the insulating layers, and restore writeability.
    GrizzledGeezer
    • Annealing

      Apparently there is a way:
      http://spectrum.ieee.org/semiconductors/memory/flash-memory-survives-100-million-cycles
      lonniemcclure
      • Not commercially viable - and here's why

        Lonnie,

        I wrote about this (http://www.zdnet.com/self-healing-flash-for-infinite-life-7000008182/) last fall. As I noted in the Storage Bits take:

        "Macronix says that no commercial self-healing chip is imminent. And there's good economic reasons for that, despite the elegance of their design.

        NAND flash markets are dominated by the ferocious cost pressures of high volume consumer products. About 95% of the flash market doesn't care about self-healing flash, so there's no way to justify the extra cost of their solution.

        With well-engineered solutions, MLC flash can meet stringent enterprise requirements. The Macronix solution may become viable some day, but it isn't needed today."

        Nifty trick, but ReRAM is currently best positioned as the next gen of non-volatile memory.

        Robin
        R Harris
  • PROM becomes ROM

    Flash memory was originally developed to store settings and programs that rarely change, like BIOS settings that do not require the onboard button battery (although, naturally, the CLOCK still requires it), and BIOS programs that do not require replacing a chip. It has become so handy for standard read-write usage that we forget we are pushing it beyond its design limitations. A better analogy to a removable flash drive is a "reprogrammable" CD ROM which can be "patched" occasionally but otherwise retains its original data in a read only mode. If you use the flash drive in such a way that you SELDOM update it, it will last a lot longer. This is why defrag is not recommended, since defrag rewrites ALMOST ALL physical sectors without changing the LOGICAL contents at all.
    jallan32
  • Are we ready yet?

    Thanks for the article.

    "But as new persistent storage technologies – such as resistance RAM technologies – come to market these issues will become more important to technical decision-makers."

    - Do you actually believe that the alternative technologies (PCM, MRAM or NVDIMM) will become a reality soon, given that a) they score low on price:performance and b) there is a lot of marketing push for NAND flash? I understand that the technology (like NVDIMM from Viking/Micron) provides far better performance than that of, say, a PCIe SSD. But do customers need more performance? Are they willing to pay more? What are your thoughts?
    - Amit
    sahaamity
  • There is failure information out there

    Good article, but there's not a total code of silence. Here's an excellent document that discusses not only failure modes but ways to prevent them:
    http://www.spansion.com/Support/Application%20Notes/Overview_Embedded_System_Design_UG.pdf
    Disclaimer: I work for Spansion :>)
    timcarp1964