DRAM errors: from soft to hard

Summary: Every system uses dynamic random access memory (DRAM), but how good is it? Bad news: not nearly as good as vendors would like us to think. Good news: we're learning.

TOPICS: Storage

Research published a few years ago (see Nightmare on DIMM street) found that DRAM error rates were hundreds to thousands of times higher than vendors had led us to believe. But what is the nature of those errors? Are they soft errors - as is commonly believed - where a stray alpha particle flips a bit? Or are they hard errors, where a bit gets stuck?

Errors soft and hard

If they're soft and random, there's not much we can do. But if they're hard, there may be things we can do to lessen their impact while operating more efficiently.

According to University of Toronto researchers (see Cosmic Rays Don’t Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design by Andy Hwang, Ioan Stefanovici and Bianca Schroeder), who looked at tens of thousands of processors at Google and several national labs, hard errors are common, but their nature isn't binary either. Memory locations can become error prone without being permanently stuck, perhaps because they are sensitive to access patterns.
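To make the soft-versus-hard distinction concrete, here is a minimal Python sketch that splits logged errors into locations that failed once and locations that failed repeatedly. It is not the researchers' method: the log format, the node names, the addresses and the function `classify_error_locations` are all hypothetical, chosen only to illustrate the idea that a truly random soft error should rarely hit the same cell twice.

```python
from collections import Counter


def classify_error_locations(events):
    """Split error locations into one-off and repeating sets.

    events: iterable of (node_id, address) tuples, one per logged error.
    The tuple layout is an assumption made for this sketch.
    """
    counts = Counter(events)
    one_off = {loc for loc, n in counts.items() if n == 1}
    repeating = {loc for loc, n in counts.items() if n > 1}
    return one_off, repeating


# Hypothetical error log: node-a keeps failing at the same address.
log = [
    ("node-a", 0x1F40),
    ("node-a", 0x1F40),
    ("node-b", 0x0A10),
    ("node-a", 0x1F40),
    ("node-c", 0x7720),
]

one_off, repeating = classify_error_locations(log)
print("seen once:", sorted(one_off))
print("seen repeatedly:", sorted(repeating))
```

A location in the repeating set is a candidate hard or intermittent fault; one that shows up only once looks more like the classic stray-particle soft error.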

The research looked at 4 systems: IBM Blue Gene/L (BG/L) at LLNL; Blue Gene/P (BG/P) at Argonne National Laboratory; an HPC cluster at SciNet; and 20,000 Google servers. The Google systems weren't as well instrumented as the others, so some errors were conservatively estimated. The 2 most interesting details in the study:

  • Most errors are concentrated in the most used memory areas: where the key OS programs run.
  • Google declined to release its database of DRAM errors to other researchers. Why? I have some ideas.

How this affects you

In most consumer PCs - including all Macs except the soon-to-die Mac Pro - there is no DRAM error correction code (ECC). ECC memory costs more and vendors have learned that consumers don't care and won't pay for it.

BSOD or system lockup? Could have been a memory error in critical system code.

Workstations, servers and supercomputers commonly have some level of ECC, ranging from the common ability to detect and correct single-bit errors to much more sophisticated - and costly - "chipkill" modules that can survive the loss of an entire memory chip. When you're running a 6-month simulation on one of the world's most powerful computers with many terabytes of DRAM, you don't want a single chip failure to hose the job.
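For a feel of what basic ECC does, here is a toy Python sketch of a single-error-correcting, double-error-detecting (SEC-DED) code: an extended Hamming code that protects 4 data bits with 4 check bits. Real ECC DIMMs typically protect 64 data bits with 8 check bits and do the work in hardware, and chipkill is more elaborate still, so treat this as an illustration of the principle only. The function names `encode` and `decode` are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SEC-DED codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, 1..7 = Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes total parity even
    return code


def decode(code):
    """Return (nibble, status): status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions; 0 for a valid word
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: assume one, and the syndrome names its position
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1  # simulate a single flipped bit
print(decode(word))
```

Flip any one bit and the decoder recovers the original value; flip two and it can only report that something is wrong. That gap between correcting and merely detecting is why large systems pay for stronger schemes.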

The Storage Bits take

Given the transient nature of even "hard" errors, few consumers will ever know they have a DRAM problem. If it only happens every few weeks they'll curse, reboot and forget it.

The issue is important to cloud providers, though, because they have the economic incentive to do something about it. When you've got millions of servers, the money, energy and performance hit of ECC adds up.

It is telling that Google declined to release the data the research team collected. An obvious inference: they intend to use the data to improve their competitive position - and they want to be 1st to market.

Eventually, given Google's buying clout, the rest of us will benefit, if only in more reliable cloud services and better server memory designs.

But don't look for a fix in your next PC: vendors can't afford it.

Comments welcome, of course. I did a longer version of this post over at StorageMojo, if you'd like more details.

  • DRAM errors: from soft to hard

    nothing much can be done about it. mitigating these errors with ecc can only do so much. even the best intentions of dram manufacturers to minimize errors in their products will not solve the problem in its entirety, but it is good to know that somebody is trying to call everybody's attention to this vexing problem that has gone under the radar for a long time. true to form, the cloud will put this problem in the forefront because of the high stakes involved, considering the potential losses to not some but all consumers of cloud computing.
  • Robin, Mac Pro is not going to die; Cook confirmed that totally new ...

    Mac Pro will debut next year.
  • my main system's

    BIOS has inbuilt support for ECC. Though i don't have ECC modules installed, at least i have the option.

    It's not nearly as common to have a mainboard that is built to provide ECC, but if a person does have an ECC-capable system - and access to ECC RAM modules - then by all means take advantage of it.
    • AMD has it

      Most if not all AMD processors (and many, many motherboards for AMD processors) support ECC; you just need to buy the RAM. You can find good offers in the second-hand market.