DRAM errors: from soft to hard

Every system uses dynamic random access memory (DRAM), but how good is it? Bad news: not nearly as good as vendors would like us to think. Good news: we're learning.

Research a few years ago (see Nightmare on DIMM street) found that DRAM error rates were hundreds to thousands of times higher than vendors had led us to believe. But what is the nature of those errors? Are they soft errors - as is commonly believed - where a stray alpha particle flips a bit? Or are they hard errors, where a bit gets stuck?

Errors soft and hard

If they're soft and random, there's not much we can do. But if they're hard, there may be things we can do to lessen their impact while operating more efficiently.

According to University of Toronto researchers (see Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design by Andy Hwang, Ioan Stefanovici and Bianca Schroeder), who looked at tens of thousands of processors at Google and several national labs, hard errors are common, but their nature isn't binary either. Memory locations can become error-prone without being permanently stuck, perhaps because they are sensitive to access patterns.
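If hard errors tend to repeat at the same physical location, one practical response is for the OS to stop using memory pages that err more than once. Here's a minimal sketch of that idea; the page size and retirement threshold are illustrative assumptions, not values from the paper:

```python
# Hedged sketch: retire a physical page after repeated ECC errors.
# PAGE_SIZE and RETIRE_THRESHOLD are assumed for illustration.

from collections import Counter

PAGE_SIZE = 4096          # bytes; typical, but an assumption here
RETIRE_THRESHOLD = 2      # retire after a repeat error on the same page

error_counts = Counter()
retired_pages = set()

def record_ecc_error(phys_addr):
    """Log a corrected ECC error; return True if the page gets retired."""
    page = phys_addr // PAGE_SIZE
    if page in retired_pages:
        return False                  # already out of service
    error_counts[page] += 1
    if error_counts[page] >= RETIRE_THRESHOLD:
        retired_pages.add(page)       # stop allocating from this page
        return True
    return False

record_ecc_error(0x1000)                  # one-off error: keep watching
assert record_ecc_error(0x1FFF) is True   # same page errs again: retire it
```

A soft, random error trips the counter once and is forgiven; a hard, repeating error gets the page pulled from service - which is only worth doing because, as the study found, errors cluster at specific locations.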

The research looked at 4 systems: IBM Blue Gene/L (BG/L) at LLNL; Blue Gene/P (BG/P) at Argonne National Laboratory; an HPC cluster at SciNet; and 20,000 Google servers. The Google systems weren't as well instrumented as the others, so some errors were conservatively estimated. The 2 most interesting details in the study:

  • Most errors are concentrated in the most-used memory areas: where the key OS programs run.
  • Google declined to release its database of DRAM errors to other researchers. Why? I have some ideas.

How this affects you

In most consumer PCs - including all Macs except the soon-to-die Mac Pro - there is no DRAM error-correcting code (ECC). ECC memory costs more, and vendors have learned that consumers don't care and won't pay for it.

BSOD or system lockup? Could have been a memory error in critical system code.

Workstations, servers and supercomputers commonly have some level of ECC, ranging from common single-bit detect-and-correct schemes to much more sophisticated - and costly - "chipkill" modules that can survive the loss of an entire memory chip. When you're running a 6-month simulation on one of the world's most powerful computers with many terabytes of DRAM, you don't want a single chip failure to hose the job.
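The single-bit correction mentioned above can be sketched with the simplest classic example, a Hamming(7,4) code: 3 parity bits protect 4 data bits, enough to locate and flip any one corrupted bit. Real server ECC uses wider codes (typically SECDED over 64-bit words), but the principle is the same:

```python
# Hedged sketch of single-bit ECC using a Hamming(7,4) code.
# Function names are illustrative, not from any real ECC library.

def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4            # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4            # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4            # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7

def hamming74_correct(code):
    """Locate and fix a single flipped bit; return the 4 data bits."""
    c = code[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3   # 0 = clean, else bad bit's position
    if syndrome:
        c[syndrome - 1] ^= 1          # flip the bad bit back
    return [c[2], c[4], c[5], c[6]]

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[5] ^= 1                              # simulate a single-bit DRAM error
assert hamming74_correct(code) == word    # ECC recovers the original data
```

A second simultaneous flip defeats this code - which is why server memory adds an extra parity bit to at least detect double errors, and why chipkill goes further still.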

The Storage Bits take

Given the transient nature of even "hard" errors, few consumers will ever know they have a DRAM problem. If it only happens every few weeks they'll curse, reboot and forget it.

The issue is important to cloud providers, though, because they have the economic incentive to do something about it. When you've got millions of servers, the money, energy and performance cost of ECC adds up.

It is telling that Google declined to release the data the research team collected. An obvious inference: they intend to use the data to improve their competitive position - and they want to be 1st to market.

Eventually, given Google's buying clout, the rest of us will benefit, if only in more reliable cloud services and better server memory designs.

But don't look for a fix in your next PC: vendors can't afford it.

Comments welcome, of course. I did a longer version of this post over at StorageMojo, if you'd like more details.