DRAM error rates: Nightmare on DIMM street

Written by Robin Harris, Contributor

A two-and-a-half-year study of DRAM on tens of thousands of Google servers found that DIMM error rates are hundreds to thousands of times higher than previously thought: a mean of 3,751 correctable errors per DIMM per year.

This is the world's first large-scale study of RAM errors in the field. It looked at multiple vendors, DRAM densities and DRAM types including DDR1, DDR2 and FB-DIMM.

Every system architect and motherboard designer should read it carefully.

If you can’t trust DRAM . . .

Here are some hard numbers from "DRAM Errors in the Wild: A Large-Scale Field Study" by Bianca Schroeder of the University of Toronto, and Eduardo Pinheiro and Wolf-Dietrich Weber of Google.

The Google servers use ECC DRAM that typically corrects single-bit errors and reports double-bit errors. It is a rare notebook or consumer desktop that supports ECC.

You could be having DRAM problems and not know it because even the system doesn't know.
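To make "corrects single-bit errors and reports double-bit errors" concrete, here is a toy sketch of that behavior using an extended Hamming code over 4 data bits. This is an illustration only: real ECC DIMMs protect much wider words in hardware, and the function names `encode` and `decode` are made up for this example, not taken from the study.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit codeword (list of 0/1 values).

    Index 0 holds an overall parity bit; indices 1..7 follow the classic
    Hamming layout (parity at 1, 2, 4; data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total count of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # XOR of the positions of all set bits is zero for a valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 means an odd number of bits flipped

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Treated as a single flip: the syndrome names the position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks even but the syndrome is nonzero: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

Flip any one bit of `encode(n)` and `decode` hands back the original value with status `corrected`; flip any two and it can only report the problem. Without the check bits, neither case would be noticed at all.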

Non-ECC DRAM is more common

Most DIMMs don’t include ECC because it costs more. Without ECC the system doesn’t know a memory error has occurred.

Everything is fine until the data corruption means a missed memory reference, an incorrect value or a flipped bit in a file being written to disk. What you see is a “file not found” or a “file not readable” message or, worse yet, silent data corruption - or even a system crash. And nothing says “memory error.”

Conventional Wisdom

The industry take on DRAM is summed up in a quote from an old AnandTech FAQ that took the industry at its word:

Everyone can agree that hard errors are fairly rare. . . . For the frequency of soft errors. . . . IBM stated . . . that at sea level, a soft error event occurs once per month of constant use in a 128MB PC100 SDRAM module. Micron has stated that it is closer to once per six months . . . .

An even bigger surprise: it appears that hard errors, not soft errors, are the dominant error mode - the reverse of the conventional wisdom.
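A back-of-the-envelope comparison shows where "hundreds to thousands of times higher" comes from. This sketch simply divides the study's mean by the two vendor figures in the quote above. It is rough by construction: the vendor numbers describe soft errors on a 128MB module, while the study mean counts all correctable errors across DIMMs of various capacities. The variable names are my own.

```python
# Figures as quoted in this article.
MEAN_CE_PER_DIMM_PER_YEAR = 3751   # Google study: mean correctable errors
IBM_ERRORS_PER_YEAR = 12           # "once per month" of constant use
MICRON_ERRORS_PER_YEAR = 2         # "once per six months"

# How many times higher the observed mean is than each vendor estimate.
ibm_ratio = MEAN_CE_PER_DIMM_PER_YEAR / IBM_ERRORS_PER_YEAR
micron_ratio = MEAN_CE_PER_DIMM_PER_YEAR / MICRON_ERRORS_PER_YEAR

print(f"vs. IBM figure:    about {ibm_ratio:.0f}x")
print(f"vs. Micron figure: about {micron_ratio:.0f}x")
```

That works out to roughly 300 times the IBM figure and nearly 1,900 times the Micron figure.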

Good news

The study had several findings that are good news for consumers:

  • Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn’t necessary.
  • The problem isn’t getting worse. The latest, most dense generations of DRAM perform as well, error wise, as previous generations.
  • Heavily used systems have more errors - meaning casual users have less to worry about.
  • No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMs they investigated.
  • On average, only 8% of DIMMs had errors in a given year. Fewer DIMMs = fewer error problems - good news for users of smaller systems.

But something to think about for large-memory servers running, say, in-memory databases.
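To see why DIMM count matters, here is a rough sketch of the chance that at least one DIMM in a machine logs errors in a year, using the 8% figure above. It assumes each DIMM is affected independently, which is my simplification and not a claim from the study; since the study found errors cluster on particular machines and motherboards, treat the result as a ballpark only. The function name is hypothetical.

```python
def p_any_dimm_with_errors(n_dimms, p_per_dimm=0.08):
    """Chance that at least one of n_dimms sees errors in a year.

    Assumes DIMMs fail independently (a simplification), so the chance
    that none is affected is (1 - p_per_dimm) ** n_dimms.
    """
    return 1.0 - (1.0 - p_per_dimm) ** n_dimms


# Compare a small desktop with progressively larger servers.
for n in (2, 4, 8, 16):
    print(f"{n:2d} DIMMs: {p_any_dimm_with_errors(n):.0%}")
```

Under that assumption, a two-DIMM desktop has about a 15% chance of seeing an affected DIMM in a year, while a 16-DIMM server is more likely than not to have one.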

Bad news

Besides error rates much higher than expected - which is plenty bad - the study found that error rates depended on the motherboard, not on DIMM type or vendor. This means that some popular mobos have poor EMI hygiene. Route a memory trace too close to a noisy component or skimp on grounding layers and you get instant error problems.

Hard errors are much more common than expected as well and may be the most common type of memory failure. Google replaces all DIMMs with hard errors - as do most data centers - as a matter of policy.

Other interesting findings

For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform. There be lemons out there!

In more than 93% of the cases a machine that sees a correctable error experiences at least one more in the same year. They don’t get better by themselves.

High-quality error correction codes are effective in reducing uncorrectable errors. There are “chip-kill” DIMM/mobo combinations that can detect and correct 4-bit errors, but few vendors make those.

Besides costing more, ECC DIMMs are about 3-5% slower than unprotected DIMMs. Few of us would ever notice that small a performance hit, but gamers might care.

The Storage Bits take

You’d think that, given several decades of semiconductor DRAM usage, this study would be old news. I did.

Like most folks I accepted industry assurances that DRAM is reliable. My main machine today uses power-hungry fully-buffered ECC DIMMs.

But I was surprised when I checked the memory section of "About this Mac" and discovered that one of my six 2GB DIMMs was reporting correctable memory errors. Time to see if the “lifetime” warranty means anything.

I suspect this is another example of the industry’s code of omerta. Big system vendors have scads of data on disk drives, DRAM, network adapters, OSes and filesystems, based on mortality and tech support calls, but do they share it with the consuming public? Nothing to see here folks, just move along.

Kudos to Google for doing the long-term research required for substantive results and then sharing those results with the rest of us. Data is what makes your computer YOUR computer, and it is worth protecting. Forking over a bit more for ECC mobos and DIMMs may be worth it for serious users.

I expect ECC systems will become a lot more popular in the years ahead.

Comments welcome, of course. Can someone please document how to access ECC error reporting on Windows and Linux machines too? Thanks.
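As a starting point on the Linux side, the kernel's EDAC subsystem exposes per-memory-controller counters in sysfs when an EDAC driver for the chipset is loaded. The sketch below reads the `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc`. Whether those files exist depends on the hardware, kernel and loaded drivers, so an empty result does not prove the machine is error-free. The function name `read_edac_counts` is my own. Windows is not covered here.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int or None, "ue": int or None}}.

    Returns an empty dict if the EDAC directory is absent, for example
    when no EDAC driver is loaded or the platform lacks ECC support.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for key, fname in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / fname).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None  # file missing or not a plain integer
        counts[mc.name] = entry
    return counts
```

Calling `read_edac_counts()` on a machine with EDAC active returns one entry per memory controller, such as `mc0`; a corrected-error count that keeps climbing is the signal to go hunting for the offending DIMM.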
