DRAM error rates: Nightmare on DIMM street

By Robin Harris | October 4, 2009, 10:04pm PDT

A two-and-a-half-year study of DRAM on tens of thousands of Google servers found that DIMM error rates are hundreds to thousands of times higher than thought: a mean of 3,751 correctable errors per DIMM per year.

This is the world’s first large-scale study of RAM errors in the field. It looked at multiple vendors, DRAM densities and DRAM types including DDR1, DDR2 and FB-DIMM.

Every system architect and motherboard designer should read it carefully.

If you can’t trust DRAM . . .
Here are some hard numbers from DRAM Errors in the Wild: A Large-Scale Field Study by Bianca Schroeder, U of Toronto, and Eduardo Pinheiro and Wolf-Dietrich Weber of Google.

The Google servers use ECC DRAM that typically corrects single bit errors and reports double bit errors. It is a rare notebook or consumer desktop that supports ECC.

You could be having DRAM problems and not know it because even the system doesn’t know.
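
To make "corrects single bit errors and reports double bit errors" concrete, here is a minimal Python sketch of a SECDED (single-error-correct, double-error-detect) code. It is a toy extended Hamming (8,4) code that protects 4 data bits. The study does not say which code Google's platforms use, and real ECC DIMMs protect much wider words, so the bit layout and the names `encode` and `decode` are illustrative assumptions only.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming(7,4) positions, with check bits at 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the whole word even parity
    return code


def decode(code):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions holding a 1
    overall_odd = sum(code) % 2 == 1

    if overall_odd:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code = code.copy()
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome == 0:
        status = "ok"
    else:
        # Even overall parity but a non-zero syndrome: two bits flipped.
        return None, "uncorrectable"

    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, status


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = word.copy()
    one_flip[5] ^= 1
    two_flips = one_flip.copy()
    two_flips[2] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

With one flipped bit the decoder recovers the original data and reports "corrected"; with two flipped bits it can only report "uncorrectable". A non-ECC DIMM has no check bits at all, so neither case would be noticed.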

Non-ECC DRAM is more common
Most DIMMs don’t include ECC because it costs more. Without ECC the system doesn’t know a memory error has occurred.

Everything is fine until the data corruption means a missed memory reference, an incorrect value or a flipped bit in a file being written to disk. What you see is a “file not found” or a “file not readable” message or, worse yet, silent data corruption - or even a system crash. And nothing that says “memory error.”

Conventional Wisdom
The industry take on DRAM is summed up in a quote from an old AnandTech FAQ that took the industry at its word:

Everyone can agree that hard errors are fairly rare. . . . For the frequency of soft errors. . . . IBM stated . . . that at sea level, a soft error event occurs once per month of constant use in a 128MB PC100 SDRAM module. Micron has stated that it is closer to once per six months . . . .

An even bigger surprise: it appears that hard errors, not soft errors, are the dominant error mode - the reverse of the conventional wisdom.
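
The "hundreds to thousands of times higher" claim can be checked with simple arithmetic against the vendor figures quoted above. This is a rough sketch, not a calculation from the paper: the variable names are made up here, and it compares per-module rates even though the quote is about soft errors on a 128MB module while the study counts all correctable errors on DIMMs that were likely far larger.

```python
# Figures taken from the article text.
GOOGLE_MEAN_CE_PER_DIMM_YEAR = 3751   # mean correctable errors per DIMM per year
IBM_ERRORS_PER_YEAR = 12              # "once per month of constant use"
MICRON_ERRORS_PER_YEAR = 2            # "closer to once per six months"


def times_higher(observed, expected):
    """How many times larger the observed rate is than the expected rate."""
    return observed / expected


if __name__ == "__main__":
    for vendor, rate in (("IBM", IBM_ERRORS_PER_YEAR),
                         ("Micron", MICRON_ERRORS_PER_YEAR)):
        ratio = times_higher(GOOGLE_MEAN_CE_PER_DIMM_YEAR, rate)
        print(f"vs. {vendor} figure: {ratio:.1f}x")
```

Against the IBM figure the field rate is roughly 300 times higher, and against the Micron figure roughly 1,900 times higher, which matches the range in the headline claim.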

Good news
The study had several findings that are good news for consumers:

  • Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn’t necessary.
  • The problem isn’t getting worse. The latest, most dense generations of DRAM perform as well, error wise, as previous generations.
  • Heavily used systems have more errors - meaning casual users have less to worry about.
  • No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMs they investigated.
  • Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems.

But something to think about for large-memory servers running, say, in-memory databases.
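
A back-of-the-envelope sketch shows why DIMM count matters. It assumes each DIMM independently has an 8% chance of seeing errors in a year. That independence is an assumption made here for illustration; the study's own finding that errors track the motherboard suggests real systems are more correlated than this.

```python
def p_any_dimm_with_errors(n_dimms, p_per_dimm=0.08):
    """Chance that at least one of n_dimms sees errors in a year.

    Assumes every DIMM has the same, independent yearly probability
    (p_per_dimm), which is a simplification.
    """
    return 1.0 - (1.0 - p_per_dimm) ** n_dimms


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:2d} DIMMs: {p_any_dimm_with_errors(n):.1%}")
```

Under that assumption a two-DIMM desktop has about a 15% chance per year of at least one affected DIMM, while an eight-DIMM machine is close to a coin flip.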

Bad news
Besides error rates much higher than expected - which is plenty bad - the study found that error rates depended on the motherboard, not on DIMM type or vendor. This means that some popular mobos have poor EMI hygiene. Route a memory trace too close to a noisy component or skimp on grounding layers and you get instant error problems.

Hardware failures are much more common as well and may be the most common type of memory failure. Google replaces all DIMMs with hard errors - as do most data centers - as a matter of policy.

Other interesting findings
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform. There be lemons out there!
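
If you have per-machine error counts of your own, checking for this kind of concentration takes a few lines. The sketch below is not the study's method; `top_share` is a hypothetical helper and the counts are invented purely to show the calculation.

```python
def top_share(error_counts, fraction=0.2):
    """Share of all errors contributed by the worst `fraction` of machines.

    error_counts should list only machines that reported at least one error.
    """
    ranked = sorted(error_counts, reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    return sum(ranked[:n_top]) / sum(ranked)


if __name__ == "__main__":
    # Made-up yearly correctable-error counts for ten machines with errors.
    hypothetical_counts = [5000, 3000, 120, 40, 20, 10, 5, 3, 1, 1]
    print(f"Top 20% of machines: {top_share(hypothetical_counts):.1%} of errors")
```

A result near or above 90% on real data would be consistent with the "lemons" pattern described above.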

In more than 93% of the cases a machine that sees a correctable error experiences at least one more in the same year. They don’t get better by themselves.

High quality error correction codes are effective in reducing uncorrectable errors. There are “chip-kill” DIMM/mobo combinations that can detect and correct 4 bit errors, but few vendors make those.

Besides costing more, ECC DIMMs are about 3-5% slower than unprotected DIMMs. Few of us would ever notice that small a performance hit, but gamers might care.

The Storage Bits take
You’d think that, given several decades of semiconductor DRAM usage, this study would be old news. I did.

Like most folks I accepted industry assurances that DRAM is reliable. My main machine today uses power-hungry fully-buffered ECC DIMMs.

But I was surprised when I checked the memory section of “About this Mac” and discovered that one of my six 2GB DIMMs was reporting correctable memory errors. Time to see if the “lifetime” warranty means anything.

I suspect this is another example of the industry’s code of omerta. Big system vendors have scads of data on disk drives, DRAM, network adapters, OS and filesystem based on mortality and tech support calls, but do they share this with the consuming public? Nothing to see here folks, just move along.

Kudos to Google for doing the long-term research required for substantive results and then sharing those results with the rest of us. Data is what makes your computer YOUR computer, and it is worth protecting. Forking over a bit more for ECC mobos and DIMMs may be worth it for serious users.

I expect ECC systems will become a lot more popular in the years ahead.

Comments welcome, of course. Can someone please document how to access ECC error reporting on Windows and Linux machines too? Thanks.
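
As a partial answer for Linux: when the kernel's EDAC driver for the memory controller is loaded, it exposes per-controller error counters under `/sys/devices/system/edac/mc/`. The sketch below reads the `ce_count` (correctable) and `ue_count` (uncorrectable) files. The function name `read_edac_counts` is made up here, and the script will find nothing on machines without ECC memory or without a supported EDAC driver. On Windows, corrected hardware errors are generally surfaced as WHEA events in the system event log, which this sketch does not cover.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"correctable": n, "uncorrectable": n}}.

    Reads the ce_count and ue_count files the Linux EDAC subsystem exposes
    for each memory controller (mc0, mc1, ...). Controllers whose counters
    cannot be read are skipped.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue
        counts[mc.name] = {"correctable": ce, "uncorrectable": ue}
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found (no ECC or no EDAC driver?)")
    for name, c in found.items():
        print(f"{name}: {c['correctable']} correctable, "
              f"{c['uncorrectable']} uncorrectable")
```

A non-zero correctable count that keeps growing is the Linux equivalent of the "ECC Errors" status shown in About This Mac.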


Robin Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small.

Disclosure

Robin Harris

Robin Harris is a president of TechnoQWAN, a consulting and analyst firm in northern Arizona. He also writes StorageMojo.com, a blog which accepts advertising from companies in the storage industry, and has a 25 year history with IT vendors. He has many industry contacts, many of whom are friends and all of whom he has opinions about. Robin has relationships with many companies in the technology industry. Every company he writes about may have sought to influence his opinion through carefully-crafted marketing messages and self-serving white papers, gifts ranging from desk calendars, t-shirts, lunches and trips as well as analyst or consulting assignments. He also invests in some technology companies. He may accept payment for services in stock as well. Robin discloses financial investments in or client relationships with companies named in Storage Bits.

To help readers sort out the gold from the dross in his writings, Robin tries to communicate his reasons as clearly as he can. If you agree, you are intelligent and discerning. If you disagree, well, you disagree.

In all cases, Robin encourages readers to subject everything they read, see or hear on the internet or from politicians to some simple questions:

  • What assumptions are implicit in the world view and judgments of the author?
  • What, if any, is the factual basis for the opinions the author expresses?
  • Is it reasonable, logical and clear?

Your critical faculties: use ‘em or lose ‘em!

Biography

Robin Harris

Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small. He introduced a couple of multi-billion dollar storage products (DLT, the first Fibre Channel array) to market, as well as many smaller ones. Earlier he spent 10 years marketing servers and networks. After leaving corporate life he founded TechnoQWAN, a consulting and analyst firm. He also developed StorageMojo into one of the top storage industry blogs.

Robin writes, consults, coaches and lives among the mountains of northern Arizona.

Talkback: most recent of 90 comments

  • Windows.
    At least from Vista on, Windows has a utility under Administrative Tools in the Control Panel.
    jdbukis@...
    5th Oct 2009
  • Which utility? Care to enlighten us?
    Windows has many utilities in the Control Panel.

    FWIW, there are many transient faults in RAM, CPU and in supporting chipsets that result in corrupted data. Even alpha particles now have a measurable impact on computing correctness!

    No "memory tester" utility will help identify, fix and prevent such corruptions, and the smaller and denser chips become, the greater the likelihood that alpha-particle based collisions will corrupt our data.

    I expect additional shielding to be fabricated into the cases enclosing many forms of chip as we progress toward and beyond 22nm manufacturing processes.
    de-void-21165590650301806002836337787023
    5th Oct 2009
  • I dunno about Vista... but in 7...
    ... there's an entry for "Windows Memory Diagnostic" that he's probably referring to.

    But that appears to be one of those "Run at boot" test programs, and not something that keeps a running log, like the article was describing.
    Hallowed are the Ori
    5th Oct 2009
  • Where in "About this Mac"?
    I'd just like to make sure I know the right place to look for
    errors. I look under "Hardware/Memory" and see a list of all
    8 DIMM slots. There's a column for Size, Type, Speed and
    Status, along with an expanded window beneath for the
    selected slot. I don't see any error count anywhere (and I'm
    set up to show "Full Profile"). Am I looking in the right place?
    MC_z
    5th Oct 2009
  • Click on Memory under Hardware...
    Click on Memory under Hardware. It'll list all the RAM slots and what type of memory is there. Where it says "Status," if there are any errors, it'll show there. Otherwise it'll just report back "OK."
    olePigeon
    5th Oct 2009
  • Thanks
    I was looking in the right place after all. Neither my home 8-core (DDR3) nor my work 8-core (DDR2) appears to have encountered any unrepairable errors.

    I say 'unrepairable' since I assume anything corrected in the ECC logic won't show up in the system log.
    MC_z
    5th Oct 2009
  • Might want to check if you have ECC memory
    As the subject says, if you don't have ECC memory, you'll never know you had an error until it bites you. As one client of mine found out the hard way when his low cost but great specs server came with standard memory rather than ECC memory.

    The system log will show you if an error was detected and corrected (single bit) or just detected but uncorrectable (multi-bit).
    DNSB
    5th Oct 2009
  • Here's what a Mac ECC error looks like
    From my Mac Pro's About This Mac->Hardware->Memory:

    DIMM Riser A/DIMM 1:

    Size: 2 GB
    Type: DDR2 FB-DIMM
    Speed: 667 MHz
    Status: ECC Errors
    ECC Correctable Errors: 1
    Manufacturer: 0x0000
    Part Number: 0x000000463732353642363145353636374600
    Serial Number: 0x00000000

    Note the status and the number of correctable errors.

    Robin
    R Harris
    5th Oct 2009
  • Mac ECC
    The About This Mac status column under Hardware -> Memory says
    either "empty" "OK" or "ECC errors." I'm checking to see if there is more
    detail in the system logs.

    Busy today but I hope I can get more info up late this afternoon.

    Robin
    R Harris
    5th Oct 2009
  • If you're a gamer, you'd take the faster cheaper dimm
    If you're a gamer, you'd take the faster cheaper dimm and
    probably run it over-volted and overclocked. An error
    just means that you'll need to reboot at worst.
    georgeou
    5th Oct 2009
  • Nope!
    It means you'll lose all your save game data in
    addition to needing to reinstall Windows, at
    worst.

    That's right. Because all the disk I/O functions
    are run from RAM, just like all software. One bit
    corrupted in one of them and it could trash all of
    the data on your hard drive.
    AzuMao
    5th Oct 2009
  • Riiiiight
    Need to re-install Windows due to a memory error? Wow, are you way off base. A memory error does not translate to a disk error specifically. The only way an error would have any remote effect on the OS is if the registry was in the memory region where the error occurred, and a write to disk happened immediately after. Very unlikely.

    The original post was correct. Just a reboot, and you're back in the game.
    Narg
    5th Oct 2009
  • Wroooooong
    The way you access files (the FAT in Windows) lives in RAM. If you change a file in any significant way, the FAT is written back to the hard drive. If a single bit of that memory is corrupted, your file system may be toast.

    Your MoBo and O/S may know enough to avoid this if ECC catches the problem or even if simple parity memory indicates there's an issue. But if not, a memory error can cause anything from nothing-much to man-am-I-screwed.
    MC_z
    5th Oct 2009
  • Wrong
    There are at least 2 FATs (file allocation tables), and there is a lot more involved in writing to the disk than one faulty bit... Let's do some math... First, the files have CRCs, which have an amazing ability to check for errors (not correct, but check), in the region of 99.9999999%. On top of that you have parity and other forms of bit correction. Then you usually have firmware that allows only a certain amount of data to be written at any one time without intervention, i.e. crazy data just being written will be short lived... All up, you may at worst lose a file or two, which in most cases a specialised program may be able to recover...

    PS Never lost a hard drive yet to a virus or data corruption or RAM... People destroy their drives by not listening... If you have a noisy drive, REPLACE IT!!!
    seveprim@...
    5th Oct 2009
