DRAM error rates: Nightmare on DIMM street

TOPICS: Hardware, Processors

A two-and-a-half-year study of DRAM on tens of thousands of Google servers found that DIMM error rates are hundreds to thousands of times higher than thought -- a mean of 3,751 correctable errors per DIMM per year.

This is the world's first large-scale study of RAM errors in the field. It looked at multiple vendors, DRAM densities and DRAM types including DDR1, DDR2 and FB-DIMM.

Every system architect and motherboard designer should read it carefully.

If you can’t trust DRAM . . . Here are some hard numbers from DRAM Errors in the Wild: A Large-Scale Field Study by Bianca Schroeder, U of Toronto, and Eduardo Pinheiro and Wolf-Dietrich Weber of Google.

The Google servers use ECC DRAM that typically corrects single bit errors and reports double bit errors. It is a rare notebook or consumer desktop that supports ECC.
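
To make "corrects single bit errors and reports double bit errors" concrete, here is a toy sketch of a SECDED (single-error-correct, double-error-detect) code. It uses an extended Hamming(8,4) code on a 4-bit value. Real ECC DIMMs typically protect 64-bit words with 8 check bits, and the article does not say which code any particular memory controller uses, so treat this only as an illustration of the principle. The names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d   # data bits sit at positions 3, 5, 6, 7
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions with bit 2 set
    code[0] = sum(code[1:]) % 2              # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # An odd number of flips: assume one, located by the syndrome
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                            # one flipped bit
print(decode(word))                     # ('corrected', 11)
word[2] ^= 1                            # a second flipped bit in the same word
print(decode(word))                     # ('uncorrectable', None)
```

The second case is the one that gets "reported": the hardware knows the word is bad but cannot tell which bits to fix.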

You could be having DRAM problems and not know it because even the system doesn't know.

Non-ECC DRAM is more common Most DIMMs don’t include ECC because it costs more. Without ECC the system doesn’t know a memory error has occurred.

Everything is fine until the data corruption means a missed memory reference, an incorrect value or a flipped bit in a file being written to disk. What you see is a “file not found” or “file not readable” message or, worse yet, silent data corruption - or even a system crash. And nothing that says “memory error.”

Conventional Wisdom The industry take on DRAM is summed up in a quote from an old AnandTech FAQ that took the industry at its word:

Everyone can agree that hard errors are fairly rare. . . . For the frequency of soft errors. . . . IBM stated . . . that at sea level, a soft error event occurs once per month of constant use in a 128MB PC100 SDRAM module. Micron has stated that it is closer to once per six months . . . .
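
A rough back-of-the-envelope comparison shows where "hundreds to thousands of times higher" comes from. The sketch below simply divides the study's mean by the two vendor figures quoted above. It is not capacity-normalized: the vendor numbers are for a 128MB module, and this article does not state the capacities of the DIMMs in the study, so the ratios are only indicative. The variable names are invented for the example.

```python
MEASURED_PER_DIMM_YEAR = 3751   # mean correctable errors per DIMM per year (study)
IBM_PER_MODULE_YEAR = 12        # "once per month" for a 128MB module
MICRON_PER_MODULE_YEAR = 2      # "once per six months"

ibm_ratio = MEASURED_PER_DIMM_YEAR / IBM_PER_MODULE_YEAR
micron_ratio = MEASURED_PER_DIMM_YEAR / MICRON_PER_MODULE_YEAR

print(f"vs. IBM figure:    about {ibm_ratio:,.0f}x higher")
print(f"vs. Micron figure: about {micron_ratio:,.0f}x higher")
```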

An even bigger surprise: it appears that hard errors, not soft errors, are the dominant error mode - the reverse of the conventional wisdom.

Good news The study had several findings that are good news for consumers:

  • Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn’t necessary.
  • The problem isn’t getting worse. The latest, densest generations of DRAM perform as well, error-wise, as previous generations.
  • Heavily used systems have more errors - meaning casual users have less to worry about.
  • No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMs they investigated.
  • Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems.

But something to think about for large-memory servers running, say, in-memory databases.
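
Here is a quick sketch of why DIMM count matters. It assumes each DIMM independently has an 8% chance of showing errors in a given year. That independence is a simplification of mine, not a claim from the study (which found errors concentrated on a minority of machines), so read the output as a rough intuition pump rather than a prediction. The function name is hypothetical.

```python
def p_machine_has_errors(n_dimms, p_dimm=0.08):
    """Chance that at least one of n_dimms shows errors in a year,
    assuming (simplistically) that DIMMs fail independently."""
    return 1 - (1 - p_dimm) ** n_dimms


for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} DIMMs: {p_machine_has_errors(n):.0%} chance of at least one DIMM with errors")
```

Under that assumption a single-DIMM machine sits at 8%, while an eight-DIMM box comes out at roughly 49%.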

Bad news Besides error rates much higher than expected - which is plenty bad - the study found that error rates depended on the motherboard, not on the DIMM type or vendor. This means that some popular mobos have poor EMI hygiene. Route a memory trace too close to a noisy component or skimp on grounding layers, and you get instant error problems.

Hardware failures are much more common as well and may be the most common type of memory failure. Google replaces all DIMMs with hard errors - as do most data centers - as a matter of policy.

Other interesting findings For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform. There be lemons out there!

In more than 93% of the cases a machine that sees a correctable error experiences at least one more in the same year. They don’t get better by themselves.

High-quality error correction codes are effective in reducing uncorrectable errors. There are “chip-kill” DIMM/mobo combinations that can detect and correct 4-bit errors, but few vendors make those.

Besides costing more, ECC DIMMs are about 3-5% slower than unprotected DIMMs. Few of us would ever notice that small a performance hit, but gamers might care.

The Storage Bits take You’d think that given the several decades of semiconductor DRAM usage that this study would be old news. I did.

Like most folks I accepted industry assurances that DRAM is reliable. My main machine today uses power-hungry fully-buffered ECC DIMMs.

But I was surprised when I checked the Memory section of "About this Mac" and discovered that 1 of my six 2GB DIMMs was reporting correctable memory errors. Time to see if the “lifetime” warranty means anything.

I suspect this is another example of the industry’s code of omerta. Big system vendors have scads of data on disk drives, DRAM, network adapters, OS and filesystem based on mortality and tech support calls, but do they share this with the consuming public? Nothing to see here folks, just move along.

Kudos to Google for doing the long-term research required for substantive results and then sharing those results with the rest of us. Data is what makes your computer YOUR computer, and it is worth protecting. Forking over a bit more for ECC mobos and DIMMs may be worth it for serious users.

I expect ECC systems will become a lot more popular in the years ahead.

Comments welcome, of course. Can someone please document how to access ECC error reporting on Windows and Linux machines too? Thanks.
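
For the Linux half of that question, one possible starting point is the kernel's EDAC subsystem, which exposes per-memory-controller counters under `/sys/devices/system/edac/mc/`. The sketch below reads the `ce_count` (correctable) and `ue_count` (uncorrectable) files. It assumes the machine has ECC memory and that an EDAC driver for its chipset is loaded; without those the directory is empty or missing. The function name `read_edac_counts` is made up, and Windows is not covered here.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts                   # no EDAC support visible on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None      # counter missing or unreadable
        counts[mc.name] = entry
    return counts


counts = read_edac_counts()
if not counts:
    print("No EDAC memory controllers found (no ECC, or no EDAC driver loaded).")
for name, c in counts.items():
    print(f"{name}: correctable={c['ce_count']} uncorrectable={c['ue_count']}")
```

The counters start from zero when the driver loads, so a periodic job that logs them would be needed to build up anything like the study's per-year numbers.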

About

Robin Harris has been a computer buff for over 35 years and selling and marketing data storage for over 30 years in companies large and small.

Talkback

85 comments
  • Windows.

    At least from Vista on, Windows has a utility under Administrative Tools in the Control Panel.
    jdbukis@...
    • Which utility? Care to enlighten us?

      Windows has many utilities in the Control Panel.

      FWIW, there are many transient faults in RAM, CPU and in supporting chipsets that result in corrupted data. Even alpha particles now have a measurable impact on computing correctness!

      No "memory tester" utility will help identify, fix and prevent such corruptions, and the smaller and denser chips become, the greater the likelihood that alpha-particle based collisions will corrupt our data.

      I expect additional shielding to be fabricated into the cases enclosing many forms of chip as we progress toward and beyond 22nm manufacturing processes.
      de-void-21165590650301806002836337787023
      • I dunno about Vista... but in 7...

        ... there's an entry for "Windows Memory Diagnostic" that he's probably referring to.

        But that appears to be one of those "Run at boot" test programs, and not something that keeps a running log, like the article was describing.
        Hallowed are the Ori
        • Of course not.

          It's impossible to detect a memory error without ECC RAM, and even then it only appears in the Windows Event Log. If the flipped bit is in executable code (like a program bit), your program or the system will crash; if it's in a resource (pictures etc., things that programs load), the resource will just come out with slight imperfections. It's one but, a 1GB RAM chip has 8,589,934,592 of them.
          Soundcloud GenthenaZero
          • Typo

            *Its one bit
            Soundcloud GenthenaZero
    • Windows has no such program.

      AzuMao
  • Where in "About this Mac"?

    I'd just like to make sure I know the right place to look for
    errors. I look under "Hardware/Memory" and see a list of all
    8 DIMM slots. There's a column for Size, Type, Speed and
    Status, along with an expanded window beneath for the
    selected slot. I don't see any error count anywhere (and I'm
    set up to show "Full Profile"). Am I looking in the right place?
    MC_z
    • Click on Memory under Hardware...

      Click on Memory under Hardware. It'll list all the RAM slots and what type of memory is there. Where it says "Status," if there are any errors, it'll show there. Otherwise it'll just report back "OK."
      olePigeon
      • Thanks

        I was looking in the right place after all. Neither my home 8-core (DDR3) nor my work 8-core (DDR2) appears to have encountered any unrepairable errors.

        I say 'unrepairable' since I assume anything corrected in the ECC logic won't show up in the system log.
        MC_z
        • Might want to check if you have ECC memory

          As the subject says, if you don't have ECC memory, you'll never know you had an error until it bites you. As one client of mine found out the hard way when his low cost but great specs server came with standard memory rather than ECC memory.

          The system log will show you if an error was detected and corrected (single bit) or just detected but uncorrectable (multi-bit).
          DNSB
        • Here's what a Mac ECC error looks like

          From my Mac Pro's About This Mac->Hardware->Memory:

          DIMM Riser A/DIMM 1:

          Size: 2 GB
          Type: DDR2 FB-DIMM
          Speed: 667 MHz
          Status: ECC Errors
          ECC Correctable Errors: 1
          Manufacturer: 0x0000
          Part Number: 0x000000463732353642363145353636374600
          Serial Number: 0x00000000

          Note the status and the number of correctable errors.

          Robin
          R Harris
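
If you'd rather check from a script than click through About This Mac, a report in the shape Robin posted above can be parsed with a few lines of Python. This is only a sketch: it assumes the text has one `Slot name:` header per DIMM followed by `Key: Value` lines, exactly as in the sample, and that the Status values are the ones mentioned in this thread ("OK", "Empty", "ECC Errors"). On a Mac the `system_profiler SPMemoryDataType` command should produce similar text, though I have not confirmed that its layout matches on every model. The function names are invented.

```python
SAMPLE = """\
DIMM Riser A/DIMM 1:

  Size: 2 GB
  Type: DDR2 FB-DIMM
  Speed: 667 MHz
  Status: ECC Errors
  ECC Correctable Errors: 1
"""


def parse_memory_report(text):
    """Split the report into one dict per DIMM slot."""
    slots, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):                  # a slot header such as "DIMM Riser A/DIMM 1:"
            current = {"slot": line[:-1]}
            slots.append(current)
        elif ":" in line and current is not None:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return slots


def slots_with_errors(slots):
    """Keep only slots whose Status is something other than OK or Empty."""
    return [s for s in slots if s.get("Status", "OK").lower() not in ("ok", "empty")]


for slot in slots_with_errors(parse_memory_report(SAMPLE)):
    print(slot["slot"], "->", slot["Status"],
          "| correctable:", slot.get("ECC Correctable Errors", "?"))
```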
    • Mac ECC

      The About This Mac status column under Hardware -> Memory says
      either "empty" "OK" or "ECC errors." I'm checking to see if there is more
      detail in the system logs.

      Busy today but I hope I can get more info up late this afternoon.

      Robin
      R Harris
  • If you're a gamer, you'd take the faster cheaper dimm

    If you're a gamer, you'd take the faster cheaper dimm and
    probably run it over-volted and overclocked. An error
    just means that you'll need to reboot at worst.
    georgeou
    • Nope!

      It means you'll lose all your save game data in
      addition to needing to reinstall Windows, at
      worst.

      That's right. Because all the disk I/O functions
      are run from RAM, just like all software. One bit
      corrupted in one of them and it could trash all of
      the data on your hard drive.
      AzuMao
      • Riiiiight

        Need to re-install Windows due to a memory error? Wow, are you way off base. A memory error does not translate to a disk error specifically. The only way an error would have any remote effect on the OS is if the registry was in the memory region where the error occurred, and a write to disk happened immediately after. Very unlikely.

        The original post was correct. Just a reboot, and you're back in the game.
        Narg
        • Wroooooong

          The way you access files (the FAT in Windows) lives in RAM. If you change a file in any significant way, the FAT is written back to the hard drive. If a single bit of that memory is corrupted, your file system may be toast.

          Your MoBo and O/S may know enough to avoid this if ECC catches the problem or even if simple parity memory indicates there's an issue. But if not, a memory error can cause anything from nothing-much to man-am-I-screwed.
          MC_z
          • Wrong

            There are at least 2 FATs (File Allocation Tables), and there is a lot more involved in writing to the disk than one faulty bit... Let's do some maths... First, the files have CRCs, which have an amazing ability to check for errors (not correct, but check), in the region of 99.9999999%. On top of that you have parity and other forms of bit correction. Then you usually have firmware that allows only certain amounts of data to be written at any one time without intervention, i.e. crazy data just being written will be short-lived... All up, you may at worst lose a file or two, which in most cases a specialised program may be able to recover...

            PS: Never lost a hard drive yet to a virus or data corruption or RAM... People destroy their drives by not listening... If you have a noisy drive, REPLACE IT!!!
            seveprim@...
          • CRC

            Umm... hard drives use CRC and ECC. Back in the days of SASI interfaced hard drives, it was common to re-read a bad block several times before using the ECC error correction since it was damn slow.

            Unfortunately, if the data supplied by your computer's memory is bad, all your hard drive is going to do is to return that bad data correctly.

            GIGO is NOT short for garbage in, gospel out.

            As for FAT, that's a dying format on larger hard drives.
            DNSB
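
DNSB's point can be shown in a few lines. The sketch below uses CRC-32 from Python's `zlib` as a stand-in for whatever checksum a drive or file format applies; real drives use their own on-disk ECC, so this is an analogy, not a model of any specific hardware. The helpers `write_block`, `read_block` and `flip_bit`, and the sample data, are hypothetical.

```python
import zlib


def write_block(data):
    """Simulate storage that records the data plus a CRC computed at write time."""
    return data, zlib.crc32(data)


def read_block(stored):
    """Return the data if its CRC still matches, otherwise raise."""
    data, crc = stored
    if zlib.crc32(data) != crc:
        raise ValueError("CRC mismatch: corruption happened after the write")
    return data


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


original = b"balance=1000"
in_ram = flip_bit(original, 67)        # a bit flips in RAM *before* the write

stored = write_block(in_ram)           # the CRC is computed over the bad data
print(read_block(stored))              # b'balance=9000' and the checksum is happy
```

The checksum only guards the trip from the write path to the platter and back. A flip that happens after the write is caught; a flip that happens in RAM beforehand is faithfully preserved.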
          • from nothing-much to man-am-I-screwed.

            And don't forget all that data Windows writes to disc prior to shutdown.
            Agnostic_OS
        • Did you even read my post? In fact; do you even know what a computer is?

          Operating systems, applications, etc, reside in
          RAM. These things are known as software.

          And the operating system is responsible for
          writing data to the storage drive, among many
          other software<->hardware interactions.

          If the routine that writes data becomes
          corrupted, it could get stuck in a loop that
          zeros out your entire hard drive when a single
          modification to a single file gets made,
          instead of just modifying that file.


          For example; in machine code, the difference
          between a jne and a je is only one single bit.
          In the worst case scenario this could make the
          routine keep looping indefinitely and wipe out
          the entire hard drive!
          AzuMao
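
The one-bit claim about `je` and `jne` is easy to check. In x86 machine code the short-jump forms are the single opcode bytes 0x74 (`je`) and 0x75 (`jne`), and the near forms are 0x0F 0x84 and 0x0F 0x85. The snippet below just counts the differing bits. It says nothing about how likely such a flip is, or what a given routine would do afterwards; that part of the argument remains the commenter's worst case.

```python
JE_SHORT, JNE_SHORT = 0x74, 0x75        # je rel8 / jne rel8
JE_NEAR, JNE_NEAR = 0x0F84, 0x0F85      # je rel32 / jne rel32 (two-byte opcodes)


def differing_bits(a, b):
    """Number of bit positions in which a and b differ (Hamming distance)."""
    return bin(a ^ b).count("1")


print(f"{JE_SHORT:08b}  je")
print(f"{JNE_SHORT:08b}  jne")
print("bits that differ (short form):", differing_bits(JE_SHORT, JNE_SHORT))
print("bits that differ (near form): ", differing_bits(JE_NEAR, JNE_NEAR))
```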