DRAM errors in embedded systems

DRAM errors aren't just a problem for main memory: they also afflict embedded subsystems, like disk controllers, that use DRAM. Where are the problems and what can you do?

The fact that DRAM error rates are hundreds to thousands of times worse than the industry admitted has implications beyond your computer's main memory. Unprotected DRAM is in many embedded subsystems that your computer relies on - both local and across the Internet.

At any time DRAM errors can affect your data - without any indication that DRAM errors are responsible.

Let me count the ways

DRAM is the fast and handy "scratch pad" memory embedded systems rely on to buffer data in process. Here are some common uses:

  • VRAM. Video cards with a gigabyte of memory are common - but who cares about a flipped bit on a frame of video? Bigger worry: the use of GPUs as general purpose multi-processors.
  • Network adapters. Network and user data is buffered in DRAM. If a network address is corrupted, data goes astray and retries hit network performance. In large, network-based supercomputers this problem has already been seen.
  • Storage controllers. Like network adapters, storage controllers buffer data in DRAM. Another reason for end-to-end checksums like those ZFS has - and your file system doesn't.
  • Solid State Disks. NAND flash writes are slow and flash SSDs require frequent data rewrites, so designers use unprotected DRAM to buffer data in transit.
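To make the end-to-end checksum idea from the storage controller item concrete, here is a minimal sketch in Python. It assumes a checksum is computed at the source, before the data enters any buffer, and verified again at the destination. The function names (`write_block`, `flip_bit`, `verify_block`) are hypothetical, and CRC-32 is only a stand-in: it is not the checksum ZFS uses, and this is not how any particular controller works.

```python
import zlib
from typing import Tuple


def write_block(data: bytes) -> Tuple[bytes, int]:
    """Compute a checksum at the source, before the data is buffered anywhere."""
    return data, zlib.crc32(data)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit DRAM error in a buffered copy of the data."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def verify_block(data: bytes, expected_crc: int) -> bool:
    """Recompute the checksum at the destination and compare with the original."""
    return zlib.crc32(data) == expected_crc


payload = b"important user data" * 4
block, crc = write_block(payload)

# The copy sitting in an unprotected buffer picks up one flipped bit.
buffered = flip_bit(block, 42)

intact_ok = verify_block(block, crc)      # untouched data passes the check
corrupt_ok = verify_block(buffered, crc)  # the flipped bit is caught
print(intact_ok, corrupt_ok)
```

The point of the sketch is where the check happens, not which algorithm is used: because the checksum travels with the data from end to end, a bit flipped in any buffer along the way is detected instead of being silently written out.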

The Storage Bits take

Data corruption is an all-too-common problem. Most of the time, though, the corruption isn't labeled with a big red "Corrupted Data."

The corruption shows up as unreadable or lost files, lost network packets and connections, unexplained system crashes - anything but the true root cause.

What can users do?

Enterprise users should demand that all embedded systems in the data path use ECC memory. Anything less is NOT enterprise-class.
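For readers who haven't seen what ECC actually buys you, here is a toy sketch of single-error correction using a Hamming(7,4) code in Python. It is an illustration only: the function names are hypothetical, and real ECC memory typically protects much wider words (commonly 64 data bits plus 8 check bits) and can also detect double-bit errors, which this toy cannot.

```python
from typing import List, Tuple


def hamming74_encode(nibble: int) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits: List[int]) -> Tuple[int, int]:
    """Return (corrected nibble, syndrome); syndrome 0 means no error seen."""
    code = [0] + list(bits)
    s1 = code[1] ^ code[3] ^ code[5] ^ code[7]
    s2 = code[2] ^ code[3] ^ code[6] ^ code[7]
    s4 = code[4] ^ code[5] ^ code[6] ^ code[7]
    syndrome = s1 + 2 * s2 + 4 * s4     # position of the flipped bit, if any
    if syndrome:
        code[syndrome] ^= 1             # correct the single-bit error
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome


stored = hamming74_encode(0b1011)
stored[4] ^= 1                          # simulate a DRAM bit flip at position 5
recovered, syndrome = hamming74_decode(stored)
print(recovered == 0b1011, syndrome)
```

With the extra check bits, the flipped bit is located and repaired on read. Without them, the same flip would simply hand back wrong data with no indication that anything happened.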

Personal users - like me - have fewer options. You can insist on systems that have ECC DRAM - good luck finding a notebook that does - but until system designers understand the issues and customers are willing to pay the costs, there is no fix for our rickety PC infrastructure.

Just live with it.

Comments welcome, of course.
