Data corruption at massive scale

Data corruption is a fact of life for all systems - but only large systems get a statistically significant sample. Amazon, home of the world's largest cloud storage system, sees it all and concludes: trust nothing!

While most computer users naively think - hope! - their data is safe, Internet-scale operations are all too aware of the reality.

Every time Storage Bits writes about data corruption, readers can be counted on to proclaim they've never seen it. But people's memories are lousy, and few realize how many ways data corruption can manifest itself.

Storage Bits has written about this many times (see "DRAM error rates: Nightmare on DIMM street" and "How Microsoft puts your data at risk"). In a recent post, Amazon Web Services Vice President and Distinguished Engineer James Hamilton - formerly of Microsoft - writes about his experience with errors on high-scale systems.

To summarize his experience:

  • Hardware, software and firmware all introduce errors. ". . . absolutely nobody and nothing can be trusted."
  • More error detection is always better. Every time he's added more he's been ". . . amazed at the frequency the error correction code fired."
  • Corruption is everywhere. In one case he found latent data corruption on customer disks so severe that customers blamed buggy software when the real problem was on the disk.
  • You absolutely need ECC on servers. And, he concludes ". . . ECC memory should be part of all client systems."
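
What "more error detection" can look like in practice: the sketch below stores a checksum next to each record and verifies it on every read. It is an illustration only, not Mr. Hamilton's or Amazon's method. The function names (seal, unseal, flip_bit) and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib

DIGEST_LEN = 32  # SHA-256 produces a 32-byte digest


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 digest (hypothetical record format)."""
    return hashlib.sha256(payload).digest() + payload


def unseal(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if it no longer matches its digest."""
    digest, payload = record[:DIGEST_LEN], record[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: data corrupted in storage or transit")
    return payload


def flip_bit(record: bytes, bit_index: int) -> bytes:
    """Simulate silent corruption by flipping a single bit in the record."""
    damaged = bytearray(record)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


record = seal(b"customer data")
assert unseal(record) == b"customer data"  # a clean read passes verification

try:
    unseal(flip_bit(record, 300))  # bit 300 falls inside the payload
    detected = False
except ValueError:
    detected = True
print("corruption detected:", detected)
```

A check like this only detects corruption; it does not repair it. Recovery still needs a second good copy or an error-correcting code, which is the job ECC memory does for DRAM.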

The Storage Bits take

Good luck finding a notebook computer with ECC memory: the low-margin PC model - and consumer ignorance - ensures no such beast exists. But with notebooks supporting 16GB and more, the need is real.

But as Mr. Hamilton notes, corruption exists at every level of the storage, network, and compute stack. File systems, drivers, disks, NICs, switches, DRAM and more.

And to all of you itching to tell us you've never seen data corruption, ask yourselves: how do you know? No BSOD? Never a missing DLL? No files that don't open? Downloads that fail? Never?

This is one instance where the consumerization of IT is taking us the wrong way: away from data integrity. Cheaply though.

The industry attitude is like Detroit in the 60s towards safety: data integrity will never sell. Yet today every car ad trumpets safety features.

In time, we can hope, every computer vendor - and ad - will do the same.

Comments welcome, of course. I'm glad massive buyers like Amazon are keeping vendors honest. And kudos to Microsoft for stepping up its game with ReFS. Too bad Apple is ignoring this reality.

