Data corruption at massive scale

Summary: Data corruption is a fact of life for all systems - but only large systems get a statistically significant sample. Amazon, home of the world's largest cloud storage system, sees it all and concludes: trust nothing!

TOPICS: Hardware, CXO, Storage

While most computer users naively think - hope! - their data is safe, Internet-scale operations are all too aware of the reality.

Every time Storage Bits writes about data corruption, readers can be counted on to proclaim they've never seen it. But people's memories are lousy, and few realize how many ways data corruption can manifest itself.

Storage Bits has written about this many times (see DRAM error rates: Nightmare on DIMM street and How Microsoft puts your data at risk). In a recent post Amazon Web Services Vice President and Distinguished Engineer James Hamilton - formerly of Microsoft - writes about his experience with errors on high-scale systems.

To summarize his experience:

  • Hardware, software and firmware all introduce errors. ". . . absolutely nobody and nothing can be trusted."
  • More error detection is always better. Every time he's added more he's been ". . . amazed at the frequency the error correction code fired."
  • Corruption is everywhere. In one case he found latent data corruption on customer disks that was so bad that customers thought the software was buggy while the real problem was on-disk.
  • You absolutely need ECC on servers. And, he concludes ". . . ECC memory should be part of all client systems."
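
To make the "more error detection" point concrete, here is a minimal sketch of the idea in Python: attach a checksum when data is written and verify it on every read. It is an illustration only, not anything Mr. Hamilton or Amazon describes; the names seal, unseal and flip_bit are hypothetical, and CRC32 is simply a convenient stand-in for whatever check a real system would use.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Return the payload with a 4-byte CRC32 appended (hypothetical record format)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")


def unseal(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if the stored CRC does not match."""
    if len(record) < 4:
        raise ValueError("record too short to hold a checksum")
    payload, stored = record[:-4], int.from_bytes(record[-4:], "big")
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("checksum mismatch: data was corrupted after it was sealed")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error, the kind a bad DIMM or cable might introduce."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


if __name__ == "__main__":
    record = seal(b"customer data")
    assert unseal(record) == b"customer data"   # clean read passes
    try:
        unseal(flip_bit(record, 5))             # one flipped bit
    except ValueError as err:
        print("detected:", err)
```

A check like this only detects corruption; it does not repair it, and it catches only errors that occur after the checksum was computed. That is the case for adding checks at every layer rather than trusting any single one.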

The Storage Bits take

Good luck finding a notebook computer with ECC memory: the low-margin PC model - and consumer ignorance - ensures no such beast exists. But with notebooks supporting 16GB and more the need is real.

But as Mr. Hamilton notes, corruption exists at every level of the storage, network, and compute stack. File systems, drivers, disks, NICs, switches, DRAM and more.

And to all who are itching to tell us how you've never seen data corruption, ask yourself: how do you know? No BSOD? Never a missing DLL? No files that don't open? Downloads that fail? Never?
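
One way to actually know is to record a hash of every file and re-check it later. The following is a rough Python sketch under stated assumptions: the function names are made up for illustration, SHA-256 is an arbitrary choice, and a mismatch only tells you the bytes changed, not whether the cause was corruption or a legitimate edit.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its current digest."""
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries whose file is missing or no longer matches its digest."""
    bad = []
    for rel, digest in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != digest:
            bad.append(rel)
    return bad
```

Run build_manifest once on files that are not supposed to change (photos, archives, backups), store the result somewhere safe, and run changed_files against it periodically. Any entry it reports for a file nobody edited is the data corruption you have "never seen."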

This is one instance where the consumerization of IT is taking us the wrong way: away from data integrity. Cheaply though.

The industry attitude is like Detroit in the 60s towards safety: data integrity will never sell. Yet today every car ad trumpets safety features.

In time, we can hope, every computer vendor - and ad - will do the same.

Comments welcome, of course. I'm glad massive buyers like Amazon are keeping vendors honest. And kudos to Microsoft for stepping up their game with ReFS. Too bad Apple is ignoring this reality.

  • Too bad Apple is ignoring this reality

    Since Apple is one of the leaders in pushing consumers to cloud computing, I expect that they are not ignoring it on their server farms. If current trends continue, we will not be storing much data on our own computers.
    • I'd love some way to verify that

      I tend to agree with your supposition, but I'd love some way to verify it. Apple being Apple, I will probably have to live with disappointment on this one.

      Also, the article points out that it's not just the data storage point (iCloud or local HDD/SSD - same function) but it's the RAM in your computer, the on board cache, the switches at your ISP, everywhere.
    • The useful thing to remember is...

      ...that corporations tend to be focused on making money, which means it's up to us customers to make sure that they actually serve our interests, as well as their own.

      Sad but true.
      John L. Ries
  • At least storage is getting better...

    Modern enterprise drives are rolling out T10-DIF support, and most of the major storage vendors have some home-grown variant of it (DIF basically attaches a checksum to each block that is verified all the way along the I/O path and on the disk itself - so you are at least warned if corruption happens).

    Additionally, most of the major enterprise vendors are moving away from reliance on simple RAID and moving to double parity variants (RAID 6, NetApp's RAID-DP). While they don't stop corruption, such schemes are better at finding and repairing it. And with the new T10 proposal for drives assisting RAID array rebuilds, recovering from corruption won't take as long as it currently does.

    Yes, DIMMs still suck, and all of my servers use ECC memory, but at least data at rest is getting better...