Many people reacted with disbelief to my recent series on data corruption (see How data gets lost, 50 ways to lose your data and How Microsoft puts your data at risk), claiming it had never happened to them. Really? Never had to reinstall an application or an OS, or had a file that wouldn't open?
Are you sure?
The research on silent data corruption has been theoretical or anecdotal, not statistical. But now, finally, some statistics are in. And the numbers are worse than I'd imagined.
Petabytes of on-disk data analyzed
At CERN, the world's largest particle physics lab, several researchers have analyzed the creation and propagation of silent data corruption. CERN's huge collider - built beneath Switzerland and France - will generate 15 thousand terabytes of data next year.
The experiments at CERN - high energy "shots" that create many terabytes of data in a few seconds - then require months of careful statistical analysis to find traces of rare and short-lived particles. Errors in the data could invalidate the results, so CERN scientists and engineers did a systematic analysis to find silent data corruption events.
Statistics work best with large sample sizes. As you'll see, CERN has very large sample sizes.
The analysis looked at data corruption at 3 levels:
- Disk errors. For 5 weeks, they wrote a special 2 GB file to more than 3,000 nodes every 2 hours and read it back, checking for errors. They found 500 errors on 100 nodes. The errors came in three sizes:
  - Single bit errors. 10% of disk errors.
  - Sector (512 bytes) sized errors. 10% of disk errors.
  - 64 KB regions. 80% of disk errors. This one turned out to be a bug in WD disk firmware interacting with 3Ware controller cards, which CERN fixed by updating the firmware in 3,000 drives.
- RAID errors. They ran the verify command on 492 RAID systems each week for 4 weeks. The RAID controllers were spec'd at a Bit Error Rate of 1 error in 10^14 bits read/written. The good news is that the observed BER was only about a third of the spec'd rate. The bad news is that in reading/writing 2.4 petabytes of data, there were some 300 errors.
- Memory errors. Good news: only 3 double-bit errors in 3 months on 1,300 nodes. Bad news: according to the spec there shouldn't have been any. Double-bit errors matter because, unlike single-bit errors, they can't be corrected.
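The disk check in the first bullet is simple enough to sketch. What follows is not CERN's actual tool, just a minimal Python illustration of the write-then-verify idea. The function names, the SHA-256-derived pattern, and the 64 KB comparison block are all my own assumptions.

```python
import hashlib
import os

BLOCK = 64 * 1024  # compare in 64 KB blocks (an assumed granularity)

def pattern_block(index: int, size: int = BLOCK) -> bytes:
    """Deterministic block of bytes derived from the block's index."""
    seed = hashlib.sha256(index.to_bytes(8, "big")).digest()  # 32 bytes
    return (seed * (size // len(seed) + 1))[:size]

def write_probe(path: str, blocks: int) -> None:
    """Write `blocks` pattern blocks to `path` and push them to the device."""
    with open(path, "wb") as f:
        for i in range(blocks):
            f.write(pattern_block(i))
        f.flush()
        os.fsync(f.fileno())

def verify_probe(path: str, blocks: int) -> list[int]:
    """Return the indices of blocks that no longer match the pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(blocks):
            if f.read(BLOCK) != pattern_block(i):
                bad.append(i)
    return bad
```

One caveat: a read that comes right after the write is often served from the operating system's page cache rather than the disk, so a real probe would need to get around the cache before the read-back means anything.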
All of these errors will corrupt user data. When they checked 8.7 TB of user data for corruption - 33,700 files - they found 22 corrupted files, or about 1 in every 1,500 files.
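For anyone who wants to check the arithmetic, here it is in a few lines of Python. The inputs are the figures quoted above. The average file size is my own derived number and assumes decimal terabytes (1 TB = 1,000,000 MB).

```python
files_checked = 33_700
corrupt_files = 22
data_checked_tb = 8.7

# How many files, on average, per corrupted file.
files_per_corrupt_file = files_checked / corrupt_files

# Average file size in MB, assuming decimal units.
avg_file_mb = data_checked_tb * 1_000_000 / files_checked

print(f"1 corrupt file in every {files_per_corrupt_file:.0f}")  # 1 corrupt file in every 1532
print(f"average file size: {avg_file_mb:.0f} MB")               # average file size: 258 MB
```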
The bottom line
CERN found an overall byte error rate of 3 * 10^-7, a rate considerably higher than numbers like 1 in 10^14 or 1 in 10^12 spec'd for components would suggest. This isn't sinister.
It's the BER of each link in the chain from CPU to disk and back again, plus the fact that some traffic, such as transferring a byte from the network to a disk, requires 6 memory r/w operations. That really pumps up the data volume and with it the likelihood of encountering an error.
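To see how per-link rates stack up, here is a rough Python sketch. It assumes each link or operation fails independently, which is a simplification, and the 1e-14 rate and the six-step chain are illustrative numbers of mine, not CERN measurements.

```python
import math

def combined_error_rate(link_rates):
    """Probability that at least one link corrupts a given unit of data,
    assuming every link fails independently at its own rate."""
    # log1p/expm1 keep precision when the individual rates are tiny.
    log_all_ok = sum(math.log1p(-p) for p in link_rates)
    return -math.expm1(log_all_ok)

# Hypothetical chain: six operations, each with an illustrative 1e-14 rate.
rate = combined_error_rate([1e-14] * 6)
print(rate)  # very close to 6e-14: small rates simply add up
```

Adding links only multiplies the rate by a small factor, so this alone does not take you from 10^-14 to 3 * 10^-7. The firmware bug above is a reminder that real systems also fail in ways the component specs never mention.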
The Storage Bits take
My system has 1 TB of data on it, so if the CERN numbers hold true for me, I have 3 corrupt files. Not a big deal for most people today. But if the industry doesn't fix it, the silent data corruption problem will get worse. In "Rules of thumb in data engineering" the late Jim Gray posited that everything on disk today will be in main memory in 10 years.
If that empirical relationship holds, my PC in 2017 will have a 1 TB main memory and a 200 TB disk store. And about 500 corrupt files. At that point everyone will see data corruption and the vendors will have to do something.
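Those corrupt-file counts are a straight linear scaling of CERN's 22 corrupted files in 8.7 TB. A quick sketch, assuming my files look like CERN's in size and that corruption scales with capacity (both big assumptions); the function name is mine:

```python
# Figures from CERN's user-data check, as quoted above.
CORRUPT_FILES = 22
DATA_CHECKED_TB = 8.7

def expected_corrupt_files(store_tb: float) -> float:
    """Scale the observed corrupt-files-per-TB rate to a store of `store_tb` TB."""
    return CORRUPT_FILES / DATA_CHECKED_TB * store_tb

print(round(expected_corrupt_files(1)))    # 3
print(round(expected_corrupt_files(200)))  # 506, i.e. about 500
```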
So why not start fixing the problem now?
Comments welcome, of course. Here's a link to the CERN Data Integrity paper. CERN runs Linux clusters, but based on the research Windows and Mac wouldn't be much different.