Data corruption is worse than you know

Summary: Many people reacted with disbelief to my recent series on data corruption (see How data gets lost, 50 ways to lose your data and How Microsoft puts your data at risk), claiming it had never happened to them. Really?

Many people reacted with disbelief to my recent series on data corruption (see How data gets lost, 50 ways to lose your data and How Microsoft puts your data at risk), claiming it had never happened to them. Really? Never had to reinstall an application or an OS? Never had a file that wouldn't open?

Are you sure? The research on silent data corruption has been theoretical or anecdotal, not statistical. But now, finally, some statistics are in. And the numbers are worse than I'd imagined.

Petabytes of on-disk data analyzed
At CERN, the world's largest particle physics lab, several researchers have analyzed the creation and propagation of silent data corruption. CERN's huge collider - built beneath Switzerland and France - will generate 15 thousand terabytes of data next year.

The experiments at CERN - high energy "shots" that create many terabytes of data in a few seconds - then require months of careful statistical analysis to find traces of rare and short-lived particles. Errors in the data could invalidate the results, so CERN scientists and engineers did a systematic analysis to find silent data corruption events.

Statistics work best with large sample sizes. As you'll see, CERN has very large sample sizes.

The program
The analysis looked at data corruption at 3 levels:

  • Disk errors. For 5 weeks they wrote a special 2 GB file to more than 3,000 nodes every 2 hours and read it back, checking for errors. They found 500 errors on 100 nodes. (A minimal sketch of this kind of write/read-back check appears after the list.)
    • Single bit errors. 10% of disk errors.
    • Sector (512 bytes) sized errors. 10% of disk errors.
    • 64 KB regions. 80% of disk errors. This one turned out to be a bug in WD disk firmware interacting with 3Ware controller cards, which CERN fixed by updating the firmware in 3,000 drives.

  • RAID errors. They ran the verify command on 492 RAID systems each week for 4 weeks. The RAID controllers were spec'd at a Bit Error Rate of 1 in 10^14 bits read/written. The good news is that the observed BER was only about a third of the spec'd rate. The bad news is that in reading and writing 2.4 petabytes of data there were still some 300 errors.
  • Memory errors. Good news: only 3 double-bit errors in 3 months on 1,300 nodes. Bad news: according to the spec there shouldn't have been any, and while ECC corrects single-bit errors, double-bit errors can't be corrected.
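
The disk test above boils down to writing a known pattern and reading it back later to see whether anything changed. Here's a minimal sketch of that idea in Python - the file path, pattern and schedule are illustrative assumptions, not CERN's actual test program:

    import hashlib
    import os

    TEST_FILE = "/tmp/disk_probe.bin"   # illustrative path, not CERN's
    CHUNK = 1024 * 1024                 # 1 MB blocks
    TOTAL_MB = 2048                     # ~2 GB probe file, as in the CERN test

    def write_pattern(path):
        """Write a deterministic pattern and return its SHA-256 digest."""
        digest = hashlib.sha256()
        with open(path, "wb") as f:
            for i in range(TOTAL_MB):
                block = bytes([i % 256]) * CHUNK
                f.write(block)
                digest.update(block)
            f.flush()
            os.fsync(f.fileno())        # push the data toward the platters
        return digest.hexdigest()

    def verify_pattern(path, expected):
        """Re-read the file and compare against the recorded digest."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(CHUNK), b""):
                digest.update(chunk)
        return digest.hexdigest() == expected

    if __name__ == "__main__":
        expected = write_pattern(TEST_FILE)
        # In a real probe the verify step runs later (e.g. every 2 hours),
        # after the OS cache no longer holds the data.
        print("OK" if verify_pattern(TEST_FILE, expected) else "CORRUPTION DETECTED")

A digest comparison like this only tells you that something changed, not where or why; CERN's analysis also classified the errors by size, which takes more bookkeeping.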

All of these errors will corrupt user data. When they checked 8.7 TB of user data for corruption - 33,700 files - they found 22 corrupted files, or 1 in every 1500 files.
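
Checking existing user files for silent corruption works the same way: record a checksum for every file, then re-verify later. This is a minimal sketch of that approach, not the tooling CERN actually used; the manifest filename and data root are assumptions:

    import hashlib
    import json
    import os

    MANIFEST = "checksums.json"          # hypothetical manifest file

    def sha256_of(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_manifest(root):
        """Record a checksum for every file under root."""
        manifest = {}
        for dirpath, _, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                manifest[path] = sha256_of(path)
        with open(MANIFEST, "w") as f:
            json.dump(manifest, f, indent=2)

    def verify_manifest():
        """Return files whose contents no longer match the recorded checksum."""
        with open(MANIFEST) as f:
            manifest = json.load(f)
        return [p for p, digest in manifest.items()
                if not os.path.exists(p) or sha256_of(p) != digest]

    if __name__ == "__main__":
        build_manifest("/data")           # "/data" is an illustrative root
        # Re-run the verify step later - days or months on - to catch silent changes.
        print(verify_manifest() or "all files match their recorded checksums")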

The bottom line
CERN found an overall byte error rate of roughly 3 in 10^7 - about one corrupted byte in every 3 million - considerably higher than component specs like 1 error in 10^14 or 10^12 bits would suggest. This isn't sinister.

It's the combined BER of every link in the chain from CPU to disk and back again, plus the fact that some traffic, such as transferring a byte from the network to a disk, requires 6 memory read/write operations. That really pumps up the data volume and, with it, the likelihood of encountering an error.
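
To see why the chain matters, here's a back-of-the-envelope model. The per-link rates below are illustrative assumptions, not measured values; the point is that the rates add up and the extra memory traffic multiplies the exposure:

    # Every byte that moves from the network to a disk crosses several links,
    # and each link has its own byte error rate. These rates are made up
    # for illustration only.
    links = {
        "network -> memory":     1e-12,
        "memory r/w (x6)":   6 * 1e-12,   # the 6 memory operations per byte
        "memory -> controller":  1e-12,
        "controller -> disk":    1e-13,
    }

    # For tiny rates the per-link probabilities simply add.
    per_byte_rate = sum(links.values())
    bytes_moved = 10**12                  # 1 TB of traffic

    print("effective per-byte error rate: %.1e" % per_byte_rate)
    print("expected corrupted bytes per TB moved: %.0f" % (per_byte_rate * bytes_moved))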

The Storage Bits take
My system has 1 TB of data on it, so if the CERN numbers hold true for me, I have 3 corrupt files. Not a big deal for most people today. But if the industry doesn't fix it, the silent data corruption problem will get worse. In "Rules of thumb in data engineering" the late Jim Gray posited that everything on disk today will be in main memory in 10 years.

If that empirical relationship holds, my PC in 2017 will have a 1 TB main memory and a 200 TB disk store. And about 500 corrupt files. At that point everyone will see data corruption and the vendors will have to do something.
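
For the curious, those two estimates come from scaling CERN's observed rate of 22 corrupt files in 8.7 TB of checked data - a back-of-the-envelope projection, nothing more:

    # CERN found 22 corrupt files in 8.7 TB of checked user data.
    corrupt_per_tb = 22 / 8.7            # ~2.5 corrupt files per TB

    print("1 TB today:     ~%.0f corrupt files" % (1 * corrupt_per_tb))
    print("200 TB in 2017: ~%.0f corrupt files" % (200 * corrupt_per_tb))
    # -> about 3 today, and about 500 on a 200 TB store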

So why not start fixing the problem now?

Comments welcome, of course. Here's a link to the CERN Data Integrity paper. CERN runs Linux clusters, but based on the research Windows and Mac wouldn't be much different.

About

Robin Harris has been a computer buff for over 35 years and selling and marketing data storage for over 30 years in companies large and small.

Talkback

  • Opportunity for software designers, maybe ...

    Maybe this is an opportunity for software designers to create and implement data integrity applications similar to the rise of security applications that are all the rage today.

    By and large, most software today does not have any data integrity logic built into it. Data is assumed to be correctly handled by the underlying I/O routines and the hardware. But we all know that there are data integrity problems with both the operating systems and the hardware, it's just that nobody ever does anything about it.

    Part of the reason is training and education; current computer science training doesn't spend much time on data integrity. When you talk to people trained in the '60s and '70s, this was a much higher priority. Storage media like cards, tape, and early disk systems were physically prone to errors (tape stretching, card jamming, disk head misalignment). Programmers learned and used techniques like parity and CRC checking, redundant data groups, and other data recovery algorithms. Today, everybody expects the OS or the hardware to do it all automatically.
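
    As a tiny illustration of the kind of check that used to be routine, here's a CRC guard on a record in Python (the record and code are purely illustrative, not from any particular system):

        import zlib

        record = b"customer 12345, balance 1024.00"    # illustrative record

        # Store the CRC alongside the record when it is written...
        stored_crc = zlib.crc32(record)

        # ...and recompute it whenever the record is read back.
        def is_intact(data, crc):
            return zlib.crc32(data) == crc

        print(is_intact(record, stored_crc))                # True
        print(is_intact(record[:-1] + b"9", stored_crc))    # False: one byte changed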

    A big factor in the failure is the loose standard of the PC architecture. It's impossible for vendors to do rigorous integration testing of components and subsystems in today's markets, there are so many combinations and variants, and testing adds costs that people aren't willing to pay. The same is true for disk vendors. They are under incredible price pressures even as they roll out new technical innovations every few years. Size matters, quality is a distant second concern.

    Finally, Microsoft bears some of the blame for this problem, because they refuse to acknowledge it. I did a simple demonstration for our CIO awhile back, when he was deciding if we would use Windows or Unix/Linux for our server architecture. I set up a Windows server and a Unix server and ran our software to generate a load on the systems. Then I took a data CD, scratched it with a car key, and put it in the drive on the Windows server. Windows promptly locked up for several minutes while it tried to read the scratched disk, freezing all access to the software application. Then I put the scratched CD in the Unix system, and after a few seconds, a window popped up on the console informing us of an error reading the CD, while the system kept humming along. It took the CIO about 30 seconds to make the platform decision ...
    terry flores
    • Linux Performance Demonstration

      Terry,

      That sounds like a great demonstration. Could you post fuller details that some of us could replicate? I'd love to use your example when giving comparison examples to decision makers.
      yyuko@...
    • Windows lockup

      I was beginning to think that I was the only one left on the planet that gets annoyed by this nasty MS Windows habit. IBM's OS/2 kept on working without this lockup. The problem is with MS. I don't know if this happens on Apples.
      kd5auq
    • Um...what did that prove?

      There are a hundred good reasons to choose Linux servers over Windows Server and as many for choosing Windows Server over Linux. Base the argument on real world, application specific reasons. Reading a scratched CD in no way qualifies for making a platform decision. Base your testing on system recovery, application performance, maintenance costs of the OS over time, acquisition cost and the future direction of your application's development environment. You also ignored the fact that most hardware vendors are NOT delivering CD drives into blade server environments.

      I also would have asked what you were doing performing maintenance on a fully-loaded "production" server in the first place. Bad change control practices to say the least.
      stevets32
      • One thing it proved...

        ... is that Windows Server can't deal with a predictable peripheral failure gracefully. This is true of all Windows implementations that I've dealt with. From what I can tell, it's due to bad design decisions made at a time when the programmers working on the project were not thinking ahead to multiple simultaneous apps. The message pump stalls, and the interactive applications are starved. Non-interactive applications may (or may not) continue without noticing the problem -- the CPU time tends to be 90% idle when the apps hang.

        What it didn't show was more robust data protection. That's a function of better hardware support (eg, EDAC/ECL on RAM, byte parity on address/data busses) and more reliable storage systems (hardware RAID w/good parity/data ratio). All of this is more expensive, and some of it is slower than the alternatives. I suspect that Linux could handle this better than Windows.
        filker0
  • This is the background level

    I always thought the data doomsday scenario, the realistic Fight Club possibility, would not involve blowing up the computer centers. The data is backed up off site. They'd build new data centers and recover.

    The real doomsday scenario would be a virus that doesn't do anything bad to the host machine. Instead the virus replaces random characters here and there in the data stream. Every few hundred transactions it replaces a number in an SSN, or a digit in a name or an address, corrupts a database entry. Not enough to raise much alarm, most companies would chalk it up to human error.

    Now imagine that virus working over time, quietly corrupting little bits of data. Customer data, audit data, spreadsheets, letters, database entries. Data that's backed up through several layers and eventually moved off site. Imagine the weight of corruption over time. Imagine coming into work one day and getting hit with the sudden realization you couldn't verify that your customer records or history were accurate. Imagine your customers finding out.

    The IT doomsday scenario doesn't involve bombs or theft or hacking. The IT doomsday scenario is one where gigabytes of data are left untrustworthy and unverifiable. A deliberate attempt to do what Robin is showing happens accidentally anyway.
    Chad_z
    • not as bad as it sounds.

      Most organizations that back up their data keep those backups for a period of time, and keep some intervals permanently. Over the long view, that reduces the chance of data corruption going undetected with no good backup to fall back on.

      However the department of homeland security has now flagged you as a potential terrorist for your post.
      shravenk
      • Re: not as bad as it sounds

        How is this "not as bad as it sounds?" Backing up data that is corrupted does not reduce the possibility of undetected data corruption. The archiving of the data and the discovery of the corruption after the archiving are two separate, independent events. If the data is corrupted at the time of the backup, the process of backing it up does not "uncorrupt" the data.

        If data corruption is discovered, how much of the archived data is verifiably correct without a pristine baseline from which to compare? How do you know when the corruption occurred and which archives are not affected?

        If random characters or numbers are being changed, these errors would be very difficult to notice. They would probably only be discovered by chance if data records were retrieved from the archives that contained obvious errors. The statistical probability of discovering those random errors is proportional to the size of the archive data set being examined. Smaller data sets, such as individual records chosen at random from the archive, would be much less likely to contain a random error.
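
        To put a number on it: if 1 file in 1500 is corrupt (the rate from the article), the chance of noticing depends entirely on how much you sample. A quick illustration:

          # Chance of hitting at least one corrupt file when sampling n files,
          # assuming a corruption rate of 1 in 1500 (the article's figure).
          p = 1 / 1500.0
          for n in (10, 1000, 100000):
              print(n, 1 - (1 - p) ** n)
          # 10 files:      ~0.7% chance of noticing anything
          # 1,000 files:   ~49%
          # 100,000 files: essentially certain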

        If you'll go back and reread the original article, the researchers did a comprehensive analysis involving huge data sets and a solid methodology for uncovering data errors. How many companies do you think go to those extremes? As the article also pointed out, with smaller data sets, those errors would probably be attributed to other more benign causes. If a virus was the cause, it may go undetected for a long, long time.
        ucf1985
  • Considering a CPU oscillating...

    2 billion times a second and something happening with nearly every cycle, for there not to be errors is amazing.

    Figure the odds. This kind of success rate is unheard of for anything.
    bjbrock
    • Success rate?

      Fact is that the article is pointing out that ANY error will cascade and ultimately result in the loss of integrity. I don't care how efficient the modern CPU is when I lose data.

      The reality is that we need to have error checking and correction built into hardware and software, because neither can be trusted to be 100% correct 100% of the time. The idea behind computers and computing is that the results must be certain and stable. When they are not, we are driven to the approach the military uses: multiple systems processing in parallel, with a vote carried out to determine the correct result.
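
      In miniature, that voting approach looks something like this (an illustrative sketch only, not any real redundant-system code):

        from collections import Counter

        def vote(results):
            """Majority vote over redundant computations (triple modular redundancy)."""
            winner, count = Counter(results).most_common(1)[0]
            if count < len(results) // 2 + 1:
                raise RuntimeError("no majority - result cannot be trusted")
            return winner

        # Three redundant units compute the same value; one result is corrupted.
        print(vote([42, 42, 41]))    # -> 42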

      Data is not a cheap commodity and silent data corruption is extremely dangerous.

      We need to bring pressure to bear on the vendors and creators of our systems to make this a thing of the past.
      Technocrat@...
      • Whoa. Take a chill pill.

        I never said we didn't need good data. I simply pointed out an observation. And I still say it. There isn't another thing on this planet that can claim the success rate of the modern PC.

        Multiply 2 billion times 60 times 60 times 24. That's how many cycles happen in a day. For there not to be errors is impossible. How you handle those errors is the question.

        So when are you going to bring the pressure to bear and how are you going to do it? And just how are we going to achieve perfection?

        Nothing in this world is perfect. When you figure out how to make it perfect I'm sure you'll be a rich man.
        bjbrock
        • Actually, living systems can claim a success rate far above modern PCs

          Our genetics contain a built-in backup copy in our chromosomes. Our cell replication mechanisms have error correcting features built in. Cells, tissues, and organs all have healing & duplication capabilities built into the system. Every one of us gets data corruption at the cellular level that results in "cancer"; yet most of the time the body clears itself of the problem before we even notice it. It's really only in relatively rare instances that someone actually develops a cancer that progresses to killing us. It's only because we each have trillions of cells, and trillions of chances to develop cancer, that as many get that far in the first place.
          Dr_Zinj
  • RE: Data corruption is worse than you know

    Wow.. those are abysmal numbers.

    Last time I designed a SCSI-based RAID-5 subsystem, my goal was to achieve an error rate of 1 in 10^22 bits, which was the undetected error rate of the underlying disk drives.

    Environmental/dynamic stress testing verified the overall design to more than 1*10^17 bits with zero errors. (That technology was subsequently purchased by Sun Microsystems.)

    It's fairly obvious that modern quality control has taken a turn for the worse.

    I suspect that this is just another artifact of Offshoring/H-1B/L-1 programs and the displacement of older, more seasoned US workers. Newbies and their management do not have the experience to grasp and embrace the concepts of quality control.
    thetruth_z
    • I think we are pushing...

      hardware to its limits, whether it's areal density or cycles per second or whatever benchmark you choose. Even software is becoming so bloated it's out of control.

      There are physical limits in the world we live in. Maybe we are reaching some of those limits in the IT world.
      bjbrock
      • Smaller, Better, Faster...CHEAPER...

        Excellent point...if you read these forums on a regular basis, it is painfully obvious that everyone wants a rock-solid 1, 2 or 5 TB locally attached 100% bomb-proof storage system for less than $100. It just isn't going to happen.

        Make the distinction between nice-to-have data and critical data and then stick your CRITICAL stuff on a Block-IO device like a true SAN solution. RAID is fine, but what does the 3rd letter stand for?

        Clear case of you get what you pay for and if you don't pay much, you aren't going to get much.
        stevets32
    • Quality is now defined ...

      Quality is now defined as "good enough to meet requirements", which can be pretty sloppy.
      kd5auq
    • You pretend to pay us, we pretend to work

      Bravo -- someone finally said it.

      Storage product manufacturers think moving development and testing to Bangalore is the way they can keep executive compensation at an all time high. This comes at the risk of product quality and the loss of core competencies in the organization, which the executive team is supposed to maintain.

      Most managers don't want to hear the mantra of product developers and testers over there: "You pretend to pay us, we pretend to work". This is why employee retention there is less than a year, which is catastrophic to a data storage industry that requires six months to a year of training and mentoring to bring developers up to speed on storage technologies.

      For some reason, storage product manufacturers' executive teams think high quality people in Bangalore are happy working for $25-30/hr in an area where a sub-standard apt. costs $2200/mo with an 11 month upfront deposit. I guess they should be happy about their lifestyle and grateful to be working for the data storage industry's slave drivers while their executive task masters line their pockets with the fruits of their labors. (No, I'm not a communist, socialist or pro union.)

      The worst part about this: when companies have experienced people domestically (I have personally seen this), they are let go because they have the personal integrity to push back against poor management decisions that adversely affect product quality. Combined with the fact that they are making more than $70K annually for working 55 hours every week, their compensation is considered too high. Can you imagine, someone who isn't an exec wanting to be paid a decent wage? How dare they!!!

      If that wasn't bad enough, experienced first and second level line managers no longer push back against management to protect their employees or the quality of the products. They've been beaten down so many times by upper management that many are just riding the wave to retirement. The rest are just poor managers, or don't have the management experience or the background in data storage products to be effective. They are just trying to survive. Each day they show up to their offices and "get with the plan", blindly tap dancing to the beat of the task masters cracking their whips, hoping to keep their heads off the chopping block.

      Because of manager hiring practices and alienation of experienced talent in the decision processes, there are many poor management decisions. Out of frustration, many very talented people, both experienced and junior, are leaving - not just companies, but the data storage industry !!

      After 25 years in the business, regrettably, this is the first time I can say the data storage industry is in real trouble. We should look to the executive teams' mis-management as the cause. In my experience, at least domestically, developers and testers want to build high quality, low cost products, but poor management decisions prevent them from doing so.

      Yes, the data storage industry is in real trouble and no one is looking or listening.
      xfer_rdy
  • Microsoft scandisk errors

    I have had M$ Scandisk wipe out 1 TB servers in one fatal swoop. I had my server running just fine when I noticed that it wanted to run Scandisk. I stopped it from running (before it started), ran a complete backup, then restarted the server and let M$ Scandisk run like it wanted to. Windows decided that all the data on the drives was corrupt and deleted 95% of everything.

    I reformatted the drives, copied all the data back from the backup, and it ran fine for 3 more years without any known data corruption.

    Thank God I had time to get a backup... I have had this happen on several different computers that I have worked on: restart the computer, Scandisk starts running and deleting everything without so much as a prompt... thanks Bill Gates for caring...
    Qlueless
    • What?

      You ran M$ Scandisk on a server? Was that Windows ME Server Edition or XP Pro DataCenter?

      That's not Gates doing it to you...it's you doing it to you.
      stevets32
    • What??? - you may need to go back to training

      I seriously doubt Mr. Gates/ Windows did that to you...
      ItsTheBottomLine