
Data corruption is worse than you know

By Robin Harris | September 17, 2007, 9:01pm PDT

Many people reacted with disbelief to my recent series on data corruption (see How data gets lost, 50 ways to lose your data and How Microsoft puts your data at risk), claiming it had never happened to them. Really? Never had to reinstall an application, an OS, or had a file that wouldn’t open?

Are you sure?
The research on silent data corruption has been theoretical or anecdotal, not statistical. But now, finally, some statistics are in. And the numbers are worse than I’d imagined.

Petabytes of on-disk data analyzed
At CERN, the world’s largest particle physics lab, several researchers have analyzed the creation and propagation of silent data corruption. CERN’s huge collider - built beneath Switzerland and France - will generate 15 thousand terabytes of data next year.

The experiments at CERN - high energy “shots” that create many terabytes of data in a few seconds - then require months of careful statistical analysis to find traces of rare and short-lived particles. Errors in the data could invalidate the results, so CERN scientists and engineers did a systematic analysis to find silent data corruption events.

Statistics work best with large sample sizes. As you’ll see, CERN has very large sample sizes.

The program
The analysis looked at data corruption at 3 levels:

  • Disk errors. They wrote a special 2 GB file to more than 3,000 nodes every 2 hours and read it back checking for errors for 5 weeks. They found 500 errors on 100 nodes.
    • Single bit errors. 10% of disk errors.
    • Sector (512 bytes) sized errors. 10% of disk errors.
    • 64 KB regions. 80% of disk errors. This one turned out to be a bug in WD disk firmware interacting with 3Ware controller cards which CERN fixed by updating the firmware in 3,000 drives.
  • RAID errors. They ran the verify command on 492 RAID systems each week for 4 weeks. The RAID controllers were spec’d at a Bit Error Rate of 1 in 10^14 bits read/written. The good news is that the observed BER was only about a third of the spec’d rate. The bad news is that in reading/writing 2.4 petabytes of data there were some 300 errors.
  • Memory errors. Good news: only 3 double-bit errors in 3 months on 1300 nodes. Bad news: according to the spec there shouldn’t have been any. Single-bit errors get corrected; only double-bit errors can’t be.

All of these errors will corrupt user data. When they checked 8.7 TB of user data for corruption - 33,700 files - they found 22 corrupted files, or 1 in every 1500 files.
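The disk-level probe described above (write a known file, read it back, compare) can be sketched roughly as follows. This is only an illustration, not CERN's actual test program: the function names, the SHA-256 comparison, the 1 MiB chunk size and the pattern generator are all assumptions made for this sketch. Note also that a read-back this soon after the write may be served from the OS page cache rather than from the platters, so a real probe would have to defeat caching.

```python
import hashlib
import os

CHUNK = 1 << 20  # 1 MiB per write; an arbitrary choice for this sketch


def write_pattern(path, total_bytes, seed=b"probe"):
    """Write a deterministic pattern to path and return its SHA-256 digest."""
    digest = hashlib.sha256()
    written = 0
    counter = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            # Derive a 32-byte block from the seed and a counter, then repeat it.
            block = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
            chunk = (block * (CHUNK // len(block)))[: total_bytes - written]
            f.write(chunk)
            digest.update(chunk)
            written += len(chunk)
            counter += 1
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return digest.hexdigest()


def read_digest(path):
    """Read the file back in chunks and return its SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(CHUNK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def probe(path, total_bytes):
    """Return True if the data read back matches what was written."""
    expected = write_pattern(path, total_bytes)
    return read_digest(path) == expected
```

Run on a schedule across many machines, a probe like this turns "it never happened to me" into a count of mismatches per bytes written.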

The bottom line
CERN found an overall byte error rate of about 1 in 3 * 10^7, a rate considerably higher than numbers like 1 in 10^14 or 1 in 10^12 spec’d for components would suggest. This isn’t sinister.

It’s the BER of each link in the chain from CPU to disk and back again, plus the fact that some traffic, such as transferring a byte from the network to a disk, requires 6 memory r/w operations. That really pumps up the data volume and with it the likelihood of encountering an error.
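To see why a chain of reliable links is less reliable than any single link, here is a small sketch. It assumes each operation fails independently, which real hardware may not honor, and the per-operation rates in the example are made-up placeholders, not figures from the CERN paper. The function name is also just for illustration.

```python
import math


def p_any_error(per_op_rates):
    """Probability of at least one error across independent operations.

    Computes 1 - prod(1 - p) using log1p/expm1 so that very small
    probabilities do not vanish in floating-point rounding.
    """
    log_all_ok = sum(math.log1p(-p) for p in per_op_rates)
    return -math.expm1(log_all_ok)


# Hypothetical rates: six memory operations plus one disk write.
memory_ops = [1e-12] * 6
disk_op = [1e-14]
chain = p_any_error(memory_ops + disk_op)
```

For small rates the result is close to the simple sum of the individual rates, so the weakest and most frequently used link dominates the total.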

The Storage Bits take
My system has 1 TB of data on it, so if the CERN numbers hold true for me I have 3 corrupt files. Not a big deal for most people today. But if the industry doesn’t fix it, the silent data corruption problem will get worse. In “Rules of thumb in data engineering” the late Jim Gray posited that everything on disk today will be in main memory in 10 years.

If that empirical relationship holds, my PC in 2017 will have a 1 TB main memory and a 200 TB disk store. And about 500 corrupt files. At that point everyone will see data corruption and the vendors will have to do something.
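The arithmetic behind those estimates is easy to check. The sketch below uses only the figures quoted above (22 corrupt files among 33,700 files in 8.7 TB) and assumes, as the article implicitly does, that corruption scales linearly with the amount of data stored. The function names are illustrative.

```python
CORRUPT_FILES = 22       # corrupted files found by CERN
FILES_CHECKED = 33_700   # files examined
TB_CHECKED = 8.7         # terabytes of user data examined


def files_per_corruption():
    """How many files, on average, per corrupted file."""
    return FILES_CHECKED / CORRUPT_FILES


def expected_corrupt_files(terabytes):
    """Linear projection of corrupt files for a given data volume."""
    return CORRUPT_FILES / TB_CHECKED * terabytes


one_in = files_per_corruption()        # about 1 in 1,500 files
today = expected_corrupt_files(1)      # about 2.5, rounded up to 3 in the text
in_2017 = expected_corrupt_files(200)  # a little over 500
```

The projection says nothing about file sizes or workloads, so treat it as an order-of-magnitude estimate.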

So why not start fixing the problem now?

Comments welcome, of course. Here’s a link to the CERN Data Integrity paper. CERN runs Linux clusters, but based on the research Windows and Mac wouldn’t be much different.



Disclosure

Robin Harris

Robin Harris is a president of TechnoQWAN, a consulting and analyst firm in northern Arizona. He also writes StorageMojo.com, a blog which accepts advertising from companies in the storage industry, and has a 25 year history with IT vendors. He has many industry contacts, many of whom are friends and all of whom he has opinions about. Robin has relationships with many companies in the technology industry. Every company he writes about may have sought to influence his opinion through carefully-crafted marketing messages and self-serving white papers, gifts ranging from desk calendars, t-shirts, lunches and trips as well as analyst or consulting assignments. He also invests in some technology companies. He may accept payment for services in stock as well. Robin discloses financial investments in or client relationships with companies named in Storage Bits. To help readers sort out the gold from the dross in his writings, Robin tries to communicate his reasons as clearly as he can. If you agree, you are intelligent and discerning. If you disagree, well, you disagree. In all cases, Robin encourages readers to subject everything they read, see or hear on the internet or from politicians to some simple questions: * What assumptions are implicit in the world view and judgments of the author? * What, if any, is the factual basis for the opinions the author expresses? * Is it reasonable, logical and clear? Your critical faculties: use ‘em or lose ‘em!

Biography

Robin Harris

Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small. He introduced a couple of multi-billion dollar storage products (DLT, the first Fibre Channel array) to market, as well as many smaller ones. Earlier he spent 10 years marketing servers and networks. After leaving corporate life he founded TechnoQWAN, a consulting and analyst firm. He also developed StorageMojo into one of the top storage industry blogs.

Robin writes, consults, coaches and lives among the mountains of northern Arizona.

82 Comments


ZFS fixes these problems. Your data is totally safe.
Orvar 11th Dec 2009
ZFS fixes these problems. Read this short article to get more information about this problem:
http://queue.acm.org/detail.cfm?id=1317400
0 Votes
+ -
Maybe this is an opportunity for software designers to create and implement data integrity applications similar to the rise of security applications that are all the rage today.

By and large, most software today does not have any data integrity logic built into it. Data is assumed to be correctly handled by the underlying I/O routines and the hardware. But we all know that there are data integrity problems with both the operating systems and the hardware; it's just that nobody ever does anything about it.

Part of the reason is training and education; current computer science training doesn't spend much time on data integrity. When you talk to people trained in the '60s and '70s, this was a much higher priority. Storage mediums like cards, tape, and early disk systems were physically prone to errors (tape stretching, card jamming, disk head misalignment). Programmers learned and used techniques like parity and CRC checking, redundant data groups, and other data recovery algorithms. Today, everybody expects the OS or the hardware to do it all automatically.

A big factor in the failure is the loose standard of the PC architecture. It's impossible for vendors to do rigorous integration testing of components and subsystems in today's markets: there are so many combinations and variants, and testing adds costs that people aren't willing to pay. The same is true for disk vendors. They are under incredible price pressures even as they roll out new technical innovations every few years. Size matters; quality is a distant second concern.

Finally, Microsoft bears some of the blame for this problem, because they refuse to acknowledge it. I did a simple demonstration for our CIO a while back, when he was deciding if we would use Windows or Unix/Linux for our server architecture. I set up a Windows server and a Unix server and ran our software to generate a load on the systems. Then I took a data CD, scratched it with a car key, and put it in the drive on the Windows server. Windows promptly locked up for several minutes while it tried to read the scratched disk, freezing all access to the software application. Then I put the scratched CD in the Unix system, and after a few seconds, a window popped up on the console informing us of an error reading the CD, while the system kept humming along. It took the CIO about 30 seconds to make the platform decision ...
0 Votes
+ -
Linux Performance Demonstration
yyuko@... 18th Sep 2007
Terry,

That sounds like a great demonstration. Could you post fuller details that some of us could replicate? I'd love to use your example when giving comparison examples to decision makers.
0 Votes
+ -
Windows lockup
kd5auq 18th Sep 2007
I was beginning to think that I was the only one left on the planet that gets annoyed by this nasty MS Windows habit.
IBM's OS/2 kept on working without this lockup.
The problem is with MS. I don't know if this happens on Apples.
0 Votes
+ -
Um...what did that prove?
stevets32 18th Sep 2007
There are a hundred good reasons to choose Linux servers over Windows Server and as many for choosing Windows Server over Linux. Base the argument on real world, application specific reasons. Reading a scratched CD in no way qualifies for making a platform decision. Base your testing on system recovery, application performance, maintenance costs of the OS over time, acquisition cost and the future direction of your application's development environment. You also ignored the fact that most hardware vendors are NOT delivering CD drives into blade server environments.

I also would have asked what you were doing performing maintenance on a fully-loaded "production" server in the first place. Bad change control practices to say the least.
0 Votes
+ -
One thing it proved...
filker0 18th Sep 2007
... is that Windows Server can't deal with a predictable peripheral failure gracefully. This is true of all Windows implementations that I've dealt with. From what I can tell, it's due to bad design decisions made at a time when the programmers working on the project were not thinking ahead to multiple simultaneous apps. The message pump stalls, the interactive applications are starved. Non-interactive applications may (or may not) continue without noticing the problem -- the CPU time tends to be 90% idle when the apps hang.

What it didn't show was more robust data protection. That's a function of better hardware support (eg, EDAC/ECC on RAM, byte parity on address/data busses) and more reliable storage systems (hardware RAID w/good parity/data ratio). All of this is more expensive, and some of it is slower than the alternatives. I suspect that Linux could handle this better than Windows.
0 Votes
+ -
This is the background level
Chad_z 18th Sep 2007
I always thought the data doomsday scenario, the realistic Fight Club possibility, would not involve blowing up the computer centers. The data is backed up off site. They'd build new data centers and recover.

The real doomsday scenario would be a virus that doesn't do anything bad to the host machine. Instead the virus replaces random characters here and there in the data stream. Every few hundred transactions it replaces a number in an SSN, or a digit in a name or an address, corrupts a database entry. Not enough to raise much alarm, most companies would chalk it up to human error.

Now imagine that virus working over time, quietly corrupting little bits of data. Customer data, audit data, spreadsheets, letters, database entries. Data that's backed up through several layers and eventually moved off site. Imagine the weight of corruption over time. Imagine coming into work one day and getting hit with the sudden realization you couldn't verify that your customer records or history were accurate. Imagine your customers finding out.

The IT doomsday scenario doesn't involve bombs or theft or hacking. The IT doomsday scenario is one where gigabytes of data are left untrustworthy and unverifiable. A deliberate attempt to do what Robin is showing happens accidentally anyway.
0 Votes
+ -
not as bad as it sounds.
shravenk 18th Sep 2007
Most organizations that back up keep those backups for a period of time, and some intervals permanently.
That reduces the possibility of undetected data corruption without backup over the long view.

However the Department of Homeland Security has now flagged you as a potential terrorist for your post.
0 Votes
+ -
Re: not as bad as it sounds
ucf1985 20th Sep 2007
How is this "not as bad as it sounds?" Backing up data that is corrupted does not reduce the possibility of undetected data corruption. The archiving of the data and the discovery of the corruption after the archiving are two separate, independent events. If the data is corrupted at the time of the backup, the process of backing it up does not "uncorrupt" the data.

If data corruption is discovered, how much of the archived data is verifiably correct without a pristine baseline from which to compare? How do you know when the corruption occurred and which archives are not affected?

If random characters or numbers are being changed, these errors would be very difficult to notice. They would probably only be discovered by chance if data records were retrieved from the archives that contained obvious errors. The statistical probability of discovering those random errors is proportional to the size of the archive data set being examined. Smaller data sets, such as individual records chosen at random from the archive, would be much less likely to contain a random error.

If you'll go back and reread the original article, the researchers did a comprehensive analysis involving huge data sets and a solid methodology for uncovering data errors. How many companies do you think go to those extremes? As the article also pointed out, with smaller data sets, those errors would probably be attributed to other more benign causes. If a virus was the cause, it may go undetected for a long, long time.
0 Votes
+ -
Considering a CPU oscillating...
bjbrock 18th Sep 2007
2 billion times a second and something happening with nearly every cycle, for there not to be errors is amazing.

Figure the odds. This kind of success rate is unheard of for anything.
0 Votes
+ -
Success rate?
Technocrat@... 18th Sep 2007
Fact is that the article is pointing out that ANY error will cascade and ultimately result in the loss of integrity. I don't care how efficient the modern CPU is when I lose data.

The reality is that we need to have error checking and correction built into hardware and software because neither can be trusted to be 100% correct 100% of the time. The idea behind computers and computing is that the results must be certain and stable. When they are not, we are subjected to the concepts that the military uses: multiple systems all processing in parallel and then a vote being carried out to determine the correct result.

Data is not a cheap commodity and silent data corruption is extremely dangerous.

We need to bring pressure to bear on the vendors and creators of our systems to make this a thing of the past.
0 Votes
+ -
Whoa. Take a chill pill.
bjbrock 18th Sep 2007
I never said we didn't need good data. I simply pointed out an observation. And I still say it. There isn't another thing on this planet that can claim the success rate of the modern PC.

Multiply 2 billion times 60 times 60 times 24. That's how many cycles happen a day. For there not to be errors is impossible. How you handle those errors is the question.

So when are you going to bring the pressure to bear and how are you going to do it? And just how are we going to achieve perfection?

Nothing in this world is perfect. When you figure out how to make it perfect I'm sure you'll be a rich man.
Our genetics contain a built-in backup copy in our chromosomes. Our cell replication mechanisms have error correcting features built in. Cells, tissues, and organs all have healing & duplication capabilities built into the system. Every one of us gets data corruption at the cellular level that results in "cancer"; yet most of the time the body clears itself of the problem before we even notice it. It's really only in relatively rare instances that someone actually develops a cancer that progresses to killing us. It's only because we have trillions of cells each and trillions of chances to develop cancer that we have as many that get that far in the first place.
0 Votes
+ -
Wow.. those are abysmal numbers.

Last time I designed a SCSI based raid-5 subsystem, my goal was to achieve an error rate of 1 in 10^22 bits, which was the undetected error rate of the underlying disk drives.

Environmental/dynamic stress testing verified the overall design to more than 1*10^17 bits with zero errors. (That technology was subsequently purchased by Sun Microsystems.)

It's fairly obvious that modern quality control has taken a turn for the worse.

I suspect that this is just another artifact of Offshoring/H-1B/L-1 programs and the displacement of older, more seasoned US workers. Newbies and their management do not have the experience to grasp and embrace the concepts of quality control.
0 Votes
+ -
I think we are pushing...
bjbrock 18th Sep 2007
hardware to its limits. Whether it's areal density or cycles per second or whatever benchmark you choose. Even software is becoming so bloated it's out of control.

There are physical limits in the world we live in. Maybe we are reaching some of those limits in the IT world.
0 Votes
+ -
Smaller, Better, Faster...CHEAPER...
stevets32 18th Sep 2007
Excellent point...if you read these forums on a regular basis, it is painfully obvious that everyone wants a rock-solid 1, 2 or 5 TB locally attached 100% bomb-proof storage system for less than $100. It just isn't going to happen.

Make the distinction between nice-to-have data and critical data and then stick your CRITICAL stuff on a Block-IO device like a true SAN solution. RAID is fine, but what does the 3rd letter stand for?

Clear case of you get what you pay for and if you don't pay much, you aren't going to get much.
0 Votes
+ -
Quality is now defined ...
kd5auq 18th Sep 2007
Quality is now defined as "good enough to meet requirements" which can be pretty sloppy.
0 Votes
+ -
Bravo -- someone finally said it.

Storage product manufacturers think moving development and testing to Bangalore is the way they can keep executive compensation at an all time high. This comes at the risk of product quality and the loss of core competencies in the organization, which the executive team is supposed to maintain.

Most managers don't want to hear the mantra of product developers and testers over there: "You pretend to pay us, we pretend to work". This is why employee retention is less than 1 yr and catastrophic to the data storage industry that requires 6 months to a year of training and mentoring to bring developers up to speed on storage technologies.

For some reason, storage product manufacturers' executive teams think high quality people in Bangalore are happy working for $25-30/hr in an area where a sub-standard apt. costs $2200/mo with an 11 month upfront deposit. I guess they should be happy about their life style and be grateful they are working for the data storage industry slave drivers while their executive task masters line their pockets with the fruits of their labors. (No I'm not a communist, socialist or pro union)

The worst part about this: when companies have experienced people domestically (I have personally seen this), they are let go because they have the personal integrity to push back against poor management decisions that adversely affect product quality. Combined with the fact they are making more than $70K annually for working 55 hrs. every week, their compensation is considered too high. Can you imagine, someone who is not an exec wanting to get paid a decent wage? How dare they!!!

If that wasn't bad enough, experienced first and second level line managers no longer push back against management to protect their employees or the quality of the products. They've been beaten down so many times by upper management, many are just riding the wave to retirement. The rest are just poor managers or don't have the experience in management or the background in data storage products to be effective. They are just trying to survive. Each day they show up to their offices and "get with the plan", blindly tap dancing to the beat of the task masters cracking their whips, hoping to avoid their head on the chopping block.

Because of manager hiring practices and alienation of experienced talent in the decision processes, there are many poor management decisions. Out of frustration, many very talented people, both experienced and junior, are leaving - not just companies, but the data storage industry !!

After 25 years in the business, regrettably, this is the first time I can say the data storage industry is in real trouble. We should look to the executive teams' mis-management as the cause. In my experience, at least domestically, developers and testers want to build high quality, low cost products, but poor management decisions prevent them from doing so.

Yes, the data storage industry is in real trouble and no one is looking or listening.
0 Votes
+ -
Microsoft scandisk errors
Qlueless 18th Sep 2007
I have had M$ Scandisk wipe out 1TB servers in one fell swoop. I had my server running just fine, I noticed that it was wanting to run scan disk, and stopped it from running (before it started), ran a complete backup, then restarted the server and let M$ scandisk run like it wanted to. Windows decided that all the data on the drives was corrupt and deleted 95% of everything.

I reformatted the drives and copied all the data back from the backup and it ran fine for 3 more years without any known data corruption.

Thank God I had time to get a backup... I have had this happen on several different computers that I have worked on: restart the computer, scandisk starts running and deleting everything without so much as a prompt...thanks Bill Gates for caring...
0 Votes
+ -
What?
stevets32 18th Sep 2007
You ran M$ Scandisk on a server? Was that Windows ME Server Edition or XP Pro DataCenter?

That's not Gates doing it to you...it's you doing it to you.
0 Votes
+ -
What??? - you may need to go back to training
ItsTheBottomLine 18th Sep 2007
I seriously doubt Mr. Gates/ Windows did that to you...
0 Votes
+ -
Huh?!?!
cornpie 18th Sep 2007
Who said anything about Microsoft or Windows? I rather doubt that CERN is running its multi-terabytes of data on any system where you could run scandisk.
0 Votes
+ -
Problem solved already maybe ??? ... it's called ZFS and it's available for free with Solaris 10.
0 Votes
+ -
I don't think so
croberts 18th Sep 2007
If the error originates in memory, it will get written to the hard drive regardless of the file system used.
0 Votes
+ -
RE: Data corruption is worse than you know
rfrysztak@... 18th Sep 2007
Just because you MAY have 200TB of storage in 2017 doesn't mean you will have 200TB of information on your disk. Is it full now? Realistically, most people rarely if ever fill their hard drive, or fully utilize main memory. Your 500 corrupted files will likely be far fewer, and a catastrophic failure is much more likely than the "data leak" you speak of. By simply backing data up in more than one place (separate hard drive, data stick, etc), you will significantly reduce the possible error rate to any one individual file and give yourself peace of mind for the reliability of your data. What is true now will certainly be true then - back it up!
Bob
0 Votes
+ -
A While Back...
KenQ 18th Sep 2007
My first HDD was only 30 megabytes and I didn't fill it either. I can easily see 200 terabytes in 2017.
0 Votes
+ -
RE:A while back
GreyGeek 18th Sep 2007
Back around 1980ish I installed a 5MB Corvus HD on an Apple ][ owned by my client, a parts store. The Corvus had to be turned on 15 minutes BEFORE the Apple so it could warm up and stabilize first.

He had been running with two Disk ]['s. My first thought was "When will he ever fill this thing up?", thinking it would take years. It took only a year before I had to upgrade him to a 10MB IBM PC.

The more space you have the more data you want to keep. 1 TB HDs won't be big enough for laptops in a few years.
0 Votes
+ -
RE: Data corruption is worse than you know
markyannone 18th Sep 2007
Be thankful for that corruption. Do whatever you can to corrupt and invalidate the databases that contain your personal information. Start today. Heresy? Not any more. -Mark Yannone
0 Votes
+ -
So, TRON 2.0 plot is becoming reality!
0 Votes
+ -
If they DETECTED 3 double-bit memory errors on 1300 (of 3000?) nodes, how many went undetected? You cannot check the integrity of a disk if the test system is not bulletproof.

Who wrote the test algorithm? Just because "scientists and engineers" performed tests using huge amounts of data does not make the tests valid.
The unexpectedly high level of errors (according to specs which are based on currently known technology and scientific models) is not a problem for me.

This just demonstrates that we don't know everything and that something, still undetected in a measurable way, can now be seriously envisioned, and that new kinds of detectors can be created. If we can estimate these errors, we have a reliable detector, and we can build technology to use this detection to improve our error detection rate.

Technology is not necessarily new hardware, it may just be software solutions monitoring the error detection rate, and adaptation to these conditions.

See how our computer hardware technologies are now very near to the Heisenberg uncertainty barrier, where the dual nature of matter (wave versus quantized corpuscle) starts appearing and limits our physical interaction with it and our attempts to take isolated measurements.

Due to this, the way to protect from these errors is not only trying to isolate the interactions (something that is becoming extremely difficult and costly to build), but increase the duplication of "reproducible" events so that if an "error" occurs somewhere, it will be measurable by comparing the results generated in other places.

The method used by CERN is interesting: it allows detecting those errors by multiplying the number of independent computing nodes.

I think this is the way to go, and something that is already occurring now: the power of a computer is not only the power of a single core, but can be increased using multiple cores, and by using distributed computing grids, which could be integrated into a single computer at lower cost than trying to increase the computing speed of a single core.

The future of computing technology is already traced: tomorrow's computer will not work in isolation, but will live in the network. Tomorrow's power of computers is in the network, and in the number of nodes we can connect to it. Maybe in the future, we won't be able to use any computer in isolation, because the power of the network, as a distributed computing grid, will be needed...

But for now, the network itself is not reliable enough in its data transmission to allow building very large reliable computing grids. We need technologies to improve the communication links and reduce the transmission error rates, with better error detection and correction systems.

Same thing about current RAID storage systems. Maybe the future of data storage is also in distributed storage, using the same improved networking technologies, for reliability.

So the storage capacity of the network storage medium will not depend on the number of nodes connected to it but on the number of independent connections that can be made in parallel between these nodes: this number of theoretically possible links grows much faster than the number of participating nodes.

Consider our brain: we can easily estimate a reliable number of neurones working in it. But the number of links is extremely difficult to estimate as it varies constantly. The "computing" power of our brain and its efficiency do not really depend on the number of neurones it contains, but on its capacity to create new synaptic connections.

This is now used medically, to improve recovery after local damage to a part of the brain, and explains why people who have been victims of accidents can recover almost all their skills even though a part of their brain is permanently damaged. And it's significant that children can learn lots of things in a short period of time, at a time when the creation of synaptic links is at its maximum. To recover after an accident, people take medications to improve the capacity of neurones to create new synaptic links.

We can do the same thing with our computers and networks, using the natural capacity of networks to adapt themselves to errors.

In our society, we have similar things: the power of a group of persons is not just the sum of their skills and abilities, it's much more; and the group can be more efficient and can adapt to the existence of deficient participants: the efficiency of the group comes from its capacity to create MORE links between its participants.

This explains why old organizational models using strict hierarchical structures are slow to adapt themselves and less efficient than organizations working with a nearly flat hierarchical system where more interactions are allowed between members of the working team, at large. This explains also why democracies are more successful at adapting themselves to changes in environmental and political issues than dictatorships.
0 Votes
+ -
Or they have a bad test procedure. I think it is irresponsible to come to the conclusion that all systems will have 1 error in 3x10^7 read/write cycles based on this single report.
0 Votes
+ -
then who's wrong?
PhilippeV 18th Sep 2007
So what can you conclude?
If CERN is performing tests like this, it's because it needs to verify the assertions made by manufacturers to see if they remain valid when handling larger problems than those for which those elements were designed.

I can make a parallel with something you can verify yourself:

Look at a checker board with white and black squares alternated. You can conclude that each square is either 100% black or 100% white. You could try to verify this fact by measuring the whiteness of each isolated square: when you do that, by inspecting the state of each cell, you modify the problem with your measure. Then you conclude easily that each element is 100% white or 100% black with 0% of errors.

Now combine those squares in a new checkerboard where only half of the previously black cells are effectively black: use white cells instead. Now look at the resulting board: you will see "gray" cells, even though there's no gray cell if you try to measure one individually. This shows you that when you try to measure something by isolating an element from its environment, you have modified the problem.

In fact the whiteness characteristic of each cell is not a completely isolated property, because it interacts with the instrument used to perform the measure (here your eye). The "grayness" of each white or black cell preexists in each cell, even if you can't detect it at the individual cell level. But it appears and becomes visible when you combine cells in checkers, in easily reproducible cells.

In other words, a measure made on individual elements is not lying, it's just that it does not say anything about how these elements interact really with each other and with the instrument used to measure it.

Some will argue that this is caused by imperfections of our eyes, which make them see defective gray cells. My opinion is that every measurement instrument behaves like our eyes and is subject to such illusions. In fact, it's not an illusion, because the effect is easily reproducible and can be seen by everyone. The "illusion" is a true physical fact, inherent to the problem we are trying to measure, but it cannot be predicted from the specs of each element (for example, the specifications about the white or black nature of each manufactured cell in this problem).

Our universe is like this: what you can conclude about a problem at one scale is wrong and completely unrelated to what you can conclude at another scale. The combination of multiple individual elements is not the simple sum (or product) of the number of possible states of each element; it's more than that. And the more elements you add, the more interactions you enable, which will at first be perceived as higher-than-expected error rates.

But in fact, this "chaotic" situation has an intrinsic logic. There's information hidden in each element that is only perceived predictably in problems at larger scales. In other words: even chaos is structured, and this structure is not inherent to the individual elements that make it up, but to the number of interactions each element can create with nearby elements. If you try to measure something on an individual element, either you won't detect anything and may conclude there is a 0% error rate, or you won't conclude anything and will think that the element is completely unusable because its behavior is unpredictable. But you have modified the problem by your measurement! What you have concluded is wrong; you have been fooled by the measurement itself.

Remember how specs are designed to describe manufactured elements: they all describe a very strictly limited environment, with severe conditions that will never be met when these manufactured elements are actually used: temperature, voltage stability, an isolated system, electromagnetic shielding, and so on. The specs therefore do not describe each element completely, because it's not possible to describe it completely at this isolated level.

This means that the manufacturers of the systems used by CERN were not necessarily lying in their specs. But the usage tested by CERN creates a testing environment that is not the same as the one used in the specs of each manufactured element (hard disk, memory unit, processor, network link...). The specs voluntarily ignore some inherent states of these elements, plus many other conditions that were not even known to the designers and that they currently describe as unexplained "error rates". Those rates are only bounded within the limits of the experiments described in the specification, which will always try to simplify the problem to some reasonable classes of typical applications.

Now when you try to reuse these elements to solve larger problems in a more systematic way, you go far beyond what was described in the specifications, because nothing is said about the interactions of separate instances of these elements. That is left entirely to the users, who are free to use them (and experiment for themselves) in their own conditions. No specification can describe the countless possible interactions in which these manufactured elements will be used, even if the specs contained thousands of pages of documentation and data tables, something that would rapidly become unusable for almost everybody and extremely costly to produce.
0 Votes
Ignorance is bliss
xfer_rdy 18th Sep 2007
It's foolish to assume storage device/systems companies will not ship a product with silent data corruption errors (undetected by hw or sw). These types of problems are real, difficult to create and even harder to recreate, isolate and fix. This becomes especially true when storage companies do not make adequate investments in development, tools and product quality, don't want to pay for high-quality, experienced talent, and try to do things "on the cheap" by off-shoring instead of optimizing their businesses.
Undetectable data errors are a fact of life, and the problem is only going to get worse as we accumulate more data. The big question is: what to do about it?

Unfortunately, the storage companies will not respond until something catastrophic occurs, like someone dies because their cat scans are corrupted, and the storage manufacturer's business suffers significant financial losses or someone goes to prison for criminal negligence. Customers shouldn't have to wait for such extremes.

End users need to demand the proper levels of quality from their suppliers. The manufacturers need to listen to the end users and not just their OEM customers.
0 Votes
Ignorance is ignorance
JonODonnell 18th Sep 2007
I never said that "storage device/systems companies will not ship a product with silent data corruption errors". What I am saying is that attributing a 5-7 order of magnitude increase in error rate to disk systems in general is "foolish". CERN's particular system configuration appears to have a very high error rate. There are many factors which could cause these errors (in no particular order):

1. The test was flawed.
2. Memory system errors manifesting as disk errors.
3. Bus communication errors
4. Memory corruption (could be OS or test software bugs)
5. More bugs/incompatibility in HD/RAID controller
6. Power supply or AC glitches
7. Hard disk errors
8. Many other factors

Of all of these error sources, the hard disk is the only self-contained piece of hardware which has an error rate associated with it. Jumping to the conclusion that this 1000000x increase in errors is caused by storage device/system companies lying does not seem logical. It is not impossible, but it seems to be a stretch. The other items on the list should be eliminated first.
0 Votes
not necessarily a bug in specs
PhilippeV 18th Sep 2007
You forget one important thing: the design specs of the manufactured devices are not necessarily wrong. They are only valid under very extreme conditions that are unlikely to be reproduced by customers: they assume very strict test conditions and do not reproduce the actual usage pattern. In addition, the qualifying tests performed to measure the accuracy of the built product are only valid for newly manufactured devices.
As soon as you have performed a test, you have used the device. Each test or use alters the quality of the device, due to interactions with the measurement system: every system therefore has a lifetime which depends on its age since manufacturing, but also on its past usage.
So whatever you do, the error rates are constantly growing (due to the Heisenberg uncertainty principle, which is inherent to every physical system; this is the nature of matter).
But who is monitoring the quality degradation of the product once it is used in applications? No device actually measures its own inherent degradation in quality and the increase in its error rates, except possibly hard disks that implement SMART monitoring to help predict when the specs will no longer be valid, meaning that the device needs to be replaced even when it continues working as documented most of the time.

So what is happening? Just the fact that the device's quality is constantly worsening, and there's currently no reliable way to predict when this will happen. If CERN needs quality devices, it must measure how fast the product quality is degrading. This can't be predicted from manufacturer specs alone; you need actual measurements of the lifetime of these devices under the conditions where they will be needed and actually used.

Don't think that manufacturers can make 100% quality products. It's physically impossible to reach and maintain this level, because as soon as the product has been built, it starts aging, its specifications are constantly changing, and this goes faster each time the product is actually used (due to the interactions between the device and the experiment, or between parts of the device itself). And there's absolutely nothing you can do against that, except try to slow down this degradation of performance by minimizing the impact of interactions: if you wanted to stop the degradation, the only way would be to stop using the device and isolate it completely, as well as isolating each part of the device to stop them interacting with each other. But if you do that, the device is completely unusable; you can't reduce interactions to zero without making the device completely useless.

Things that alter the quality of the product are for example:
* the effect of temperature (which is chaotic and completely unpredictable in its local effects, unless you could operate the device at zero kelvin, something impossible to reach).
* the impact of interstellar high-speed, high-energy particles.
* the effect of natural radioactivity, which exists in the nucleus of every atom and is also chaotic and not predictable at the local scale for us (here again due to the Heisenberg uncertainty principle).
* the fact that it's completely impossible for us, at our level or in manufacturing processes, to reproduce the exact state of any piece of material, because you can't know and predict its exact state at any time (same reason as above).
All you can do is try to bound the uncertainty.

Remember what CERN is trying to detect: these uncertainty principles and the apparently chaotic nature of matter are inherent to it. Scientists already know that we can't see and measure directly most of the material that makes up the universe (look for articles about invisible "dark" matter). But what CERN is trying to detect are new particles which can be detected by us only under extremely rare conditions. This requires unprecedented levels of precision, something that cannot be reached using single devices but could be measured by conducting an experiment at a very large scale, requiring commensurate amounts of data. At this scale, even the smallest inherent (unavoidable) errors combine their effects.

The fact that there are defective nodes for this experiment was predictable. But the CERN did not need to demonstrate this fact, just to measure effectively how much these complex interactions were altering the data quality.

It may even happen that those 3000 devices that were seen as defective during the experiment could be tested again according to specs and proven to be non-defective according to the manufacturer specs. So the exact reason why these devices were found failing may remain unknown to us, and completely unavoidable.

To conclude: a 0% error rate does not exist in any system; those manufacturers that claim it (like ECC memory manufacturers arguing that such errors should never occur) are lying. Disk manufacturers are much more honest: they admit that there exists a marginal rate of errors that you must be able to live with. To live with those errors, you need to measure how much they affect the results of your measurements, and be able to define a maximum bound on the error rates in the complex system you are building with these interacting devices.
0 Votes
Almost in Heisenberg's shoes
xfer_rdy 19th Sep 2007
Heisenberg never envisioned the multi-bit error, but Andrew Viterbi did. We all know disk drives will eventually return something other than what was written to them. And to be honest, I don't believe disk drives returning or writing bad data really matters in redundant systems. However, it is very important for data storage systems to detect, and maybe correct, the bad data before it is returned to the end user.

If raid system manufacturers know that disk drives will return bad data, one would assume there are systems in place in the raid controller to monitor data read from the disk and alert someone there's been an error. I think that's a reasonable, but wrong, assumption.

Some raid controllers only invoke error correction IF a raid drive fails or is removed from the system. Hmmm, that means escaped disk drive bit errors, normally occurring, are passed to the end user without intervention.
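To make the point concrete, here is a minimal sketch of what a "verify on read" or scrub pass could look like for a single parity stripe. It assumes simple XOR parity (RAID-5 style); the function names and sample blocks are hypothetical and do not describe any particular controller's firmware.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def stripe_is_consistent(data_blocks, parity):
    """Recompute parity from the data blocks and compare with the stored parity."""
    return xor_parity(data_blocks) == parity

# Hypothetical stripe: three data blocks plus the parity written alongside them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)
ok_before = stripe_is_consistent(data, parity)

# Simulate a silent single-bit flip on the second disk, with no drive failure.
corrupted = [data[0], b"BBBC", data[2]]
ok_after = stripe_is_consistent(corrupted, parity)
```

Note that a parity mismatch only says the stripe is inconsistent; it cannot say which block is wrong. That is why per-block checksums are usually suggested in addition to parity.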

This completely changes the error rate profile of the system. A SATA drive is spec'd at about 1 uncorrectable error bit per 10^14 bits read; SAS is about 1 per 10^15. One PB of data is just about 10^16 bits. Calculating the delivered errors for a PB of data: SAS drives: 10 uncorrectable error bits, SATA drives: 100 uncorrectable error bits. The numbers are higher when factoring in raid controller buses, memory and CPU, the interconnect errors, and the HBA and host computer BERs.
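The arithmetic above can be checked with a few lines of code. This is only a sketch: it assumes a decimal petabyte (10^15 bytes) and treats the spec'd rate as "one uncorrectable bit per N bits read"; the function name is made up for illustration. Without rounding 1 PB up to 10^16 bits, the figures come out slightly lower, at 8 and 80 instead of 10 and 100.

```python
BITS_PER_BYTE = 8
PETABYTE = 10**15  # decimal petabyte, in bytes (assumption)

def expected_errors(bytes_read, bits_per_error):
    """Expected uncorrectable bit errors for a given volume of data read."""
    return bytes_read * BITS_PER_BYTE / bits_per_error

# Spec'd rates quoted in the comment: 1 per 10^14 bits (SATA), 1 per 10^15 (SAS).
sata_errors = expected_errors(PETABYTE, 10**14)
sas_errors = expected_errors(PETABYTE, 10**15)

print(f"SATA: {sata_errors:.0f} expected bit errors per PB")
print(f"SAS:  {sas_errors:.0f} expected bit errors per PB")
```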

Although Heisenberg's models help us rationalize and model chaotic systems, they won't help cure poor design practices in systems. I believe the high error rates experienced in today's storage systems can be significantly improved without delving into the 11th dimension, by focusing on practical nuts-and-bolts solutions like T10-DIF and checkpointing each data path in the raid controllers.
0 Votes
What about the other possibilities?
JonODonnell 19th Sep 2007
I have never stated that the drives could not be the source of the errors. All I am trying to say is that you are IGNORING any other possible explanation for the 100000x increase in error rate over spec beyond the disk system.
0 Votes
Data corruption has been there for years, it's just that nobody said anything. Now with larger programs and files, the corruption problem is getting much worse. I have had to wipe my drive and start from scratch lots of times over the years and it is getting worse. Something needs to be done, especially this day and age where we have gigabytes of photo, video and music files.
0 Votes
photo, video and music files
kd5auq 18th Sep 2007
photo, video and music files can usually tolerate data errors
without the user/viewer noticing. Program code and financial
data errors are only tolerated by politicians.
0 Votes
I see the problem...
reholli@... 18th Sep 2007
They should have been running Seagate HDs...

Seriously though, if they're only running this test scenario on 3Ware controllers and WD disks, then until it's repeated on other controllers and disks with the same outcome, I'm not sure I'd ascribe much validity to their results.
0 Votes
That's not serious
PhilippeV 18th Sep 2007
When performing their test, you can be sure that CERN was careful to use technologies expected to show the best performance and accuracy. Even so, it needed to verify the assertions stated in the specs and to measure the marginal errors that had not been measured until now, simply because those theoretical specs had never actually been verified, for lack of enough verifiable data.

The fact that they use Seagate or Western Digital branded disks is not relevant: all the well-known hard disk brands outsource the building of their hardware to the same set of third-party manufacturers, in plants located in China, the Philippines and so on, working for the same brands...

When you buy a Seagate or WD or IBM harddisk, you actually don't know where the model was actually built, because the harddisk is effectively made of various components of different origins, varying in time depending on market conditions and management of stocks: all these transactions are based only on specifications which are assumed to be verified depending on conditions that are partly specified (because not all conditions are known, and because of limitations in our known science).

So there always remains uncertainty within EVERY build process (just ask why the processor makers need to check each die to select only a few parts, and why so many components are rejected after build, or reused at less demanding performance and sold at lower cost...)

A specification that claims to have no errors is lying every time. There's a theory, but in every process there's a margin of uncertainty that is intrinsic to the way we can measure things. Heisenberg's uncertainty principle should be good reading for you. It demonstrates that you CANNOT build any system that is 100% reliable, but that you can take advantage of these "errors" by combining the effects of multiple independent measurements (or systems) used in patterns that reduce the impact of measurement errors.

Every time you take two systems to create a combined new one, your new system is intrinsically more powerful than the mere sum of the two, but their combined "error" rates are not currently estimated. A significant part of these "errors" in fact creates new reproducible states that can be used for computing purposes, so that what was considered an error in one part is now useful information usable at a higher level.

When you build systems integrating lots of combined elements, like hard disks with lots of storage elements, or processors with lots of transistors, the effect of the combination is only marginally known, and is currently estimated (incorrectly) as the sum of the errors produced by each element.

It's time to renew the way we compute error rates in complex systems combining many elements. The specs will never say everything; you also need to measure the impact of the combination, because these elements won't work in isolation but in association, in interdependent ways (which produce more errors than expected but also many more useful states that are forgotten).

You can no longer ignore the effects of combining separate elements. The only way to take them into account is to measure them. That's what CERN has done in its measurement campaign, and it is right to do so!

That's a perfectly valid scientific way of working: experiment and measure, then conclude. Do not trust only the theory introduced by specs and simplified models, because these will necessarily ignore many things that are NOT measurable at the individual element level, but only when the elements are used in combination.

There's no magic in this result. I even think that these higher "error" rates were unavoidable, because the scale of the problem to solve has completely changed: these manufactured elements were not initially designed to work at this scale (and the specs did not assert that they would remain valid at this scale, when working on petabytes stored in many separately built units whose combination was unpredictable before being actively measured).
This report is interesting: it demonstrates that the specs about the absence of errors in memory are not enforceable with the current implementation.

This means that new technology must be implemented to reduce the possibility of double-bit errors, possibly allowing these errors to be corrected to reach (or better approach) the expected zero error result.

This means new ECC memory technology is now needed... But note that most PC users do not even have ECC memory, because non-ECC memory is cheaper.

When CERN starts analyzing its petabytes of data over many months, it will probably need to support this work using thousands of PCs working in a distributed network. This report has a consequence: their results will not be usable unless the generated results are not only double-checked, but checked multiple times (in other words, a much larger "galaxy" of nodes will be needed to process the data in a distributed network).

Anyway, the interesting thing in this report is that memory is not the main source of errors. Networking components and hard disk technology are much more problematic, by many orders of magnitude!

This report also suggests that RAID technologies for disk storage need to be improved.

This means: better verification system, better ECC algorithms, better data signature (using safer digital signature algorithms for large files, than just the basic MD5 or SHA1: time to implement SHA512 digest algorithm in RAID systems to detect data corruption on one disk and be allowed to select the data from another disk?)
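The idea of using a digest to choose between mirrored copies can be sketched as follows. This is a hypothetical illustration, not how any RAID product works: it assumes the SHA-512 digest was recorded when the data was first written, and the function names and sample data are made up. It uses `hashlib.sha512` from the Python standard library.

```python
import hashlib

def sha512_hex(data):
    """Hex SHA-512 digest of a block of bytes."""
    return hashlib.sha512(data).hexdigest()

def pick_good_copy(copies, expected_digest):
    """Return the first copy whose digest matches, or None if all are corrupt."""
    for copy in copies:
        if sha512_hex(copy) == expected_digest:
            return copy
    return None

# Digest recorded at write time (assumption: it is stored separately from the data).
original = b"collision event 42"
digest = sha512_hex(original)

# Two mirrored copies; the first has been silently corrupted.
disk_a = b"collision event 43"
disk_b = original

recovered = pick_good_copy([disk_a, disk_b], digest)
```

For detecting accidental corruption (as opposed to deliberate tampering), a much cheaper checksum would also do the job; the cryptographic digest is used here only because it is what the comment proposes.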

Maybe we'll need even stronger algorithms for message digests, if SHA512 is not enough for CERN's requirements on petabytes of data? But what about legal concerns when improved algorithms are still restricted by law as military-grade technologies, restricted from import/export or even use?

Now consider these reported figures, and apply them to the worldwide volume of financial transactions: these easily amount to petabytes of data exchanged between lots of agents using various levels of security. This data is no longer reliable, as it may contain many more errors than expected. This has legal consequences, because the digital signature schemes used as proof of transaction are now debatable.

So not only the CERN is concerned, but everyone with the worldwide bank, financial, and fiscal data used by administrations or by credit card processors in online commercial transactions.
0 Votes
1000 terabytes = 1 petabyte
jinko 18th Sep 2007
The "next size up" has a name - Petabyte

http://en.wikipedia.org/wiki/Petabyte
0 Votes
1024 terabytes = 1 petabyte
jjarman 18th Sep 2007
Believe it or not, memory goes up in multiples of 1024, not 1000 like some hard drive manufacturers would like you to believe... scammers. They also quote unformatted, unpartitioned space, not the space available once you lay down the data structures.
Ever wonder why drives always have less space than they claim when you first install them?

My primary HD 700gb only has 698.32gb once formatted. 1.68gb short. I remember when 1.68gb was no small number.

btw.

8 bits in 1 byte
1024 bytes in 1 kilobyte
1024 kilobytes in 1 megabyte
1024 megabytes in 1 gigabyte
1024 gigabytes in 1 terabyte
1024 terabytes in 1 petabyte
1024 petabytes in 1 exabyte
1024 exabytes in 1 zettabyte
1024 zettabytes in 1 yottabyte
etc.
0 Votes
kilo, mega, tera, and so on, are powers of 1000, and hard disk drive or network solution manufacturers are right. These are internationally standard units.

It's only memory and processor manufacturers that have used powers of 1024, due to our current computing models based on binary systems.

But the binary computing model is very close to breaking down, due to physical constraints. Tomorrow's computing will use the capacity of electronic components to work in a non-binary (yes/no) model, by associating the computing power of two neighbouring transistors to create ternary or pentary numeric systems, because they will offer a higher level of integration than existing binary systems...

Just imagine that binary and pentary electronic components are associated in the silicon of the processor: you create a decimal computing model.

Same thing for storage and network transmission media: why are we restricted to using only binary systems? Compatibility is not a problem, because a decimal computing system is perfectly able to solve the same binary problems. And the needed algorithms are not necessarily very complex, and can be integrated in the silicon as well.

Don't use "kilo, mega, tera, peta..." (and the abbreviations "k, M, T, P") for powers of 1024. The IEC defines standardized names for the binary prefixes: "kibi, mebi, tebi, pebi..." (and abbreviations for them: "Ki, Mi, Ti, Pi").
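A short sketch of how far the two conventions drift apart. The 700 GB drive below is a hypothetical example labeled in decimal gigabytes; the constant and function names are made up for illustration.

```python
GB = 10**9    # decimal gigabyte, as printed on a drive label
GiB = 2**30   # binary gibibyte, as many operating systems count

def bytes_to_gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / GiB

drive_bytes = 700 * GB
drive_gib = bytes_to_gib(drive_bytes)

# The gap between decimal and binary prefixes grows with each step up.
gap_percent = {
    "kilo/kibi": (2**10 / 10**3 - 1) * 100,
    "giga/gibi": (2**30 / 10**9 - 1) * 100,
    "peta/pebi": (2**50 / 10**15 - 1) * 100,
}

print(f"700 GB = {drive_gib:.2f} GiB")
for name, pct in gap_percent.items():
    print(f"{name}: binary unit is {pct:.1f}% larger")
```

So a drive sold as 700 GB shows up as roughly 652 GiB before any filesystem overhead, purely because of the unit convention.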

Some software and documentation is currently being updated to remove this ambiguity, caused by the incorrect usage of units inherited from the old binary computing model. This is the way to go, and it's not the hard disk manufacturers that are lying, but the industry of processor and memory manufacturers (and the designers of our old OSes, who are now under pressure to show accurate units, including in their GUIs and documentation).

Who has been lying and creating confusion for decades? Intel, Motorola, Microsoft, ... This confusion is close to exploding. Linux supporters are advertising the correct use of binary-based units where appropriate. It's time for Microsoft to do the same thing in Windows and stop the confusion; it's not up to hard disk manufacturers, who have always been right.

And the effective capacity of formatted hard drives depends on OS implementations, and there are different OSes, so hard disk manufacturers are right to display figures that do not depend on a specific OS's implementation of its filesystem, or on specific usage of these disks such as in RAID systems.
0 Votes
I understand the metric system...
jjarman 18th Sep 2007
Unfortunately this convention for both memory and storage has been the same since the beginning of the industry, and only recently has there been a shift by HD manufacturers to label the other way. The prefixes may not be correct, but the unit of measure has been well established.

None of this is or was up to me personally. I didn't make these decisions. All I'm doing is describing what already is. I personally support the metric system and could discuss all day how things should be. That doesn't mean I don't understand the history of the industry or how things are now.

Yes, each filesystem type requires a different amount of space from the raw disk, but the ranges are all fairly similar and well known. Hard drive manufacturers used to talk about formatted space; I remember when they switched that as well, and why. Again, understanding the industry's history helps bring this conversation from speculation to what actually is.

If you've been around, you understand why things have grown the way they have. Remember MFM or RLL?

Cheers.
0 Votes
Interfaces
filker0 18th Sep 2007
MFM, RLL -- how the data was recorded. There were others as well, like GCR, and EFM. There were many other recording methods used on hard drives. The analog electrical interfaces to the media were common on PC controllers, but larger systems tended to present a digital interface that was independent of the recording method used to store the data. SASI, SCSI, SMD, and ESDI, all digital interfaces that I used on one system or another, often had MFM or RLL drives hidden behind the on-drive electronics. IDE moved the bulk of an ESDI controller to the disk side of the cable.

I not only remember MFM and RLL drives, I built a device (wire-wrap, solder, a few PLAs, an Intel 8751, and some off-the-shelf parts) to allow a system that expected two MFM disks to be plugged in to a single SASI drive. Pointless? It was cheaper (I got the 20MB SASI drive surplus) than two 10MB MFM drives at the time, and turned out to be very slow if the system (Zeus Unix SysIII, Z8000 CPU) was trying to access both drives at once. Still, it worked.

Eventually, I opened up the 20MB SASI enclosure, and there was an RLL drive and an RLL-to-SASI adapter inside.
0 Votes
This certainly is one explanation of some problems encountered from time to time by us smaller data pushers. I would think the error rate is likely higher for home and small office PC users, because these machines are often not professionally installed. Most run on unconditioned power and airflow, fill with dust that is seldom removed, are subject to outside electromagnetic disturbances (radio transmissions) and are thermally cycled to save energy costs. I'm sure the list of possible afflictions is much longer.
Couldn't the unexpectedly high level of errors suggest that they are the result of still-undetected particles, exactly what CERN is looking for?

These figures about errors may in fact be very interesting for CERN, because they could be a measure of the prevalence of these undetected particles, of where and when they occur, and of why this affects the stored/computed data.

So our billions of computers used worldwide would become new detectors for the rare particles that CERN is looking for. If we can collect error rates on millions of processors, hard disks or network links, we have a new tool to estimate reliably the prevalence of these undetected particles in our universe, and an estimate of the work needed to find and classify this "invisible" material.

So, instead of building costly particle colliders and detectors, why not use the billions of computers, hard disks, and network links as distributed detectors?

This research will interest the manufacturers of processors and hard disks, because if a new particle finally becomes detectable and measurable, it allows building a counter-measure into storage devices so that they can improve their error detection rate and store data more reliably, even in the presence of these "rare" particles that are prevalent (but only detectable in the rare events where they do interact with our visible hardware).

Just consider neutrinos: we know they exist everywhere and that we cannot build anything to shield us from their rare effects (neutrinos come from everywhere in the universe, and most of them can traverse the planet without colliding with anything, but when they do interact, their effects are extremely visible, because they intrinsically transport high levels of energy).

So maybe we are looking for even rarer particles, which collide in even rarer cases but produce stronger effects when they do, because they transport even larger amounts of energy. Superneutrinos?
0 Votes
2nd Law effects?
GreyGeek 18th Sep 2007
Instead of seeing proof for undetected sub-atomic particles I have a feeling we are running into 2nd Law effects as outlined by Claude Shannon. NOTHING operates at 100% efficiency, even data (re)transmission.

When the volume of data being moved around gets so large, normally microscopic error rates begin to reveal themselves by showing up as corrupted data at the macro scale.

So, as with the perpetual motion machine, we reach a thermodynamic limit on our ability to move large amounts of data without error.
ZFS fixes these problems. Read this short article to get more information about this problem:
http://queue.acm.org/detail.cfm?id=1317400
