[Update 8/8/2007 - Realtek silent data corruption caused by firmware]
One of the three most dreaded phrases in the computer world is "SILENT DATA CORRUPTION". Your data gets corrupted just enough that it isn't readily detectable by most applications and operating systems and you think your data's good until you actually need to use it. This weekend as I was doing some routine maintenance tasks on my home computer and moving some data over my Gigabit LAN (now cheap and common), I got bit badly by silent data corruption.
The culprit was my Realtek network adapter, one of the most ubiquitous on-board Gigabit adapters in the world, and more specifically Realtek driver version 6.191. It had been causing me massive grief for months and I just didn't know it. Almost every modern desktop motherboard I know of uses this particular on-board Gigabit adapter, so I have to wonder how many millions of people are affected by this issue and whether the problem exists in any of Realtek's server adapters.
The problem had gotten so bad that if I dared run anything like uTorrent in the background, the corruption rate was high enough that I couldn't send any email attachments. Even my Windows Update downloads got severely corrupted, causing a permanent inability to update Windows Vista, and I had to spend half a day with a good tech support guy from Microsoft and some Microsoft developers to get the update problem fixed. Initially I wondered whether uTorrent itself was the cause, but it turns out that uTorrent was merely the trigger; it was the more extreme case because it transmitted and received so much more data.
So when I was transferring some videos from one computer to another today, I noticed that playback was filled with severe artifacts, and I knew something wasn't right. I remembered that the file copy operations had forced me to retry once or twice per file. I downloaded a copy of Advanced CheckSum Verifier, which generates a text file listing MD5 checksums that tell me whether the files have been altered. It turned out that all but the smallest files in the directory I copied had been altered, which means the data was being silently corrupted.
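For readers who want to run the same kind of check without a dedicated tool, here is a minimal Python sketch of the idea: hash every file in the source directory and in the copy, then list the files whose MD5 digests differ. This is not how Advanced CheckSum Verifier works internally; the function names (md5_of_file, build_manifest, find_mismatches) are my own, made up for illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its MD5 digest."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): md5_of_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def find_mismatches(source_dir, copy_dir):
    """List relative paths missing from the copy or whose digests differ."""
    source = build_manifest(source_dir)
    copy = build_manifest(copy_dir)
    return sorted(name for name, digest in source.items() if copy.get(name) != digest)
```

Calling find_mismatches on the original directory and on the copied directory returns the files that did not survive the transfer intact. An empty list means every file matched. MD5 is not safe against deliberate tampering, but it is adequate for catching accidental corruption like this.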
I shut off uTorrent and tried the file transfer again, and Windows Vista didn't prompt me to re-copy anything, which was a positive sign. I ran the checksum again and found that although the hundred-megabyte file had copied correctly, two of the three gigabyte-sized files were corrupted. That suggests roughly one silent transmission error for every billion bytes sent, so now I was left scratching my head. The error rate had definitely declined but the problem hadn't entirely gone away. Then I realized that Skype and MSN (while hardly active) were still running on the PC in question, so I shut off Skype and MSN and tried to send the files again. As I suspected, the transmission errors stopped and every file passed the MD5 checksum test.
At this point it was obvious that something was wrong with the network subsystem on the machine that could only reliably transmit data when just one application was using the network at a time. I suspected that maybe it was the network driver so I upgraded to the latest 6.195 driver (downloaded from here). I then ran the torture test with uTorrent going full blast while copying a few gigabytes of data to the other computer and everything copied without a single checksum error even under the worst conditions. So it's obvious that Realtek driver version 6.191 had been the culprit all along and it had caused me a lot of grief. The problem is that now I'm worried about what else I corrupted during the last four months.
The immediate lesson for my readers is that you had better check your drivers, because there's a good chance you have a Realtek network adapter. If you do, it would be a good idea to upgrade to the latest version. The long-term implications are a bit more complex, because I have to wonder how driver version 6.191 got through hardware qualification at Realtek, and I also have to wonder how it got through Microsoft's WHQL (Windows Hardware Quality Labs).
Why aren't Realtek and Microsoft doing this type of multi-gigabyte multi-application data transmission testing? There is an expectation that WHQL means quality given the fact that the Q in WHQL stands for "quality". Why can't Windows Vista (or any other Operating System) have more robust file copying capability to overcome these types of transmission errors and why can't Windows Vista do checksum testing to warn the user if there is data corruption? I realize that this is more CPU intensive but we're in the era of multi-core CPUs and I don't think it's unreasonable for users to expect some level of reliability.
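To make the request concrete, here is a rough Python sketch of what a checksum-verified copy could look like: hash the source, copy it, hash the destination, and retry if the two digests disagree. This is only an illustration of the idea, not a description of anything Windows Vista does. The function names (file_digest, copy_with_verification) and the max_attempts parameter are my own inventions.

```python
import hashlib
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verification(source, destination, max_attempts=3):
    """Copy a file, re-copying until the destination digest matches the source.

    Returns the number of attempts used; raises OSError if every attempt
    produces a mismatching copy.
    """
    expected = file_digest(source)
    for attempt in range(1, max_attempts + 1):
        shutil.copyfile(source, destination)
        if file_digest(destination) == expected:
            return attempt
    raise OSError(
        f"{destination} still differs from {source} after {max_attempts} attempts"
    )
```

One caveat: when the destination is a network share, reading the file back for verification crosses the same faulty link, so a mismatch could come from the read-back rather than the copy, and caching may hide a bad write. A more trustworthy check computes the digest on the receiving machine, which is effectively what I did by hand with the checksum tool.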