Realtek network driver silently corrupts data

[Update 8/8/2007 - Realtek silent data corruption caused by firmware]

One of the three most dreaded phrases in the computer world is "SILENT DATA CORRUPTION".  Your data gets corrupted just enough that most applications and operating systems can't readily detect it, and you think your data is good until you actually need to use it.  This weekend, while doing some routine maintenance on my home computer and moving some data over my Gigabit LAN (now cheap and common), I got bitten badly by silent data corruption.

My Realtek network adapter, one of the most ubiquitous on-board Gigabit adapters in the world, was the culprit.  It had been causing me massive grief for months and I just didn't know it.  Almost every modern desktop motherboard I know of uses this particular on-board Gigabit adapter, so I have to wonder how many millions of people are affected by this issue and whether the problem exists in any of Realtek's server adapters.  More specifically, Realtek driver version 6.191 was the culprit.

The problem had gotten so bad that if I dared run anything like uTorrent in the background, the corruption rate was high enough that I couldn't send any email attachments.  Even my Windows Update downloads got severely corrupted, causing a permanent inability to update Windows Vista, and I had to spend half a day with a good Tech Support guy from Microsoft and some Microsoft developers to get the update problem fixed.  Initially I wondered whether uTorrent itself was the cause, but it turns out uTorrent was merely the trigger; it was the most extreme case because it transmitted and received so much more data.

So when I was transferring some videos from one computer to another today, I noticed that playback was filled with severe artifacts and I knew something wasn't right.  I remembered that the file copy operations had forced me to retry once or twice per file.  I downloaded a copy of Advanced CheckSum Verifier, which generates a text file listing MD5 checksums that tell me whether the files have been altered.  It turned out that all but the smallest files in the directory I copied had been altered, which means the data was being silently corrupted.
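
For readers without a checksum tool handy, the same comparison can be sketched in a few lines of Python. This is only an illustration of the idea, not how Advanced CheckSum Verifier works internally; the function names `md5_of_file`, `build_manifest` and `find_mismatches` are made up for this example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its MD5 digest."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): md5_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(source_dir, copy_dir):
    """Return relative paths whose digest differs, or that are missing, in the copy."""
    source = build_manifest(source_dir)
    copy = build_manifest(copy_dir)
    return sorted(name for name, digest in source.items() if copy.get(name) != digest)
```

Calling `find_mismatches` with the original directory and the copied directory returns the list of files that did not survive the transfer intact. An empty list means every file matched.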

I shut off uTorrent and tried the file transfer again, and Windows Vista didn't prompt me to re-copy anything, which was a positive sign.  I ran the checksum again and found that although the hundred-megabyte file had copied correctly, two of the three gigabyte-sized files were corrupted.  That suggests roughly one silent transmission error for every billion bytes sent, so now I'm left scratching my head.  The error rate had definitely declined but the problem hadn't entirely gone away.  Then I realized that Skype and MSN (while hardly active) were still running on the PC in question, so I shut off Skype and MSN and tried to send the files again.  As I suspected, the transmission errors stopped and every file passed the MD5 checksum test.
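
The "one error per billion bytes" estimate can be sanity-checked with a little arithmetic. The sketch below assumes, purely for illustration, that each byte is corrupted independently with the same small probability; real driver or firmware faults may well be bursty, so treat the numbers as a rough guide. The function name `probability_clean` is made up for this example.

```python
import math


def probability_clean(file_bytes, byte_error_rate):
    """Chance that a file of file_bytes arrives with no corrupted byte.

    Assumes independent errors, so the result is (1 - rate) ** file_bytes,
    computed via log1p to stay accurate for very small rates.
    """
    return math.exp(file_bytes * math.log1p(-byte_error_rate))


if __name__ == "__main__":
    rate = 1e-9  # one corrupted byte per billion bytes, as estimated in the post
    for label, size in (("100 MB", 100e6), ("1 GB", 1e9)):
        print(f"{label}: {probability_clean(size, rate):.2f} chance of a clean copy")
```

Under that assumption a 100 MB file arrives clean about 90% of the time, while a 1 GB file arrives clean only about 37% of the time, which is in line with the small file passing and two of the three large files failing.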

At this point it was obvious that something was wrong with the network subsystem on the machine, which could only reliably transmit data when just one application was using the network at a time.  I suspected that it might be the network driver, so I upgraded to the latest 6.195 driver (downloaded from here).  I then ran the torture test with uTorrent going full blast while copying a few gigabytes of data to the other computer, and everything copied without a single checksum error even under the worst conditions.  So it's obvious that Realtek driver version 6.191 had been the culprit all along and it had caused me a lot of grief.  The problem is that now I'm worried about what else I corrupted during the last four months.

The immediate lesson to my readers is that you had better check your drivers, because there's a good chance you have a Realtek network adapter.  If you do, it would be a good idea to upgrade to the latest version.  The long-term implications are a bit more complex, because I have to wonder how driver version 6.191 got through hardware qualification at Realtek and how it got through Microsoft's WHQL (Windows Hardware Quality Labs).

Why aren't Realtek and Microsoft doing this type of multi-gigabyte multi-application data transmission testing?  There is an expectation that WHQL means quality given the fact that the Q in WHQL stands for "quality".  Why can't Windows Vista (or any other Operating System) have more robust file copying capability to overcome these types of transmission errors and why can't Windows Vista do checksum testing to warn the user if there is data corruption?  I realize that this is more CPU intensive but we're in the era of multi-core CPUs and I don't think it's unreasonable for users to expect some level of reliability.
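
As a rough illustration of what a checksum-verified copy could look like, here is a minimal Python sketch. It is not how Windows Vista or any other operating system copies files; `file_digest` and `copy_with_verify` are hypothetical names, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_with_verify(source, destination, max_attempts=3):
    """Copy source to destination, re-copying until the digests match.

    Returns the number of attempts used; raises OSError if every attempt
    produced a destination whose digest differs from the source.
    """
    expected = file_digest(source)
    for attempt in range(1, max_attempts + 1):
        shutil.copyfile(source, destination)
        if file_digest(destination) == expected:
            return attempt
    raise OSError(f"{destination} still corrupt after {max_attempts} attempts")
```

One caveat: reading the destination back may be served from a local cache, or may cross the same faulty link a second time, so a check like this is most trustworthy when the digest of the copy is computed on the receiving machine.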

Topics: Networking, Hardware, Microsoft, Windows

Talkback

57 comments
  • thanks

    That clears up a couple of blue screen mysteries that went away after changing a NIC.
    yagijd
    • Did you have the same problem?

      Did you have the same problem with the Realtek NIC drivers?
      georgeou
  • Not just the driver

    George, TCP/IP is [b]supposed[/b] to be robust in cases like this -- bitflips lower in the stack [b]should[/b] be caught at the datagram level.

    If a driver error can corrupt data, it's because the operating system isn't verifying it higher up.
    Yagotta B. Kidding
    • I wonder how other OSes handle this

      I wonder how other OSes like Linux and OS X (BSD) handle this.
      georgeou
    • Not just the driver

      I'm pretty sure TCP only protects its own headers. The data is not checksummed because that would be too computationally intensive.

      ernie
      DennisErnst
      • From my ancient...

        kernel as found in the /usr/src/linux/ipv4/ip_output.c file

        /* Generate a checksum for an outgoing IP datagram. */
        __inline__ void ip_send_check(struct iphdr *iph)
        {
                iph->check = 0;
                iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
        }
        Cardinal_Bill
        • It's just the header

          http://www.tcpipguide.com/free/t_IPDatagramGeneralFormat.htm
          DennisErnst
          • Yep.

            From my quick read of the TCP/IP Network Administration book by O'Reilly, it looks like the Application Layer is responsible for verifying data integrity. So, if I read it correctly and this is the case, exactly what application was he using to transfer the data, and why didn't it catch the error(s) every time?

            Of course you could also look into the compatibility of the NICs and whatever lies between the two computers (routers/switches/etc.)
            Cardinal_Bill
          • Actually, it's the data too

            The IP "wrapper" only checksums the header, but the TCP packet which is contained inside of the IP packet does checksum the data, as does UDP.

            TCP is meant as reliable, in so far as it represents a stream of checksummed data; the protocols should maintain the data packets in the correct order, and automatically retransmit any dropped or corrupted packets.

            UDP is considered unreliable, in that packets may be delivered out of order or dropped altogether; a checksum is used to look for corrupt data, in which case the packet will generally be dropped and just disappear rather than being retransmitted.

            Here is a better resource:

            http://www.protocols.com/pbook/tcpip2.htm
            fde101
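
For reference, the checksum discussed in this thread is the 16-bit one's-complement sum described in RFC 1071. The Python sketch below shows how it is computed over a run of bytes. It is a simplification: a real TCP checksum also covers the TCP header and a pseudo-header built from the IP addresses, which this example leaves out, and `internet_checksum` is just a name chosen here.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


if __name__ == "__main__":
    original = b"\x12\x34\x56\x78"
    bit_flip = b"\x12\x35\x56\x78"   # one bit changed
    reordered = b"\x56\x78\x12\x34"  # the two 16-bit words swapped
    print(internet_checksum(original) != internet_checksum(bit_flip))   # True
    print(internet_checksum(original) == internet_checksum(reordered))  # True
```

A single flipped bit changes the checksum, so it would be caught. Because addition does not care about order, though, swapping two 16-bit words leaves the checksum unchanged, which is one reason a 16-bit sum is a weaker guarantee than an MD5 comparison of whole files.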
  • Can you check this?

    I'm starting to worry. My motherboard also has a Gigabit adapter. My machine runs on Ubuntu 7.04. This is what Linux says about the adapter:

    *-network
    description: Ethernet interface
    product: L1 Gigabit Ethernet Adapter
    vendor: Attansic Technology Corp.

    Is this the same adapter as yours? I see no mention of Realtek, only of Gigabit.

    Greetz, Pjotr.
    pjotr123
    • This is the driver:

      driver=atl1 driverversion=2.0.6
      pjotr123
    • This was a driver issue

      1. I'm not sure you have Realtek hardware.
      2. This is a Windows Driver issue.

      If you're concerned about it, do the type of multi-application test where you transmit a few gigabytes and see if there are any errors.
      georgeou
  • Nice article George!

    I've just spent a while checking the office kit via remote login. Fortunately the controllers in all the servers are Tornados (Dell & HP kit) or Intel on the desktops! Phew!!

    [i]"I suspected that maybe it was the network driver so I upgraded to the latest 6.195 driver"[/i]

    Of course, the problem is that at one point in time, this *was* the latest driver. If we go around testing every single component because we can't trust them then we might as well build the software ourselves a la open-source! This makes your next point about WHQL certification very pertinent - how *did* this get certified?

    Finally (and I know George can't answer this) an additional consideration is other OSes. Is this a generic fault across the driver for all OSes or is the damage limited specifically to Windows? If anyone out there can test this on Linux or Mac, posting the results here or contacting George with them could be very useful. Was this driver even available for other OSes?

    Perhaps George could add a footnote to the blog entry asking for submissions on this point?
    bportlock
    • Updates for other OS

      From looking at the dates on the RealTek website it says

      Linux - No driver provided by realtek, driver is built in to the kernel

      Mac OSX - Current driver is 23 Mar 2006 which is a year older than George's faulty driver.

      Various Unixes - drivers even older than OSX
      bportlock
    • I'll post the answer if someone tests the Linux drivers

      Thanks.
      georgeou
      • See my post above...

        ... it seems that (according to Realtek) Linux has a native driver for this and the OSX driver is getting on for 18 months old.
        bportlock
        • built into kernel: still a Realtek driver

          Linux has nearly all hardware drivers built into the kernel. But they are still just that: drivers. Sometimes supplied by the hardware manufacturers, sometimes coded by Linux developers. So, these drivers can have issues as well, just as Windows drivers can.
          pjotr123
          • I understand what you're saying...

            ... but from the text on the RealTek site it would *seem* that they don't write that driver. Maybe that's a bit much to be reading into their words, but it seems to me that if the issue George outlined was occurring in the kernel drivers then it would have been flagged by now. By the nature of Linux, many systems get used as servers and moving huge volumes of data is a server's "bread and butter".

            In any case the Linux driver is liable to be substantially different from the Windows version that George found the fault with, if only because the Windows one is 4 months old.
            bportlock
          • Thanks, somewhat reassured

            Indeed, it's unlikely that it's basically the same driver, given the release date of George's driver. That's somewhat reassuring. Thanks for replying.

            Greetz, Pjotr.
            pjotr123
          • I was just surprised something this serious wasn't caught in QA

            I was just surprised something this serious wasn't caught in QA. Maybe they only tested it with one stream at a time and this problem occurs when there are two applications trying to use the network stack at the same time.
            georgeou