Realtek silent data corruption caused by firmware

Written by George Ou, Contributor

A little more than a week ago, I took Realtek and Microsoft to task over some Realtek WHQL-certified drivers causing silent data corruption. As it turns out, the problem was actually caused by a firmware flaw in Realtek's hardware. This means the problem would have affected any operating system, regardless of the driver, that used the large send offload and checksum offload hardware features of Realtek's gigabit network adapter. I thought it was a driver problem because upgrading the driver fixed it, but it turns out the driver update actually updated the firmware in the on-board network adapter. Here's what a Microsoft engineer had to say:

I was wondering if you had a moment to talk about the article you wrote about the Realtek silent data corruption issue and WHQL testing. I have been heading up this investigation on our end and have some interesting reasons as to why this issue was missed. On the surface, it seems that any basic test should have been able to catch this, and that it should have never passed any sort of testing. However, this bug only manifests itself under a pretty unforeseeable circumstance. First, checksum offload and large send offload must both be enabled in the driver. Checksum offload and large send offload are things we do test, but what makes the circumstances a little more strange are that a small packet, 58 bytes or less, must be sent before a very large packet. These are things we also test for, small packets and large packets. The problem in testing comes in that we don’t test all of these together at one time. Now, unfortunately in a real life situation this can happen pretty easily if let’s say you are running BitTorrent client while trying to transfer files on your local LAN with both offloads enabled. Now, the tricky part in testing is that we have hundreds of individual tests we run against each driver before it gets certified, and there is a nearly infinite combination of tests that we could run in combination. We are working on that however, a way to run these tests in many combinations.

Now, you are probably thinking why don’t we have just a simple test where you transfer some files over background traffic. Well, considering the level of testing we already do, this kind of a test is unlikely to find any other bug except this one. However, we are discussing adding such a test just to be safe. Secondly, the tests need to be deterministic. Each of our tests checks for something in particular: checksum offload, large packets, small packets, etc. If a test fails like moving files over background traffic it does not necessarily tell us what is wrong with the driver and thus takes a long time to debug the point of failure. In a case such as silent data corruption, it would be obvious what was wrong, but often times it will not be so obvious. Again though, we are still evaluating these approaches.

At any rate, as always, I would value your opinion on the matter now that you have a more complete story. We are taking this problem very seriously and of course are trying to improve the testing where we can.
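To make the trigger described in the email concrete, here is a minimal sketch of an end-to-end integrity check: it sends a small packet, then a large one, and compares SHA-256 digests of what was sent and what arrived. This is my own illustration, not Microsoft's or Realtek's test. The function names, the 1 MiB payload size, and the reading of "58 bytes" as payload size are all assumptions. Run over the loopback address as written, it never touches the network adapter, so it only demonstrates the method; exercising the hardware offloads would mean running the sender and receiver on two machines across the adapter under test.

```python
import hashlib
import os
import socket
import threading

SMALL_PACKET_SIZE = 58          # threshold quoted in the email; treated here as payload bytes (assumption)
LARGE_PAYLOAD_SIZE = 1 << 20    # 1 MiB; arbitrary "very large" size chosen for this sketch


def _recv_all(server: socket.socket, sink: bytearray) -> None:
    """Accept one connection and read until the sender closes it."""
    conn, _ = server.accept()
    with conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            sink.extend(chunk)


def transfer_and_verify(host: str = "127.0.0.1") -> bool:
    """Send a small packet followed by a large one; return True if the digests match."""
    small = os.urandom(SMALL_PACKET_SIZE)
    large = os.urandom(LARGE_PAYLOAD_SIZE)
    received = bytearray()

    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind((host, 0))        # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        reader = threading.Thread(target=_recv_all, args=(server, received))
        reader.start()

        with socket.create_connection((host, port)) as client:
            # Disable Nagle so the small send is not held back and merged with the large one.
            client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            client.sendall(small)     # small send first
            client.sendall(large)     # then one large send for the stack to segment
        reader.join()

    sent_digest = hashlib.sha256(small + large).hexdigest()
    received_digest = hashlib.sha256(bytes(received)).hexdigest()
    return sent_digest == received_digest
```

A check like this only says that corruption happened somewhere along the path, not which feature caused it, which is exactly the trade-off the engineer raises about non-deterministic tests.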

The main responsibility for testing this lies with Realtek, and I certainly hope they've learned a valuable lesson here and will avoid this kind of bug through more rigorous testing. The secondary responsibility falls to the OS vendor to validate the hardware, though I have to admit that this particular problem was a difficult one for both Realtek and Microsoft to discover. I understand that adding another test procedure makes the testing process a little more time-consuming, but anything's better than forcing the customer to do the beta testing of hardware and firmware. I'm happy to hear that Microsoft and Realtek are working together to figure out these problems, and I hope not to see this kind of bug again.
