What's the real story on the Windows Home Server data corruption bug?

Summary: Last week, an alarmingly terse Knowledge Base article got the undivided attention of Windows Home Server users with its warning that they risk data corruption if they edit files stored on a home server using a handful of popular programs. How widespread is this bug, really, and why wasn't it caught during the long beta test cycle? I've got some inside information.

In the software industry, data-damaging bugs are every product manager's nightmare. When a reproducible bug in this category is identified, sirens go off, vacations get canceled, engineers lose sleep, and product managers pop Maalox until it's fixed.

That's the context behind the alarmingly terse Knowledge Base article 946676, published last week. The entire article encompasses only a few sentences, but it got the attention of anyone using Windows Home Server:

When you use certain programs to edit files on a home computer that uses Windows Home Server, the files may become corrupted when you save them to the home server. Several people have reported issues after they have used the following programs to save files to their home servers:

  • Windows Vista Photo Gallery
  • Windows Live Photo Gallery
  • Microsoft Office OneNote 2007
  • Microsoft Office OneNote 2003
  • Microsoft Office Outlook 2007
  • Microsoft Money 2007
  • SyncToy 2.0 Beta

Additionally, there have been customer reports of issues with Torrent applications, with Intuit Quicken, and with QuickBooks program files. Our support team is currently trying to reproduce these issues in our labs.

I asked a senior member of the Windows Home Server team for more details yesterday. Here's what I learned:

This is not an issue that affects every Windows Home Server installation, and the symptoms require several factors that are not mentioned in the KB article. The largest contributing factor is when a home server is under extreme load. If you're doing a large, highly demanding file copy operation in the background and you're using one of the listed applications to edit a file that's stored on a shared folder on the home server, and you save the edited file to the server, then you might see this bug.

In fact, it took a long time to get a reproducible series of steps for this issue. A number of reports of data corruption that appeared to be related to this issue turned out instead to be traceable to faulty network cards, hard drive failures, or old routers with outdated firmware. It took some very detailed bug reports, accompanied by sample files and server logs, to create a consistently reproducible environment in the lab; that's the missing piece it takes to isolate the root cause and develop a patch.

Meanwhile, backups stored on a Windows Home Server are completely safe, as are files copied to the server for safekeeping or streaming. This issue affects only files that are saved directly from one of the listed applications to a shared folder on a Windows Home Server.

No one I talked to at Microsoft is minimizing the impact of this bug. That bare-bones KB article was specifically designed to "get people to take it seriously," I was told.

So why wasn't this issue identified months ago, during the long beta test cycle for Windows Home Server? That's the trouble with beta testing, as I know from firsthand experience. Last summer, after the Windows Home Server beta cycle had officially ended but before the software had been released to the public, I noticed that some program files stored on my custom-built Windows Home Server box were being mysteriously corrupted. Trying to open one of these files didn't launch a Windows installer, as expected; instead, a Command Prompt window opened for a split second and then closed without doing anything. The file icon had changed to a generic MS-DOS icon, and the file properties suggested that these Windows programs had been transformed into MS-DOS programs. It didn't affect every program, and the corruption seemed to be random.

In searching through bug reports, I found two or three other, similar reports, all of which had been closed as "not reproducible." I filed a report anyway and heard back from an engineer who peppered me with questions. Over the course of the next few days, we narrowed down the scope of the bug and created a repro test case:

  • The files had to be fairly large, at least 2 or 3 megabytes in size.
  • They had to have been downloaded from the Internet on a Windows machine, which adds an alternate data stream (Zone.Identifier) that blocks execution of the file without user consent.
  • They had to have been uploaded to the Windows Home Server from a machine running Trend Micro antivirus software. Other AV and security programs didn't trigger this bug.

That's a fairly complex series of conditions, and it's not surprising that it took some time and sleuthing to identify the exact sequence of conditions. But when the issue was documented in Knowledge Base article 943393, none of those additional details were mentioned.
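
For readers who want to check the second condition on their own files, here is a minimal Python sketch. It assumes the download marker lives in an NTFS alternate data stream named Zone.Identifier, whose contents look like a tiny INI file with a [ZoneTransfer] section and a ZoneId value (3 is the Internet zone). The function names are hypothetical, and the stream can only be read on a Windows machine with an NTFS volume; this is an illustration, not the test the Windows Home Server team used.

```python
import configparser
from typing import Optional


def parse_zone_identifier(stream_text: str) -> Optional[int]:
    """Return the ZoneId stored in Zone.Identifier text, or None if absent.

    Assumed format (as typically written by Windows on download):
        [ZoneTransfer]
        ZoneId=3
    """
    parser = configparser.ConfigParser()
    try:
        parser.read_string(stream_text)
        if parser.has_option("ZoneTransfer", "ZoneId"):
            return parser.getint("ZoneTransfer", "ZoneId")
    except (configparser.Error, ValueError):
        # Malformed or unexpected content: treat it as "no usable marker".
        return None
    return None


def read_zone_id(path: str) -> Optional[int]:
    """Read the Zone.Identifier stream of a file (Windows/NTFS only).

    Returns None when the stream does not exist or cannot be opened,
    which is also what happens on non-Windows systems.
    """
    try:
        with open(path + ":Zone.Identifier", "r") as stream:
            return parse_zone_identifier(stream.read())
    except OSError:
        return None
```

A result of 3 from read_zone_id would suggest the file was marked as downloaded from the Internet; None means no marker was found.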

That bug was patched within a few weeks after the KB article was published (the details are in KB article 941914), and the fix was pushed out in mid-November to any Windows Home Server box via Windows Update.

I fully expect the current bug to be patched fairly quickly now that a repro case is available. Meanwhile, it pays to be conservative and heed the advice of that KB article, even if the odds are relatively low that this particular bug will strike you.
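
One conservative routine, sketched here in Python, is to save edits to the local disk first, then copy the finished file to the shared folder and compare checksums. This is my own illustration rather than anything from Microsoft or the KB article: the helper names are hypothetical, and it relies on the point made earlier that plain file copies to the server are not affected by this bug.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(local_file, share_dir):
    """Copy a locally saved file into share_dir and confirm the copy matches.

    Raises IOError if the checksums differ; returns the target path otherwise.
    """
    local_file = Path(local_file)
    target = Path(share_dir) / local_file.name
    shutil.copy2(local_file, target)
    if sha256_of(local_file) != sha256_of(target):
        raise IOError(f"Checksum mismatch after copying to {target}")
    return target
```

In practice, share_dir would be the path to a shared folder on the home server (any such path is an assumption about your own setup). A matching checksum only shows that the copy matched at the time of the check.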

Talkback

117 comments
  • apology accepted (nt)

    :(
    n0neXn0ne
  • That's OK Ed...

    ..it's not like WHS is even a BIG market anyhow.
    D T Schmitz
  • Home users combing MS KB's?

    I can't really see most home users combing MS KB articles, can you? Most are lucky to be able to find the power switch.

    When you pay for a software...at least when I pay for it...I expect it to be a solid product that adds value for the money. With MS products I find myself frequently going back to the KB articles to sort out the issue of the day.

    I do the same thing on Linux, btw, only in the Ubuntu forums instead of MS KB's. The difference is I pay for one and the other is free. I don't have to justify the value proposition with Ubuntu. It's a functional, robust operating system fully loaded with a feast of productivity applications and hundreds more mere mouse clicks away. In exchange for that value I have to occasionally spend time in the Ubuntu forum figuring out a minor tech issue, though not nearly as often as I did with 7.04. Fair trade.

    Not sure it's reasonable to expect home users to be patient with something like this or be able to figure out how to resolve it. And if you are, what's the advantage to giving MS your money? I see money changing hands but where's the value?
    Chad_z
    • Windows Update

      When the fix is available, it will be delivered via Windows Update, automatically. You shouldn't need to comb any KB articles, and most people - the overwhelming majority - will never encounter this bug. There were more than 100,000 beta testers for WHS, and so far this issue has been reported by "several people."

      The average user who buys WHS and uses it for its primary purpose of backing up home PCs and streaming music over the home network will certainly never see it. The fact that the issue is being jumped on quickly and has been documented in a KB article is a good thing, IMO.
      Ed Bott
      • But then again

        if there's a fix available in Ubuntu, you get automatically updated as well. So again, where's the added value?

        I don't question that MS is jumping on this. I do question how this got by their beta testing. Was this error reproduced on the beta builds too (in other words, is it possible that only the RTM is affected)? If the error is on the beta builds too, then they have to seriously question their whole beta process. Out of 100,000 testers, this scenario should have popped up more than once.
        Michael Kelly
        • If you see no value then don't buy it.

          Yes, it really is that simple.
          ye
          • re: ... don't buy it? It comes preloaded! (nt)

            x-(
            n0neXn0ne
          • re: don't buy it? It comes preloaded

            n0neXn0ne:

            Preloaded on what? An HP MediaSmart WHS box, for example?

            If you don't want one, don't buy it. Sheesh!(tm)
            M.R. Kennedy
          • too late now ... (nt)

            x-(
            n0neXn0ne
          • Hah!

            You bought a WHS box, only to be surprised that it came preloaded with WHS!

            That's nice! Try looking before you buy next time...

            (Man, that's rich stuff, there, always fun to tell coworkers of the C/ZDnet idiot move of the day!)
            KTLA
          • On what?

            I am interested in learning more about this....
            gadawg2
          • No kidding

            But we were hoping that MS would provide some added on value. And having an engineering core that looks after little things like this on the customers' behalf usually is one of those added values that proprietary gives you over F/OSS.

            I actually like the idea behind WHS. I think the features they offer do provide value. But file corruption takes away all the added value that those features provide. So I agree, if we're not going to get the added value we bargained for, we should not buy it. Though I'm sure MS has better business sense than you do and will try to address the problem rather than blow the customer off and say "if you don't like it, don't buy it."
            Michael Kelly
          • So you're saying it doesn't add value because of a bug?

            Seems pretty stupid if you ask me.
            ye
          • Now that's odd...

            Because you seem to decry, shall we say any and all non-Microsoft products, if they have a bug or two.

            Ah well...
            ego.sum.stig
          • Data corruption is a show stopper

            What's the point of this server if it's going to corrupt your data?
            voska1
          • Show stopper, yes, but...

            This particular bug doesn't affect backups. It doesn't affect music streaming. It doesn't affect file copying or moving. It doesn't affect opening, editing, or saving files with applications that are not on the list. The simple workaround is to avoid saving files directly from one of the applications listed here.

            So yes, it's a show stopper bug, as is any data-damaging bug, and it should be treated as a top priority. But its impact on the day-to-day use of a WHS box is very, very limited.
            Ed Bott
          • I fully agree with what Ed is saying

            Yes, this is a major bug, and yes, this is being given top priority and will be fixed.

            So moving forward, I hope that MS takes a long hard look at their beta process and figures out how something so debilitating and so commonly used could have slipped by. We shouldn't dwell on this, but they certainly should.
            Michael Kelly
        • That was the point of the second half of this post

          "Out of 100,000 testers, this scenario should have popped up more than once."

          It did. There were reports that were filed during beta testing, but they couldn't be reproduced. In fact, even after the issue was reported after RTM, it took a lot of work and cooperation between a test engineer and a beta tester who was willing to share logs and detailed steps to help find a repro case.

          Look, file corruption bugs get reported all the time. Most often, they're related to problems with hardware or third-party apps. This particular bug is apparently quite hard to reproduce. So someone hits it once, files a bug report, it can't get repro'ed, the engineer asks for more details. The beta tester may try to repro it and not be able to do so because the particular set of circumstances that triggered it is no longer operative, or they're too busy, or they have rebuilt the test system, or they don't have the skills to do a really controlled test.

          That's the problem with beta testing. There's a lot of noise to go along with the signal.
          Ed Bott
      • Big difference there

        "When the fix is available, it will be delivered via Windows Update, automatically"

        Now there's a big difference. With Ubuntu you have to move your mouse pointer all the way up to the corner of the screen when the orange update button is visible. You have to click on that, look at the updates, click install and type in the administrator password.

        Whew! Mercy me I have the vapors from all that work. ;)

        "and most people - the overwhelming majority - will never encounter this bug."

        Oookay. But the few that do will have corrupt files...potentially important files...that may be unrecoverable. That seems like a pretty big handicap for a backup device.

        This may sound odd coming from me but I don't dislike MSFT or Windows. I used to hire former MS workers and they were super. Still are judging from your description of the reaction from the WHS team. And I support Windows and develop on that platform...although I think it's a clunky dev environment and I'm preparing to shift away from supporting MS. My issue with the company and their product line is value. Around ten years ago or so MS started coasting more than innovating. That trend has accelerated over the years. Now it seems as if MS is charging more and more while delivering less and less. Again, I'm speaking collectively, not individually.

        Okay, most users won't have a problem. I can point to some nifty OSS systems that have network storage capability and the majority of people won't have any problems using them, either. Seems kind of samey-samey considering the cost difference.
        Chad_z
        • Wrong difference

          I think the post was answering the reference to having to trawl through KBs. Compared to that either WU or the orange button is a trivial effort.

          Given the number of ANDs to reproduce this bug and its target market, I think it is low risk. I doubt that there would be any significant software that would not have at least one of such convoluted bugs: they just haven't had the problem scenario occur for someone.

          And what is the tie-in with Trend software? This 'bug' may be a result of several bugs in multiple pieces of software. Finding it is one thing, fixing it is another (assuming the true cause is actually found).
          Patanjali