How Microsoft puts your data at risk
56% of data loss due to system & hardware problems - Ontrack
Data loss is painful and all too common. Why? Because your file system stinks. Microsoft's NTFS (used in XP & Vista), with its de facto monopoly, is the worst offender. But Apple and Linux aren't any better.
Everyone knows what the problems are AND high-end systems fixed many of them years ago. Yet only one desktop vendor is moving forward, and they aren't based in Redmond. Here's the scoop.
Y2K got fixed. File systems didn't. That may sound harsh. But with all the lip-service paid to innovation - especially in Redmond - you'd think that sometimes we'd see some, especially in core technology. After all, more than half of all data loss is caused by system and hardware problems that the file system could recover from - but doesn't.
Instead we're using 20-year-old technology that, like the 2-digit year - which led to the Y2K drama - was designed for a world of scarce storage, small disks and limited CPU power. Unlike Y2K though, we are living with, and paying for, these compromises every day with lost data, corrupted files, lame RAID solutions and hinky backup products that seem to fail almost as often as they work.
File systems? I should care because . . .
You rely on your file system every time you save or retrieve a document. It is the file system that keeps track of all the information on your computer. If the file system barfs, your data is the victim. And you get to pick up the pieces.
As documented in my last two posts (see How data gets lost and 50 ways to lose your data), PC and commodity server storage stacks are prone to data corruption and loss, much of it silent. Only your file system is positioned to see and fix these problems. It doesn't, of course, but it could.
And you enterprise data center folks, smirking over the junk consumers get, don't be too smug. Some of your costly high-end storage servers have NTFS or Linux FS's under the hood as well. And no, RAID doesn't fix these problems. According to Kroll Ontrack, only a quarter of data loss instances are due to human error - and many of those errors happen in the panic after a loss is discovered.
Hey, I thought machines were supposed to be good at keeping track of stuff? Only if they are built to.
IRON = Internal RObustNess
I came across the fascinating PhD thesis of Vijayan Prabhakaran, IRON File Systems, which analyzes how five commodity journaling file systems - NTFS, ext3, ReiserFS, JFS and XFS - handle storage problems.
In a nutshell he found that all the file systems have
. . . failure policies that are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures.
Dr. Prabhakaran will see you now
In a mere 155 pages of lucid prose he lays out his analysis of the interaction between hosts and local file systems. It is a clever analysis, especially of the proprietary and unpublished NTFS.
First, inject a lot of errors
Dr. Prabhakaran built an error-injection framework that enabled him to control what kind of errors the file system would see so he could document how the FS handled them. These errors include:
- Failure type: read or write? If read: latent sector fault or block corruption. Does the machine crash before or after certain block failures?
- Block type: directory block; super block? Specific inode or block numbers could be specified as well.
- Transient or permanent fault?
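To make those knobs concrete, here is a minimal Python sketch of the idea. It is not Dr. Prabhakaran's framework: `FaultyDisk`, `Fault` and `read_with_retry` are hypothetical names invented for illustration, the "disk" is just a dictionary in memory, and it models only the failure type, the target block number and the transient-versus-permanent choice (not crash timing or block types).

```python
from dataclasses import dataclass
from typing import Dict, Optional


class DiskError(IOError):
    """Raised when the simulated disk reports a failed read or write."""


@dataclass
class Fault:
    op: str            # "read" or "write"
    kind: str          # "error" (latent sector fault) or "corrupt" (reads only)
    block: int         # block number to target
    transient: bool    # True: fires once, then goes away; False: fires every time
    fired: bool = False


class FaultyDisk:
    """In-memory block device that injects one configured fault (hypothetical)."""

    def __init__(self, num_blocks: int, block_size: int = 16,
                 fault: Optional[Fault] = None):
        self.block_size = block_size
        self.blocks: Dict[int, bytes] = {n: bytes(block_size) for n in range(num_blocks)}
        self.fault = fault

    def _should_fire(self, op: str, block: int) -> bool:
        f = self.fault
        if f is None or f.op != op or f.block != block:
            return False
        if f.transient and f.fired:
            return False  # a transient fault has already happened once
        f.fired = True
        return True

    def read(self, block: int) -> bytes:
        data = self.blocks[block]
        if self._should_fire("read", block):
            if self.fault.kind == "error":
                raise DiskError(f"read failed on block {block}")
            # Block corruption: hand back garbage and report no error at all.
            return bytes(b ^ 0xFF for b in data)
        return data

    def write(self, block: int, data: bytes) -> None:
        # In this sketch a write fault is always a reported error.
        if self._should_fire("write", block):
            raise DiskError(f"write failed on block {block}")
        self.blocks[block] = data.ljust(self.block_size, b"\0")[: self.block_size]


def read_with_retry(disk: FaultyDisk, block: int, retries: int = 1) -> bytes:
    """A simple retry policy: survives transient read errors, not permanent ones."""
    for attempt in range(retries + 1):
        try:
            return disk.read(block)
        except DiskError:
            if attempt == retries:
                raise
    raise AssertionError("unreachable")
```

With a transient read error on block 2, `read_with_retry` gets the data on the second attempt; make the same fault permanent and the error reaches the caller. The "corrupt" kind is the nasty one: the read succeeds and returns garbage, so retries never even kick in.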
So how did NTFS fare?
Since NTFS is proprietary, Dr. Prabhakaran couldn't get as deeply into it as the open-source systems. While NTFS doesn't implement the strongest form of journaling, he found it pretty reliable at letting applications know when an I/O error has occurred. NTFS also retries I/O requests more than the Linux file systems, which, compared to the dearth of retries on Linux, is a good thing.
NTFS sanity checking is also stronger than some. Yet he notes that
NTFS surprisingly does not always perform sanity checking; for example, a corrupted block pointer can point to important system structures and hence corrupt them when the block pointed to is updated.
Translation: Bad Thing.
General screw-ups
Dr. Prabhakaran offered a set of general conclusions about the commodity file systems including NTFS:
- "Detection and Recovery: Bugs are common. We also found numerous bugs across the file systems we tested, some of which are serious, and many of which are not found by other sophisticated techniques."
- "Detection: Sanity checking is of limited utility. Many of the file systems use sanity checking . . . . However, modern disk failure modes such as misdirected and phantom writes lead to cases where . . . [a] bad block thus passes sanity checks, is used, and can corrupt the file system. Indeed, all file systems we tested exhibit this behavior."
- "Recovery: Automatic repair is rare. Automatic repair is used rarely by the file systems; . . . most of the file systems require manual intervention . . . (i.e., running fsck)."
- "Detection and Recovery: Redundancy is not used. . . . [P]erhaps most importantly, while virtually all file systems include some machinery to detect disk failures, none of them apply redundancy to enable recovery from such failures."
Dr. Prabhakaran found that ALL the file systems shared
. . . ad hoc failure handling and a great deal of illogical inconsistency in failure policy . . . such inconsistency leads to substantially different detection and recovery strategies under similar fault scenarios, resulting in unpredictable and often undesirable fault-handling strategies. . . . We observe little tolerance to transient failures; . . . . none of the file systems can recover from partial disk failures, due to a lack of in-disk redundancy.
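The sanity-checking finding quoted above is easier to see in a toy example. The Python sketch below is my own illustration, not code from the thesis or from any real file system: the `INOD` tag, the dictionary standing in for a disk and the helper names are all hypothetical. It shows a misdirected write that sails through a type check but is caught by a checksum that covers the block's intended address and is stored apart from the block itself.

```python
import hashlib

MAGIC = b"INOD"  # hypothetical type tag at the start of every metadata block


def make_block(payload: bytes) -> bytes:
    return MAGIC + payload


def sanity_check(block: bytes) -> bool:
    """Type check only: does this look like the kind of block we expected?"""
    return block.startswith(MAGIC)


def address_checksum(address: int, block: bytes) -> bytes:
    """Hash over the block's intended address plus its contents."""
    return hashlib.sha256(address.to_bytes(8, "big") + block).digest()


disk = {}      # block address -> block contents
expected = {}  # checksums, kept apart from the data blocks they describe


def verify(address: int) -> bool:
    return address_checksum(address, disk[address]) == expected[address]


# Two healthy blocks.
for addr, payload in [(7, b"owner=alice"), (9, b"owner=bob")]:
    block = make_block(payload)
    disk[addr] = block
    expected[addr] = address_checksum(addr, block)

# Misdirected write: an update meant for block 9 lands on block 7 instead.
update = make_block(b"owner=carol")
expected[9] = address_checksum(9, update)  # the file system believes 9 was updated
disk[7] = update                           # but the data went to the wrong place
```

After the bad write, `sanity_check(disk[7])` still returns `True`, because the block is a perfectly well-formed block of the right type. `verify(7)` and `verify(9)` both return `False`: block 7 holds contents that don't belong at that address, and block 9 still holds stale data.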
How doomed are we?
Pretty doomed. But there is some hope.
There are well-known techniques, such as disk scrubbing, checksumming, and more robust ECC, already used in high-end systems, that could be added to our systems. Not rocket science.
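For the curious, here is roughly what checksumming plus scrubbing plus a little redundancy looks like, as a minimal Python sketch. It is a toy model under my own assumptions, not how any shipping file system does it: `MirroredStore` is a hypothetical class that keeps two in-memory copies of every block and one SHA-256 checksum per block.

```python
import hashlib
from typing import List


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class MirroredStore:
    """Two copies of every block plus one checksum per block (a toy model)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        blank = bytes(block_size)
        self.copies: List[List[bytes]] = [[blank] * num_blocks for _ in range(2)]
        self.sums: List[bytes] = [checksum(blank)] * num_blocks

    def write(self, n: int, data: bytes) -> None:
        for copy in self.copies:
            copy[n] = data
        self.sums[n] = checksum(data)

    def read(self, n: int) -> bytes:
        """Return the first copy that matches its checksum and repair the other."""
        for copy in self.copies:
            if checksum(copy[n]) == self.sums[n]:
                good = copy[n]
                for other in self.copies:
                    other[n] = good
                return good
        raise IOError(f"block {n}: no copy matches its checksum")

    def scrub(self) -> List[int]:
        """Walk every block, repair what can be repaired, list what was bad."""
        repaired = []
        for n in range(len(self.sums)):
            if any(checksum(c[n]) != self.sums[n] for c in self.copies):
                self.read(n)  # raises if both copies are damaged
                repaired.append(n)
        return repaired
```

Corrupt one copy of a block behind the store's back and `scrub()` finds it, rewrites it from the good copy and reports the block number. The point of scrubbing is to do this on a schedule, before the second copy goes bad too.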
Young Dr. Prabhakaran now works at Microsoft Research. Perhaps someone up in Redmond will reach out to him to see how NTFS's aging architecture might be enhanced.
Of course, Microsoft is fine with the status quo until it threatens market share. Internet Explorer's innovation hiatus after crushing Netscape is a fine example.
So it is good news that Apple has two storage initiatives that will put pressure on Redmond to clean up its act.
- Time Machine is a beautifully crafted automatic backup utility in Mac OS X 10.5 (Leopard). While it doesn't solve the data corruption problems that I assume HFS+ has as well, it does make it very easy for regular folks to back up and recover their data. I think small business types will love it.
- ZFS is the new open-source file system from Sun that Apple is incorporating into OS X. I expect the port won't be complete for another year, but ZFS is the first file system to offer end-to-end data integrity that can detect and correct such devious problems as phantom writes.
See Apple’s new kick-butt file system for more on ZFS.
The Storage Bits take
As noted in "How data gets lost", more than half of all data loss is caused by system and hardware problems. A high quality file system that took better care of our data could eliminate many of those failures.
The industry knows how to fix the problems. The question is when. With a resurgent Mac pushing ZFS maybe Redmond will see the light sooner, rather than later, and dramatically increase the reliability of all our systems.
It will be interesting to see how Microsofties spin inferior data integrity once ZFS is the OS X default file system. Especially to the enterprise folks for whom data integrity is the ne plus ultra of the data center.
Comments welcome, of course. Itching to read a well-done CompSci PhD thesis? Here's a link to IRON File Systems. Enjoy.
Update: based on the first couple of commenters, who seem to believe that data loss is a figment of my imagination, I gave more prominence to the factual basis of data loss and added a couple of short quotes from the thesis. I single out Microsoft because their negligence impacts more people than any other company. Maybe, someday, Microsoft will start measuring success in terms of software quality instead of market share.
Talkback
Yawn
As a computer user since the 60s, all I have seen is that both hardware and software have become more reliable. You manage to mention NTFS disparagingly even when it outperforms the other file systems and, with FAT32, is the most commonly used file system on the planet. It's one thing to build theoretical models, but where is the data that suggests we are losing information all over the place?
And of course Apple (again rebadging someone else's technology) is going to save the day. Just like its totally unexploitable operating system and its one-button mouse. Well it's still not here is it? Frankly, I prefer the expertise of a company whose OS is used by over 90% of the world to one that borrows other people's work for their small audience.
How about some facts about all the data we are losing?
OK, 56% of data loss due to system & hardware problems
If you are as experienced as you say - which I doubt - you should know that Murphy's Law is operative here.
Nor was this a theoretical exercise: he did real fault injection and found real problems. And Microsoft Research hired him based on that work. I thought you respected Microsoft?
And no, Apple won't save the day, but competition will. Isn't that the American way? Seems to work in other industries.
Robin
Competition????
hmmm. Did you see this TITLE!!
I wish the bloggers who are trying to do their part to dent Microsoft at every chance would just come out and say it. This story was obvious, but some are more subtle and I think it's time even bloggers just stated their stance.
Not entirely true
An advantage to open source is that it's tough for Microsoft to actually buy anything. Which forces them to other methods, like the nebulous claims of patent infringement. But it could allow one to infer that they're nervous.
Is this your only exposure ...
My only wish . . .
Gee...
Now, if you want facts about data we are losing, here are a few real-life facts for you... Since 2000 I have been directly involved with a company that has lost data from their Exchange server on 3 occasions. Two were the result of hardware failure and one was the result of "data corruption due to unknown reasons" per MS support. The scope of the problem was mitigated after the first incident of data loss by initiating incremental backups of the email system several times during the day. Problem is, the loss of even an hour's worth of emails can be devastating to a company that relies on it as a communications vehicle (and these days, who doesn't?). One thing to note is that it's not the billion dollar a year company that typically sees this kind of problem...it's the small company with revenues of less than say $10 million. They don't have the resources to tackle problems like this that big corporations typically have.
It doesn't outperform ReiserFS 4
How it compares to ZFS, I don't know. But since I have been using ReiserFS 4 I have noticed at least a 15% increase in overall read/write performance. I measured this using hdparm -Tt and comparing against the ReiserFS 3.6 that was originally formatted on my drives.
One thing though, I haven't had any serious data loss or corruption in the last couple years using ReiserFS. I have with ext2 and ext3 but not ReiserFS. ]:)
The other powerful attribute is the lack of needing to defragment the drive, something that all Windows systems seem to need, even the NTFS system. Gotta love the *nix based FS... takes all the maintenance out of the equation for you! ]:)
RE: ... ReiserFS 4
In one situation ReiserFS v3 was on a bulletin board system that allowed taxpayers to phone in and check on the status of their returns. The system was up 24/7 for 18 months without the loss of a single byte of data, except when ReiserFS was called upon twice during that period to recover data during reboot when squirrels short-circuited power lines and brought the entire building down. The APC failed on both occasions, too, so I don't use them any more.
APC or UPS in general?
We've had nothing but problems with APC - just fails when called upon. Of course they always seem to have some explanation, but that completely overlooks the idea that this is the failsafe backup to prevent power loss - so if they can't work out the bugs and I still have a reasonable chance of losing power when utility power dies, what's the point of spending all that money to ensure I never lose power? They sputter and struggle, but never come up with a reply.
But who do you go to? APC seems to be a bigger monopoly than Microsoft.
Haven't used that fs
Good theory, stick with it
Too late...
]:)
WGA has no place in a business!
Operating Systems have become commodity items. Using license keys or serial numbers for an OS as if it were a $25,000 engineering package is laughable.
Let's not kid ourselves here - Microsoft says they are doing this to combat piracy - Unless WGA is 100% ineffective, they've reduced piracy by some unknown percentage.
How much has the price of Windows dropped since they introduced WGA again?
Lastly, WGA precludes windows from being used in a business. Right now, with reduced functionality, you may still have use of your computer. What's to stop Microsoft from ratcheting this down?
-Mike
Rebadging
Good for Apple. Isn't that the point of open systems? Why create another proprietary system?
And haven't those fellas in Redmond gone out and bought a whole slew of technologies they didn't invent? The difference is that once they rebadge something, it becomes proprietary.
Yes, MS does that all the time.
They HAVE developed one product themselves, and that is the original MS Basic that was in ROM on the first IBM PC. That was the one that Mr B. Gates himself was one of the coders on. Developed on a Unix system, if I remember correctly (which I might not on this last part).
Real issues
Fortunately, I ran across a program called Handy Recovery which helped minimize the loss. Even though I have difficulty recommending anything that uses any form of product activation. And Spinrite 6 verified that the drives were not affected by any physical problems. So it was apparently a problem in the file system.
I too thought that NTFS was very robust, but I'd never had that kind of catastrophic failure with FAT or FAT32, even a couple of times when disk compression was involved. Having 90% of the market proves nothing, since most users I know and even some of the IT people don't really understand that much about computers. It's a black box to them. They use what's available in their price range or what they're given. Most would still be using Win98, if factors didn't push them to newer versions.
I use best of breed software, rather than relying on a single source solution. No reason for Apple not to do the same. And it's not as if Microsoft has never done it. Have you noticed how many companies and software programs they've bought or co-opted over the years? They didn't even develop the original version of DOS. They have probably used far more of other people's work than Apple has. Or ever will. Not that I like Apple as a company, but the truth is what it is.
BS -Mr Harris
Sure, there are better file systems in the works. I'm pretty sure Microsoft has some proprietary ones in the works too that you are not privy to.
In any event, only an idiot would conclude the file system puts your data at risk. Aside from a serious system crash, what puts your data at risk are external factors like power failure, drive failure, drive damage from moving a powered-on external USB drive, etc.
Notwithstanding the dramatic BS headline, having a 10-15 minute UPS, with at least a mirrored array = NO PROBLEMS whether running NTFS or EXT3 or Reiser. In that one-in-a-million event where the journaling file system fails, you will fall back on your backup (which you should be doing anyway to protect against hardware failures).
A total BS story. You should find something useful to blog about.
Go back and read the article
If you haven't had any data corruption, count yourself lucky, not smart.
Also, I didn't say Microsoft was worse than the others - they are on a par. They all stink. But because they are the biggest vendor, they put more data at risk than any other vendor.
Isn't that obvious?
Robin