OASIS? XML? Permanence?

Microsoft's mind share means that whatever it calls XML becomes XML.
Written by Paul Murphy, Contributor

My favorite CD has Ormandy's Philadelphia Orchestra playing the Shostakovich Fifth. The recording was made in (I think) 1963, with the CD replacing the LP sometime in the early nineties. Both, however, still work as well as they ever did, with the CD clearer but colder than the record. In contrast, I've got a 3480 tape cartridge somewhere with several hundred functionally unrecoverable documents I worked on for an Alberta government department in the mid to late eighties. They required the use of a proprietary IBM word processor on PC-DOS, and even if I had the software and the drive, I don't think the tape would still be readable. Those documents are lost.

For my own use I get around this problem by doing most things first as unformatted text and, when I use FrameMaker instead of vi, making parallel backup copies of my FrameMaker files using its text output function. Unfortunately that's not a solution I can recommend to clients whose ideas about word processors start and end with Microsoft Word. So what can I suggest to them?

Is XML, in any variant including the OASIS and related technologies, part of the answer?

Notice that software and formatting are only part of the problem - tape storage has problems because of print-through, CDs and DVDs lose information over time, and so do standard disks. Yes, you can buy drives and media certified for 50 years, but no one's had them for even 15 years, so how much do you want to bet on that stuff?

It's the software stability issue that's the killer here. Get everybody in the organization using the same word processing tools and the problem takes care of itself, right? Wrong. OpenOffice.org actually opens and correctly formats more kinds of Microsoft Word documents than Microsoft Word does - and the more rapid and adaptive change becomes, the worse this problem gets.

So is XML likely to go the same way? In principle, no; in practice, I think so.

In principle, XML is just a set of rules for the derivation of a document type definition [DTD] that is then used to describe document formatting. Thus SAML is an XML exemplar, and a tool like FrameMaker that can import an XML-compliant DTD like SAML can be used to recover both the original text and its look and feel.
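To make the in-principle half of that concrete, here is a minimal Python sketch of my own, not anything taken from a real product. The memo, title, and para element names and the tiny inline DTD are invented for illustration. It shows only that the text of a well-formed XML document can be pulled back out with a generic parser; getting the look and feel back still depends on having the DTD and a tool that understands it.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the DTD and element names below are invented
# for illustration and do not come from any real word processor.
SAMPLE = """<!DOCTYPE memo [
  <!ELEMENT memo (title, para+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT para (#PCDATA)>
]>
<memo>
  <title>Retention policy</title>
  <para>Keep every file for 60 years.</para>
  <para>Review the storage media annually.</para>
</memo>
"""


def recover_text(xml_source):
    """Return the non-empty text runs of an XML document, in document order."""
    root = ET.fromstring(xml_source)
    # itertext() walks every element and yields its text and tail strings;
    # whitespace-only runs (the indentation between tags) are dropped.
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]


if __name__ == "__main__":
    for line in recover_text(SAMPLE):
        print(line)
```

Note that the parser here never uses the DTD to interpret anything: the words survive, but what a "title" or a "para" should look like on the page is exactly the part that lives in the DTD and the software that reads it.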

In practice, however, Microsoft's mind share means that whatever it calls XML becomes XML. They've apparently given up on their first idea of turning XML into a web programming language, but that doesn't mean we'll get stability in document formatting. Indeed, we've already seen Word's basic DTDs evolve and accrete as Microsoft's needs and goals changed. Already, for example, people who bought into the lockable and one-read document ideas on offer in 2001 face hard choices: write these documents off as unrecoverable, or face the cost of having someone read them using old software and then write them using the newer product.

Don't misunderstand: this isn't a Microsoft or even a PC issue. The problem is that XML is quite stable in principle but not in practice. In practice, specific DTDs change over time, or the technology needed to process them changes or disappears. Thus in Word's case it's the DTDs that disappear or change over time, while the WordPerfect case I mentioned on Monday, although not XML, illustrates what happens when technology and our access to it change.

Either way, of course, our cost of recovering the stored information and formatting rises with each change, and whether that change takes place in the technology or the market doesn't really matter. At some point that curve goes exponential and it becomes first impractical, and then functionally impossible, to recover the information.

So what do you tell a client legally required to keep documents on file and accessible for a minimum of 60 years? I'm thinking of killing the problem with hardware and knocking off document search in the process - on which more tomorrow.
