XML as document format rules!!!

George Ou thinks XML is an inefficient and unnecessary replacement for binary file formats. John Carroll disagrees, arguing that XML is the lego approach to managing data, and offers advantages in small-to-medium sized data management situations.
Written by John Carroll, Contributor

George Ou wrote a blog post last week in which he poured cold water on XML as a data format. To summarize his argument: XML is inefficient, fat, and unnecessary. Better to rely on more efficient binary formats, and leave programs to do what they are supposed to do, which is serve as an interface between the binary world of computers and the visual world of humans.

He makes a good point, but I think it is a reason to temper our enthusiasm rather than a reason to ditch XML as a data format. Don Box said something along those lines in a book on COM I read many years ago. Initiates to the world of COM may think, "Wow, this is a great way to write software; let's model EVERYTHING as a COM object." Well, that's a bad idea, because COM has overhead, and you can end up making your program bloated and slow as a result.

XML does have overhead. As a text-based format, it can increase storage requirements by 1000%, as Mr. Ou notes. Likewise, parsing it can take time, which means more CPU cycles spent handling XML. It's like the difference between a steak and a steak milkshake. One gets digested a lot faster (I don't know why that analogy popped into my head...just thought I'd creep everyone out).
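To get a rough feel for that overhead, here is a minimal sketch in Python. The records (an id and a temperature) and the element names are made up for illustration, and the ratio it prints depends entirely on the data and the tag names chosen, so treat it as a way to measure your own case rather than as a confirmation of any particular percentage.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical records: (id, temperature) pairs, invented for this sketch.
records = [(i, 20.0 + i * 0.5) for i in range(1000)]

# Binary encoding: a 4-byte unsigned int plus an 8-byte double per record,
# little-endian with no padding ("<Id" is 12 bytes).
binary = b"".join(struct.pack("<Id", rid, temp) for rid, temp in records)

# XML encoding of the same records, one <reading> element per record.
root = ET.Element("readings")
for rid, temp in records:
    reading = ET.SubElement(root, "reading")
    ET.SubElement(reading, "id").text = str(rid)
    ET.SubElement(reading, "temperature").text = str(temp)
xml_bytes = ET.tostring(root, encoding="utf-8")

ratio = len(xml_bytes) / len(binary)
print(f"binary: {len(binary)} bytes, XML: {len(xml_bytes)} bytes, ratio: {ratio:.1f}x")
```

Longer tag names or deeper nesting push the ratio up; values that are mostly text to begin with push it down.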

The lesson from all this is that XML is inappropriate for large blocks of data. Databases will always store their data in the "easily digestible" binary format, saving space and processing time in an area where performance is critically important. Situations where large amounts of data are to be exchanged over a network are also poor candidates for a text-based data format. You wouldn't ever consider encoding video in a text format, even though it is theoretically possible.

Even Microsoft, which strongly supports use of XML in its software, understands the performance penalty of XML. Yukon, a.k.a. SQL Server 2005, integrates support for XML natively. You can define columns as XML, and associate schemas with those columns to validate the data inserted into them. However, that XML is NOT stored as a text string. It is parsed and stored in an efficient binary format better suited to the high-throughput and high-performance needs of a database environment.

XML, however, does have advantages that become more important in small-to-medium sized data management situations.

Which is easier to deconstruct, an arbitrary blob of matter or an arbitrary blob of legos? Clearly, legos are easier to untangle, because legos connect in a rigorously well-defined fashion.

The same applies to XML. XML is a standard means of representing data. There are literally piles of XML validation, parsing and management tools, and the technologies for manipulating XML are standardized and well-understood. XML, therefore, takes the lego-approach to representing data. That approach can be very useful.

One use is that XML can be read by humans. That matters to developers, but it also matters indirectly to consumers.

I can look at the contents of an XML file and figure out what's wrong with it, or else generate a file manually that I know works. Granted, this assumes that the file in question is designed to be readable by humans, as opposed to "obfuscated" in order to prevent deconstruction. For instance, the following construct would be unhelpful:

<a>
      <c34>Los Angeles</c34>
      <b12>65%</b12>
      <d>too high</d>
</a>

...whereas the following would be better:

<locationInformation>
      <city>Los Angeles</city>
      <actorWannabePercentage>65%</actorWannabePercentage>
      <housingPrices>too high</housingPrices>
</locationInformation>
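The readable version also pays off when a program has to consume it, because the element names double as documentation. As a small sketch, here is how the snippet above could be read with Python's standard-library parser (the choice of Python and of `xml.etree.ElementTree` is mine, just one of the many standard tools available):

```python
import xml.etree.ElementTree as ET

# The readable example from the text, embedded as a string.
document = """
<locationInformation>
      <city>Los Angeles</city>
      <actorWannabePercentage>65%</actorWannabePercentage>
      <housingPrices>too high</housingPrices>
</locationInformation>
"""

root = ET.fromstring(document)

# Look up child elements by their self-describing names.
city = root.findtext("city")
housing = root.findtext("housingPrices")

print(f"{city}: housing prices are {housing}")
```

With the obfuscated version, the same code would have to ask for "c34" and "d", and nobody reading it would know why.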

Furthermore, I can tinker with individual settings, and do things which may not be possible in a piece of software designed to "manage" that file. For instance, I often have to change the names of Visual Studio projects, or else move them in the course of refactoring. Since the solution file (the one that links all the various projects so that they are displayed simultaneously in Visual Studio) is a plain-text file, and the project files it references are XML, I can manually change these settings in each of the solution files that reference my renamed / relocated project, even though there isn't some Visual Studio operation that would do this for me. That wouldn't be an option with binary formats.
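When the same rename has to be applied to many files, that kind of hand edit can also be scripted. The sketch below uses a made-up `<solution>` / `<project>` layout purely for illustration; it is not the real Visual Studio file format, and the `rename_project` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout, invented for this sketch.
# This is NOT a real Visual Studio solution or project format.
solution_xml = """<solution>
  <project name="OldName" path="src/OldName/OldName.proj"/>
  <project name="Utilities" path="src/Utilities/Utilities.proj"/>
</solution>"""

def rename_project(xml_text, old_name, new_name):
    """Rename one project entry and update its path to match."""
    root = ET.fromstring(xml_text)
    for project in root.iter("project"):
        if project.get("name") == old_name:
            project.set("name", new_name)
            project.set("path", project.get("path").replace(old_name, new_name))
    return ET.tostring(root, encoding="unicode")

updated = rename_project(solution_xml, "OldName", "NewName")
print(updated)
```

The point is less the script itself than the fact that nothing beyond a stock XML parser is needed to write it.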

That's a developer / support person benefit, but developers / support people are often called upon to help customers with problems. Given that XML is so much easier for humans to deal with, it makes supporting customers that much easier. That will matter as long as humans are called upon to write software and support other humans.

Since George Ou's original post fell out of a discussion of document formats, a final question might be: is XML a suitable document format? I think it is, because MOST documents fall into the "small" to "medium" sized category. Granted, I've seen some really monstrous Excel spreadsheets, but that's usually a sign of improper use of Excel. Excel is not a database, even though some try to use it that way.

In such cases, though, I think it's better to compress the really large document files, should the need arise, than to do away with XML altogether, simply because XML, with all its tools, parsing and management technologies, and human-readability advantages, offers so many benefits in 99.9% of all cases.
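As a rough sketch of what compression buys, the following Python builds a deliberately repetitive, made-up "spreadsheet" document and runs it through gzip. The element names and cell contents are assumptions for illustration, and real documents will compress more or less well depending on how repetitive they are.

```python
import gzip
import xml.etree.ElementTree as ET

# Build a hypothetical, highly repetitive spreadsheet-like document.
root = ET.Element("spreadsheet")
for row_number in range(5000):
    row = ET.SubElement(root, "row", number=str(row_number))
    ET.SubElement(row, "cell").text = str(row_number * 3)
    ET.SubElement(row, "cell").text = "quarterly total"
xml_bytes = ET.tostring(root, encoding="utf-8")

# Compress, then decompress to confirm that nothing is lost.
compressed = gzip.compress(xml_bytes)
restored = gzip.decompress(compressed)

print(f"XML: {len(xml_bytes)} bytes, gzipped: {len(compressed)} bytes")
print("round trip intact:", restored == xml_bytes)
```

The repeated tags that make XML fat are exactly what a general-purpose compressor handles well, and the decompressed file is still the same human-readable XML.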