XML as document format rules!!!

Summary: George Ou thinks XML is an inefficient and unnecessary replacement for binary file formats. John Carroll disagrees, arguing that XML is the lego approach to managing data, and offers advantages in small-to-medium sized data management situations.

George Ou wrote a blog post last week where he poured cold water on XML as a data format. To summarize, XML is inefficient, fat, and unnecessary. Better to rely on more efficient binary formats, and leave programs to do what they are supposed to do, which is serve as interface between the binary world of computers and the visual world of humans.

He makes a good point, but I think it calls more for tempering our enthusiasm than for ditching XML as a data format. Don Box said something along those lines in a book on COM I read many years ago. Initiates to the world of COM may think, "Wow, this is a great way to write software; let's model EVERYTHING as a COM object." Well, that's a bad idea, because COM has overhead, and you can end up making your program bloated and slow as a result.

XML does have overhead. As a text-based format, it can increase storage requirements by 1000%, as Mr. Ou notes. Likewise, parsing it can take time, which means more CPU cycles spent handling XML. It's like the difference between a steak and a steak milkshake. One gets digested a lot faster (I don't know why that analogy popped into my head...just thought I'd creep everyone out).
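
To make the size overhead concrete, here is a minimal Python sketch comparing the same three integers packed as binary and marked up as XML. The record and field names are invented for illustration:

```python
import struct
import xml.etree.ElementTree as ET

# Three 32-bit integers: 12 bytes in packed binary form.
# (The record and field names here are made up for illustration.)
record = (1024, 2048, 4096)
binary = struct.pack("<iii", *record)

# The same record as XML, with descriptive tag names.
root = ET.Element("record")
for name, value in zip(("width", "height", "depth"), record):
    ET.SubElement(root, name).text = str(value)
xml_text = ET.tostring(root)

print(len(binary))    # 12
print(len(xml_text))  # 76, over six times the binary size
```

The exact ratio depends on how verbose the tag names are, but a multiple-fold increase is typical for small, dense records like this.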

The lesson from all this is that XML is inappropriate for large blocks of data. Databases will always store their data in the "easily digestible" binary format, saving space and processing time in an area where performance is critically important. Situations where large amounts of data are to be exchanged over a network are also poor candidates for a text-based data format. You wouldn't ever consider encoding video in a text format, even though it is theoretically possible.

Even Microsoft, which strongly supports use of XML in its software, understands the performance penalty of XML. Yukon, a.k.a. SQL Server 2005, integrates support for XML natively. You can define columns as XML, and associate schemas with those columns to validate the data inserted into them. However, that XML is NOT stored as a text string. It is parsed and stored in an efficient binary format better suited to the high-throughput and high-performance needs of a database environment.

XML, however, does have advantages that become more important in small-to-medium sized data management situations.

Which is easier to deconstruct, an arbitrary blob of matter or an arbitrary blob of legos? Clearly, legos are easier to untangle, because legos connect in a rigorously well-defined fashion.

The same applies to XML. XML is a standard means of representing data. There are literally piles of XML validation, parsing, and management tools, and the technologies for manipulating XML are standardized and well understood. XML, therefore, takes the lego approach to representing data. That approach can be very useful.

One advantage is that XML can be read by humans. That matters to developers, but it also matters indirectly to consumers.

I can look at the contents of an XML file and figure out what's wrong with it, or else generate a file manually that I know works. Granted, this assumes that the file in question is designed to be readable by humans, as opposed to "obfuscated" in order to prevent deconstruction. For instance, the following construct would be unhelpful:

<a>
      <c34>Los Angeles</c34>
      <b12>65%</b12>
      <d>too high</d>
</a>

...whereas the following would be better:

<locationInformation>
      <city>Los Angeles</city>
      <actorWannabePercentage>65%</actorWannabePercentage>
      <housingPrices>too high</housingPrices>
</locationInformation>
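
As a quick illustration of how standard the tooling is, the readable version above can be parsed with Python's bundled ElementTree module, no custom parser required:

```python
import xml.etree.ElementTree as ET

doc = """
<locationInformation>
      <city>Los Angeles</city>
      <actorWannabePercentage>65%</actorWannabePercentage>
      <housingPrices>too high</housingPrices>
</locationInformation>
"""

# Parse the document and pull out individual fields by tag name.
root = ET.fromstring(doc)
print(root.find("city").text)           # Los Angeles
print(root.find("housingPrices").text)  # too high
```

The same document could just as easily be validated against a schema or transformed with XSLT, all with off-the-shelf tools.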

Furthermore, I can tinker with individual settings and do things that may not be possible in the software designed to "manage" that XML file. For instance, I often have to rename Visual Studio projects, or move them in the course of refactoring. Since the solution file (the one that links the various projects so that they are displayed together in Visual Studio) is an XML file, I can manually change these settings in each solution file that references my renamed or relocated project, even though there is no Visual Studio operation that would do this for me. That wouldn't be an option with a binary format.
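
A rename like this can even be scripted. The sketch below is hypothetical (the directory layout, file names, and the blind text replace are all assumptions, and far cruder than a real refactoring tool), but it is only possible at all because the files are readable text:

```python
from pathlib import Path

# Hypothetical helper: after renaming a project, update every solution
# file in a directory that still mentions the old name. A plain text
# replace like this works only because the files are readable text,
# not opaque binary.
def rename_project(directory, old_name, new_name):
    for sln in Path(directory).glob("*.sln"):
        text = sln.read_text()
        if old_name in text:
            sln.write_text(text.replace(old_name, new_name))
```

In practice you would want something more careful than a global replace, but even this crude version is more than any binary format would permit.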

That's a developer / support person benefit, but developers and support people are often called upon to help customers with problems. Given that XML is a much easier format for humans to deal with, it makes supporting customers that much easier. That will matter as long as humans are called upon to write software and support other humans.

Since George Ou's original post fell out of a discussion of document formats, a final question might be: is XML a suitable document format? I think it is, because MOST documents fall into the "small" to "medium" sized category. Granted, I've seen some really monstrous Excel spreadsheets, but that's usually a sign of improper use of Excel. Excel is not a database, even though some try to use it that way.

In such cases, though, I think it's better to compress the really large document files, should the need arise, than to do away with XML altogether, simply because XML, with all its tools, parsing and management technologies, and human-readability advantages, offers so many benefits in 99.9% of all cases.
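
As a concrete sketch of the compress-the-big-files option: a compressed XML document is just a zip archive wrapping readable markup. The toy Python example below builds a made-up, in-memory archive (not a real Word or OpenOffice file) laid out the same way, a zip holding a content.xml:

```python
import io
import zipfile

# Build a toy "compressed XML" document in memory: a zip archive
# holding a single content.xml, the same basic layout OpenDocument uses.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", "<doc><p>Hello</p></doc>")

# Any zip tool can pull the XML back out for inspection, something
# an opaque binary format does not allow.
with zipfile.ZipFile(buf) as archive:
    content = archive.read("content.xml").decode()

print(content)  # <doc><p>Hello</p></doc>
```

The human-readability benefit survives compression: anyone with a zip utility can still get at the markup inside.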

About John Carroll

John Carroll has delivered his opinion on ZDNet since the last millennium. He has not been a Microsoft employee since May 2008, and currently works at a unified-messaging startup.

Talkback

  • Solution or stopgap?

    XML is being used widely not because it rules, but to solve problems involving communication of data.
    When XML has obvious flaws, it's difficult to think that something better won't be devised and implemented.

    You see the issues, but believe that the problems can be mitigated sufficiently to allow XML to continue in the role designated for it.

    Why not consider it temporary, and thus a waste of time, unless the problems it solves are so severe that a temporary solution must be rushed into place and supported?
    Anton Philidor
    • How long is temporary?

      [i]Why not consider it temporary, and thus a waste of time, unless the problems it solves are so severe that a temporary solution must be rushed into place and supported?[/i]

      Barring plaintext, can you point to a data format that is still readable 40 years after it was introduced?

      If archival preservation of data is a strict requirement, do you have a suggestion that isn't some variant of "plain text?"
      Yagotta B. Kidding
  • Where XML should be used

    I'm still not so sure XML makes a good office document type. I think XML for word processing is fine, because the ratio of tags to actual paragraph data is minimal, which results in minimal overhead.

    Although I agree that Excel is used as a database too often, there are plenty of situations where there are legitimate large Excel files. We should never dismiss efficiency because the cost of hardware is going down. These things have a tendency to creep up on us. Some of the talkback in my original blog has pointed out that the new XML formats are already compressed. This would seem to defeat the entire "human-friendly" (I never liked that claim; it should be "developer-friendly") argument. That might fix the size issue, but it would seem like a waste of CPU cycles to bloat something and then compress it back down.
    george_ou
    • Question..

      isn't a binary format the same thing? Doesn't a binary format have to be "bloated" back up in order to be parsed by the software?

      As for "human friendly," a compressed file is still human friendly. The user can unzip the file and look at the contents. One cannot do that with a binary file format.
      Patrick Jones
      • Very funny

        "isn't a binary format the same thing? Doesn't a binary format have to be "bloated" back up in order to be parsed by the software?"

        ROTFL: No

        Everything is a tight binary format to begin with, and it's a simple, straight conversion to and back when you're dealing with binary formats. If you understood how computers and CPUs worked, this would be obvious to you. What you're asking for is binary to ASCII XML to zipped binary. You can skip the second two phases altogether.
        george_ou
        • Not really

          The binary format has to be read in, parsed for rendering, then sent to the display. Yes, the binary format may be quicker, but you don't just read in the binary format and have it magically appear on the screen. Not knowing how Word does a save, I can't say whether it just "dumps its memory" to file; for some reason, I don't think it does. So it would have to be converted from the display format back to the binary save format. With on-the-fly compression, display format to XML to zip could be one fell swoop, just like zip to XML to display format. You may lose a little speed in opening and closing, but I think an open, readable format is worth the tiny cost.
          Patrick Jones
          • Binary is orders of magnitude faster

            Of course it isn't a straight memory dump, but I can assure you it's orders of magnitude faster. From an assembler standpoint, even the conversion to an efficient ASCII format is bloated. Conversion to bloated XML and then ZIP is an extra 3 steps slower than a binary process.
            george_ou
          • Word doesn't create compressed XML files so I can't check it

            Using Word 2003 I opened two different files. A plain document, 23 pages, 250K. I saved it to XML, uncompressed, and it created a 500K document. Opening and closing both took the same amount of time. I then opened a 33M document that had pictures. It created a 45M uncompressed XML file. Opening and closing both took about the same amount of time.

            Now, if I compress the first, it takes 2 seconds and creates a 40K file. If I zip the second, it takes about 20 seconds and creates a 35M file. So even if Word unzipped the file first, it would still only add 20-30 seconds to the time.

            I am not seeing orders of magnitude in speed difference. Granted, my tests are not completely scientific nor cover all possibilities.
            Patrick Jones
          • ONLY 20-30 seconds?

            Did you read what you wrote? "So even if Word unzipped the file first, it would still only add 20-30 seconds to the time."

            If you open 10 files a day, you're wasting 5 minutes waiting for the files to unzip. 5 minutes x 5 days/week x 52 weeks/year = 1300 minutes/year = roughly 21.7 hours/year waiting for files to unzip. No thanks.

            Granted, this assumes that you zip everything, but look at the storage requirement differences that you quote (250K current format, 500K XML--100% increase; 33 M current format, 45M XML--roughly 36% increase). Multiply that across every file you create. Storage is cheap, but not that cheap.

            I am seeing magnitudes of size and speed difference--George Ou is right on this one.
            tmurph1810
          • One day a year is not bad...

            I have some Word files that take 5-10 minutes to open. I have some AutoCAD files that take 10-15. Nothing is ever going to be instant. Those times are good for bathroom breaks, coffee breaks, and even ZDNet breaks :)
            Patrick Jones
          • Suggestion

            [i]Using Word 2003 I opened two different files. A plain document, 23 pages, 250K. I saved it to XML, uncompressed, and it created a 500K document. Opening and closing both took the same amount of time. I then opend a 33M document that had pictures. It created a 45M uncompressed XML. Opening and closing both took about the same amount of time.[/i]

            To complete the set, load them into OpenOffice.org and save as OO.o native, which is compressed XML.

            For extra credit, compare the 1.1.4 version and the latest 2.0 beta.
            Yagotta B. Kidding
          • Not even close to scientific

            Your testing isn't even close to ballpark accurate. The times you clocked mostly come from other factors, because your files are too small.
            george_ou
          • A 30M file is too small?!?

            How is a 30M Word document too small? I would say that is fairly large for a Word file.
            Patrick Jones
    • Wrong question/focus

      The issue being addressed by XML as document format is portability.

      MA looked at their requirements for open government. Their data needs to be accessible and survive for many years and an untold number of system changes during that time.

      This being the case the issue/question wasn't "What's the best format for my word processing software to use?" The issue/question *IS* "What's the best format for my data so I can access it 10 years from now?"

      Disk space and CPU cycles are cheap these days. It is much better to consume and burn those instead of having to hire a collection of people every two years to migrate your files to the latest supported binary version of word-processor du-jour.
      Robert Crocker
      • Use some common sense

        "What's the best format for my data so I can access it 10 years from now"

        I can assure you it will still be .DOC and .XLS 10 years from now. It is the de-facto standard. Open Doc versus MS document format is like Esperanto versus English.
        george_ou
        • Perhaps...

          But dominant languages do change over time. The advantage Esperanto has over English is stability and consistency. There are quite a number of 'English' speakers who are effectively unable to communicate with other 'English' speakers, and I'm speaking of first-language speakers too.

          Considering the speed of change in the IT industry, making such a prediction 10 years out seems like wishful thinking, regardless of what is dominant now. Can I pose the question: how many versions of .DOC and .XLS currently exist?
          Zinoron
          • Wrong analogy

            Esperanto is currently a dead language (nobody uses it).

            Whatever the speed of change in the IT industry, when a software maker (i.e. Microsoft) changes its file format (e.g. .DOC for MSWord), just be assured that the new release of its software will be able to read (and sometimes even to write) the previous file format.

            Ask me why? Simply because this software maker will want to convert its customer base to the new release!

            So your question about the number of .DOC and .XLS formats seems to me particularly irrelevant, since I think that Microsoft (and I am far from being an advocate of the Redmond corporation) has always provided new releases of MSWord and Excel able to work with files produced by any of their respective previous releases.

            Just my two cents.
            Furball Tipster.
            furballtipster
          • Small-angle approximation

            [i]Then your question with the number of .DOC and .XLS formats seems to me particularly irrelevant since I think that Microsoft (and I am far to be an advocate of the Redmond corporation) always provided new releases of its MSWord and Excel able to work with files produced with any of their respective previous releases.[/i]

            Only the most recent versions, actually. Import from Office2K to OfficeXP is fairly good, from O97 is spotty, from O95 gets ugly, and from O6 or earlier you're better off reentering from hardcopy.

            In all cases you have to do some manual fixup to deal with changes in the semantics of rarely-used features like tables (As far as I know, MS has not once kept table margin definitions the same between releases of MSWord.)
            Yagotta B. Kidding
          • Your suggestions are ludicrous

            "Only the most recent versions, actually. Import from Office2K to OfficeXP is fairly good, from O97 is spotty, from O95 gets ugly, and from O6 or earlier you're better off reentering from hardcopy."

            What are you talking about? You seem to be fixated on manually reentering. Your comment isn't even worth defending. I never have any problem with older office documents. Even if there are a few minor formatting glitches, it hardly justifies manual re-entry.
            george_ou
          • George - you don't know what you are talking about

            "always provided new releases of its MSWord and Excel able to work with files produced with any of their respective previous releases."

            Just try a file produced by Word for Windows 3.1 and see how much of it is readable by Office 2003. See how much tweaking you have to do because the object model was changed.

            I've been a professional writer since before Windows was released, and I've been fighting the constantly shifting format changes, obsoleted formats, and lack of backward readability as long as I can remember.
            Tsu Dho Nimh