Proof that XML is extremely bloated

Summary: When I wrote this blog blasting the massive bulk of XML, our own John Carroll responded with a defense of XML for some situations. John and I both agree that XML should not be used to handle large amounts of data because of increased storage and processing requirements; we only disagree on where the cut-off should be.

When I wrote this blog blasting the massive bulk of XML, our own John Carroll responded with a defense of XML for some situations. John and I both agree that XML should not be used to handle large amounts of data because of increased storage and processing requirements; we only disagree on where the cut-off should be. Some readers noted that XML word processing documents were actually smaller than Microsoft Word documents, and this is true because the ratio of XML tags to the size of the paragraphs is minimal. To clarify my position, I was actually complaining about bloated XML spreadsheets and databases rather than word processing documents.

I quoted 1000% bloat in reference to spreadsheets and databases, but one of our regular readers, "Yagotta B. Kidding," pressed me to present some hard evidence. Reader Patrick Jones responded by offering a spreadsheet that was stored as an 11 megabyte XML file and as a 3 megabyte XLS file. Jones noted that the XML file was very compressible: 194 kilobytes when zipped, to be exact. In that sense, one could argue that XML files can actually be smaller when ZIP is used, but I should warn that this may not be the best example because of the amount of redundant data in the sample that Jones provided. We also have to take into account that compression takes additional resources, and we are essentially left with a binary file. If you want to run your own experiments, I've zipped up a copy of this sample XML file and placed a copy of it here.
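
If you want to check the compression for yourself, here is a rough Python sketch using the standard zipfile module; the file name 200264.xml is just a placeholder for wherever you extract the sample.

    import os
    import zipfile

    src = "200264.xml"  # placeholder name for the extracted sample spreadsheet

    # Compress with ordinary ZIP/DEFLATE, the same scheme most archivers use.
    with zipfile.ZipFile("200264-test.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=os.path.basename(src))

    raw = os.path.getsize(src)
    packed = os.path.getsize("200264-test.zip")
    print(f"raw: {raw:,} bytes  zipped: {packed:,} bytes  ratio: {raw / packed:.1f}:1")

Keep in mind that the ratio you see depends heavily on how repetitive the data is, which is exactly why this particular sample compresses so well.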

When I ran my tests, I noticed that opening the 11 MB XML file took substantially longer than opening the 3 MB file in Excel's native XLS binary format. To get a more accurate measurement of the difference, I made the sample larger by duplicating the entire sheet within the sample XML file. I did this by right-clicking the bottom tab and copying the sheet. Once I had two sheets, I highlighted both and made 4, then 8, and then 16 sheets. By the eighth sheet my laptop was getting a pretty good workout, and getting to 16 sheets nearly locked it up because it was running out of memory. After the fourth duplication I had 16 sheets that took 193 megabytes on disk. Saving this XML file took 45 seconds, and opening it took 46 seconds. Saving the same file in Microsoft's native XLS format took a mere 7 seconds, and opening it was even faster at 2 seconds. Compression with IZArc took an additional 26 seconds, and uncompressing the file took another 19 seconds. Here is a breakdown of this simple little experiment.

File                  Size          Read      Write
Native XLS binary     50,995 KB     2 sec     7 sec
XML spreadsheet       192,892 KB    46 sec    45 sec
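
For a rough sense of where the read-time gap comes from, here is a minimal Python sketch that times a full parse of the XML file against a raw read of the binary file; the file names are placeholders, and neither step reproduces Excel's own load path.

    import time
    import xml.etree.ElementTree as ET

    # Placeholder file names for the 16-sheet test files described above.
    start = time.perf_counter()
    ET.parse("16sheets.xml")                  # full parse of the XML spreadsheet
    print(f"XML parse: {time.perf_counter() - start:.1f} sec")

    start = time.perf_counter()
    with open("16sheets.xls", "rb") as f:     # raw read of the binary file
        f.read()
    print(f"XLS read:  {time.perf_counter() - start:.1f} sec")

A raw read obviously isn't the same as Excel interpreting the XLS records, but it shows how much of the gap the XML parse alone accounts for.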

Some of you will note that this particular XML file is only 3.78 times bigger than the XLS file, but this particular sample doesn't have that many fields, which keeps the number of XML tags down. In this second sample, the XML version is 10.6 times bigger than the CSV file and 7.7 times bigger than the XLS version. So the 1000% bloat figure I originally quoted might not always hold, but it is certainly possible. Using compression would solve the storage and transmission problems, but it worsens the processing and memory requirements for using XML.
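
To see where the multiplier comes from, compare what a single numeric value costs in CSV with roughly what it looks like as a cell in Excel's 2003 SpreadsheetML; the markup below is illustrative rather than an exact dump.

    # One value in CSV: the digits plus a separator.
    csv_cell = "1234.56,"
    # Roughly the same value as a SpreadsheetML cell (illustrative markup).
    xml_cell = '<Cell><Data ss:Type="Number">1234.56</Data></Cell>'

    print(len(csv_cell), "bytes in CSV")   # 8
    print(len(xml_cell), "bytes in XML")   # 50
    print(f"overhead factor: {len(xml_cell) / len(csv_cell):.1f}x")

The fewer characters of real data each cell carries, the worse that ratio gets, which is why a sample with few fields fares better than one with many.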

The bottom line is that the large sample XML file was excruciatingly slow, taking more than 20 times longer to read and 6 times longer to write. It doesn't matter that computers are faster today; the problem is that these are huge multipliers that greatly reduce the speed and capacity of any system, no matter how fast it is. In my business, where I'm responsible for server and network architecture, these are huge concerns of mine. People tend to forget that the purpose of better hardware is to get better performance and capacity; it's not so that we can merely maintain our performance and capacity in spite of bloated software.

Talkback

  • Just for comparison

    Would you please put the humongous XLS somewhere for download? I'd like to run comparisons using OpenOffice.org (for instance) and don't want to get into dueling datasets at this point.

    Note that if you want a maximum-entropy spreadsheet, it's possible to populate one with not only random data but random data types. I'll crank one up if you like.
    Yagotta B. Kidding
    • Just generate it yourself

      Extract the CSV file, open it in Excel, and save it as an XML spreadsheet. Highlight the bottom sheet tab, hit "Move or Copy," and make a copy. Then highlight both tabs and make a copy and you will have four. Then highlight all four to make 8, and repeat one more time to get 16.
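
      (If you'd rather script the duplication than click through Excel, here is a rough Python sketch under the assumption that the data is already saved as a 2003 SpreadsheetML XML file; sample.xml and the "Copy" sheet names are placeholders, and Excel can be picky about this format, so treat it as illustrative only.)

          import copy
          import xml.etree.ElementTree as ET

          SS = "urn:schemas-microsoft-com:office:spreadsheet"
          ET.register_namespace("ss", SS)    # keep the ss: prefix in the output

          tree = ET.parse("sample.xml")      # placeholder name for the saved XML spreadsheet
          root = tree.getroot()
          original = root.find(f"{{{SS}}}Worksheet")

          # Append 15 copies of the first sheet so the workbook ends up with 16 sheets.
          for i in range(15):
              clone = copy.deepcopy(original)
              clone.set(f"{{{SS}}}Name", f"Copy{i + 1}")   # sheet names must be unique
              root.append(clone)

          tree.write("sample16.xml", xml_declaration=True, encoding="UTF-8")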

      Sorry I can't post one; I don't want to host a massive file that eats up all my bandwidth. If you have an FTP server you want me to send it to, just email me.

      The second sample I posted in the blog has maximum entropy because I used a random number generator to populate the sheet.
      george_ou
    • Actually, get it here

      http://www.lanarchitect.net/Examples/200264-l.zip

      I forgot it compressed so much. Here is a 2 MB file.
      george_ou
    • The problem is with the software, not XML

      Everybody (including the inventors) knows XML is bloated, but note that it's also very compressible. The thinking behind it is that it [i]should[/i] be compressed for transmission and archival.

      As for slow loading: blame the application, not the file format. OK, XML is always going to be a bit slower because it's ASCII, but if it's 20 times slower then the file-reading code needs looking at.
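
      (As a rough illustration of what "the file-reading code needs looking at" can mean: a streaming parser keeps memory flat instead of building the whole document tree at once. A minimal Python/ElementTree sketch, with 200264.xml as a placeholder for the sample file; this says nothing about how Excel is actually written.)

          import xml.etree.ElementTree as ET

          # Stream the spreadsheet instead of loading the whole tree into memory.
          # "200264.xml" is a placeholder for the sample file from the blog entry.
          rows = 0
          for event, elem in ET.iterparse("200264.xml", events=("end",)):
              if elem.tag.endswith("}Row"):   # SpreadsheetML <Row> elements, ignoring the namespace
                  rows += 1
                  elem.clear()                # discard the subtree we no longer need
          print("rows parsed:", rows)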

      Then there's always the sneaking suspicion that Microsoft is perfectly capable of making open file formats load slowly on purpose... Cue scary music and black helicopters.
      figgle
      • Don't even go there

        "Then there's always the sneaking suspicion that Microsoft is perfectly capable of making open file formats load slowly on purpose"

        Don't even go there. Microsoft Office can't open OpenOffice file formats; this is Microsoft's own XML format. OpenOffice took twice as long to open an XML file half the size, based on Yagotta B. Kidding's numbers. I'm still waiting for him to post the full 16-sheet numbers.

        Now if you want to say this isn't fair because it isn't in the OpenOffice format, go ahead and save and open this 16-sheet spreadsheet in native OpenOffice format and post the times here. I suspect it will be even slower because of the compression.
        george_ou
  • Well said!

    Very well said! I hate XML. XML is the root of all evil.
    mbraincell9
    • The trouble with generalizing is.....

      Showing that one program (i.e., Excel) loads XML slowly isn't the same thing as proving that all XML files are slow.

      If I was a bit more paranoid I might think that Microsoft deliberately [b]de[/b]-optimized their "open format" file loaders.
      figgle
      • You're not paranoid, you're just wrong

        "If I was a bit more paranoid I might think that Microsoft deliberately de-optimized their "open format" file loaders."

        You're not paranoid, you're just wrong. Microsoft doesn't have any Open Format loaders. Now if you don't like this, go ahead and try to answer my challenge in your previous post.
        http://www.zdnet.com/5208-10533-0.html?forumID=1&threadID=13257&messageID=266631&start=1
        george_ou
  • The cut-off point is relative

    It all depends on where the cost of the overhead associated with parsing and managing XML outweighs the benefits derived from the "lego" qualities of XML. I would never want a database that stored its data in XML.

    Flexibility might also be a factor to take into consideration. If you need to do lots of weird, task-specific things for which no UI operation exists, then you might be willing to let the XML file get a bit bigger in the interest of keeping the flexibility benefits.

    The same cost/benefit analysis applies to middleware platforms like Java and .NET. All those nifty productivity-enhancing things (security verification and garbage collection handled for you, etc.) have a cost, and in some cases, that cost is too high. That's why C/C++ will stay the workhorse of performance applications.

    There was a day, though, when those looking for REAL performance fell back to assembly language. These days, processors are so fast that those performance gains aren't sufficient to justify the opacity of assembler code. I expect that as processors get faster, the same will apply to XML. I don't expect to EVER see databases storing everything in XML, but I can see more and more people / sites getting comfortable with processing very large office documents (100MB and bigger), simply because the cost has gone down due to faster processing speeds.
    John Carroll
    • This is scary

      John and I are in complete agreement.
      Yagotta B. Kidding
    • But why not a binary format?

      "It all depends on where the cost of the overhead associated with parsing and managing XML outweighs the benefits derived from the "lego" qualities of XML"

      Why can't this be done with a binary format where integers are stored as integers and tags are kept to a minimum? Why does it have to be so verbose? Is it really worth 10 times the memory, processing, network traffic, and storage? Doesn't .NET handle SQL databases extremely well?
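
      (To put a number on it, here is a tiny Python sketch comparing a 32-bit integer stored as raw binary against the same value wrapped in SpreadsheetML-style markup; the markup is illustrative, not an exact dump from Excel.)

          import struct

          value = 1234567890

          # Raw binary: a 32-bit little-endian integer is exactly 4 bytes.
          as_binary = struct.pack("<i", value)

          # The same value in SpreadsheetML-style markup (illustrative only).
          as_xml = '<Cell><Data ss:Type="Number">1234567890</Data></Cell>'.encode()

          print(len(as_binary), "bytes as binary")  # 4
          print(len(as_xml), "bytes as XML")        # 53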

      "There was a day, though, when those looking for REAL performance fell back to assembly language"

      That still applies to games and audio/video processing. Remember, we're talking about multiplier factors here, so it doesn't matter how fast and cheap the processor gets. You can never have too much resolution and detail.

      It's easy to justify bloated software on servers because they're so cheap and fast now, but I've always felt that software lag is one of the most annoying things to users. You can never have too much speed and capacity on a server. All too often developers have a tendency to rationalize bloat with hardware gains.
      george_ou
      • Never too much

        [i]You can never have too much resolution and detail level.[/i]

        Actually, yes you can. The human vision system has a finite resolution after all. If I recall correctly, it's somewhere in the 3000x3000 pixel range, with about 10 bits of grayscale and rather less than 16 bits total of color discrimination.

        In other words, we're not far away from the practical limit of display resolution.
        Yagotta B. Kidding
        • nonsense

          Where do you get this arbitrary 3000x3000 figure? 3000x3000 pixels might be enough to cover a certain viewing angle.

          The thing is, the human eye has very high resolution at the center but is mostly blurry at the edges. Try staring at one letter on the screen and see if you can read anything else on the screen without moving your eyes. You can't! The point is, a screen could be 10,000 pixels wide and it still wouldn't be enough, because our eyes can essentially zoom in on any part of the screen.

          Now if we were able to "inject" an optical stream directly into the eyes or optic nerve, we could probably get away with a few hundred kbps, so long as we can replicate detail where it is needed and the image changes as soon as we refocus or move our eyes.

          It will be a long time before we have enough processing power to replicate "the Matrix" and totally immerse all of our senses into believing in a virtual reality. There would also need to be some other major advances in direct brain interfaces. Efficient code will always be valued.
          george_ou
          • Pixel count

            [i]Where do you get this arbitrary 3000x3000 range? 3000x3000 pixels might be enough to cover a certain range in viewable angle.[/i]

            That's about the visual resolution of the entire eye.

            Panning, of course, means that if you want to you could cover your entire office on all surfaces with several hundred dpi, but there's really no point in it since you can only grab a few megapixels at a time.

            Once you get to the point of panning, there are any number of solutions that are more ergonomic than increasing the size of the display.

            Note, BTW, that there is most emphatically a limit to the benefit of increased resolution. Optical health dictates a screen-to-eye separation of at least 45 cm, and the fovea has a limited rod/cone density that maps to a pixel density at that distance. Likewise, there's a serious limit to how much space you want that display to occupy.

            Direct retinal projection avoids those issues, but then you're right back to sensing panning at the eye level.
            Yagotta B. Kidding
          • Panning is the point

            You wouldn't even need 3000x3000 pixels if you instantly changed the image as soon as the eyes refocus or pan. It would need to be very sharp at the center and very vague at the edges. So long as you could produce a new image every 1/90th of a second, it should look very realistic. The point is that 3000x3000 is not a meaningful number. Movie theaters are moving to 4K projectors.
            george_ou
      • Tradeoffs

        [i]Why can't this be done with a binary format where integers are stored as integers and tags are kept to a minimum?[/i]

        As John points out, it's a tradeoff between machine efficiency (the binary file uses fewer machine resources) and human efficiency (the XML is more reusable, easier to test, can reuse parsers, etc.).

        Put another way, native binary data streams are like programming in raw binary. XML is like programming in a high-level language. Such languages can either be interpreted (the usual practice today) or compiled; so far TPTB have resisted compiled XML for a number of reasons.

        Once upon a time, programmers carefully calculated where on the drum memory an instruction got loaded so that branches would execute without taking an extra turn of the drum. Later the machines got cheaper and the humans found other things to do.

        Were I you, George, I'd switch tactics from ranting about the inefficiencies of XML [i]per se[/i] to lobbying for XML compilers on the grounds that they preserve the human benefits of XML as a data-description language but remove the recurring costs as a data language.
        Yagotta B. Kidding
      • Processor speed

        The original IBM PC came out with a 4.77 MHz CPU.
        I now have a 2.66 GHz laptop, roughly 550 times the clock speed, yet my computer doesn't feel anywhere near that much faster. Why?

        Bloatware, bad code, useless features. All in the name of functionality.
        Does it do word processing faster?
        Spreadsheets faster?

        I could boot up in 30 seconds on Windows 3.11.
        Now it takes 2 minutes.
        Granted, my PC does more. But hey, it's still just networking with another computer and the Internet.
        Shouldn't my laptop boot Excel so faaasssst that it would be instantaneous? Not just fast. How about Notepad?
        Notepad should be up in nanoseconds.
        Now add in the XML crap. Faster processors will help, but you still have bloatware.
        There has to be a better way to exchange proprietary data.
        As I said previously, EDI was compact. You could modify that standard and come up with something light years better than XML.
        Easy to read? Who cares. You shouldn't have to read it. And why would the average person need to? The only one concerned would be the systems integrator. Well, if he/she is the only one, then why not use something else?

        IMHO
        Ramien
        herb643
  • x8

    Using George's method, I made a copy of the original with eight sheets:

    -rw-r----- 1 ykidding foo 3348992 Sep 9 13:27 /tmp/200264.xls
    -rw-r----- 1 ykidding foo 1890013 Sep 9 13:32 /tmp/200264x8.sxc
    -rw-r----- 1 ykidding foo 32170496 Sep 9 13:40 /tmp/200264x8.xls

    It would appear that the XLS runs to about 17 times the size of the compressed XML (32,170,496 / 1,890,013 ≈ 17): a "bloat factor" in the other direction.

    However, George is right about the time factor: the XLS loaded in 28.7 seconds while the SXC took 97.7 seconds.
    Yagotta B. Kidding
    • Compression usually isn't the answer

      Most normal users never compress their files.

      Compression is also very slow on large files.

      If it's compressed, so much for opening it up in Notepad.

      Just because it's compressed on disk doesn't mean it's compressed in RAM. You'll severely limit the capacity of a server to handle XML processing.
      george_ou
      • Notepad

        Actually, with the new ZIP support in Explorer (assuming that is the compression used), you can open the file in Notepad without ever having to manually unzip it. If it is another compression format, you can open it in that program and then view the file in Notepad.
        Patrick Jones