Proof that XML is extremely bloated

When I wrote this blog post blasting the massive bulk of XML, our own John Carroll responded with a defense of XML for some situations.  John and I both agree that XML should not be used to handle large amounts of data because of the increased storage and processing requirements; we only disagree about where the cut-off should be.  Some readers noted that XML word processing documents were actually smaller than Microsoft Word documents, and this is true because the ratio of XML tags to paragraph text is small.  To clarify my position, I was complaining about bloated XML spreadsheets and databases rather than word processing documents.

I quoted 1000% bloat in reference to spreadsheets and databases, but one of our regular readers, "Yagotta B. Kidding", pressed me to present some hard evidence.  Reader Patrick Jones responded by offering a spreadsheet that was stored as an 11 Megabyte XML file and as a 3 Megabyte XLS file.  Jones noted that the XML file was very compressible: it shrank to 194 Kilobytes, to be exact.  In that sense, one could argue that XML files can actually be smaller when ZIP is used, but I should warn that this may not be the best example because of the amount of redundant data in the sample that Jones provided.  We also have to take into account that compression takes additional resources and that we are essentially left with a binary file.  If you want to run your own experiments, I've zipped up a copy of this sample XML file and placed a copy of it here.
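For readers who want to see why redundant XML compresses so well, here is a minimal Python sketch. It does not use Jones's file: the `build_sample_xml` helper and its tag names are hypothetical, and the synthetic data is far more repetitive than a real spreadsheet, so the ratio it prints says nothing about typical files.

```python
import zlib


def build_sample_xml(rows: int, cols: int) -> bytes:
    """Build a hypothetical, highly repetitive XML spreadsheet as bytes."""
    lines = ["<Table>"]
    for r in range(rows):
        # Every cell repeats the same tags; only a single digit varies.
        cells = "".join(
            f'<Cell><Data Type="Number">{(r * cols + c) % 10}</Data></Cell>'
            for c in range(cols)
        )
        lines.append(f"<Row>{cells}</Row>")
    lines.append("</Table>")
    return "\n".join(lines).encode("utf-8")


raw = build_sample_xml(rows=1000, cols=10)
packed = zlib.compress(raw, 9)  # 9 = slowest, strongest compression level
ratio = len(raw) / len(packed)

print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes, ratio: {ratio:.1f}x")
```

The compressed bytes still have to be inflated back to the full XML text before anything can parse them, which is the extra processing cost mentioned above.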

When I ran my tests, I noticed that the time it takes to open the 11 MB XML file was substantially longer than the time it took to open the 3 MB XLS file in Microsoft Excel's native binary format.  To get a more accurate measurement of the time difference, I decided to make the sample larger by duplicating the entire sheet within the sample XML file.  I did this by right-clicking on the bottom tab to copy the entire sheet.  Once I had two sheets, I highlighted both and copied them to make 4, then 8, and then 16 sheets.  By the time I got to 8 sheets, my laptop was getting a pretty good workout, and getting to 16 sheets almost locked up my laptop because it was beginning to run out of memory.  After the 4th duplication, I had 16 sheets that took up 193 Megabytes on disk.  It took 45 seconds to save this XML file to disk and 46 seconds to open it.  I then saved the file in Microsoft's native XLS format, which took a mere 7 seconds, and opening that file was even faster at 2 seconds.  Compressing the XML file with IZArc took an additional 26 seconds and uncompressing it took another 19 seconds.  Here is a breakdown of this simple little experiment.

File                 Size          Read      Write
Native XLS binary    50,995 KB     2 sec     7 sec
XML spreadsheet      192,892 KB    46 sec    45 sec

Some of you will note that this particular XML file is only 3.78 times bigger than the XLS file, but this particular sample doesn't have that many fields, which reduces the number of XML tags.  In this second sample, the XML version is 10.6 times bigger than the CSV file and 7.7 times bigger than the XLS version.  So the 1000% bloat figure I originally quoted might not always hold, but it is indeed possible.  Using compression would solve the storage and transmission problems, but it worsens the processing and memory requirements of using XML.
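The effect of tag overhead is easy to reproduce on a small scale. The following Python sketch writes the same made-up rows once as CSV and once as XML, then compares the byte counts. The element names (`Table`, `Row`, `Cell`, `Data`) are assumptions chosen for illustration and are not the exact schema Excel writes, so the ratio will differ from the figures quoted above.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_csv(rows):
    """Serialize rows as CSV and return the encoded bytes."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")


def rows_to_xml(rows):
    """Serialize rows as XML, wrapping every value in its own tags."""
    table = ET.Element("Table")
    for row in rows:
        row_el = ET.SubElement(table, "Row")
        for value in row:
            cell = ET.SubElement(row_el, "Cell")
            data = ET.SubElement(cell, "Data", {"Type": "Number"})
            data.text = str(value)
    return ET.tostring(table, encoding="utf-8")


# Hypothetical data: 1000 rows of 10 small integers each.
rows = [[r * 10 + c for c in range(10)] for r in range(1000)]
csv_bytes = rows_to_csv(rows)
xml_bytes = rows_to_xml(rows)

print(f"CSV: {len(csv_bytes)} bytes")
print(f"XML: {len(xml_bytes)} bytes")
print(f"XML is {len(xml_bytes) / len(csv_bytes):.1f} times bigger")
```

Because each value carries the same fixed cost in tags, short values such as small numbers inflate the most, while long text cells inflate much less. This is consistent with the word processing observation earlier.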

The bottom line is that the large sample XML file was excruciatingly slow: it took more than 20 times longer to read and more than 6 times longer to write.  It doesn't matter that computers are faster today; these are huge multipliers that greatly reduce the speed and capacity of any system, no matter how fast it is.  In my business, where I'm responsible for server and network architecture, these are huge concerns of mine.  People tend to forget that the purpose of better hardware is to get better performance and capacity, not to merely maintain our current performance and capacity in the face of bloated software.
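The multipliers quoted above follow directly from the measurements in the table. As a quick check, this short Python snippet recomputes them; the variable names are just labels for the table values.

```python
# Measurements copied from the table above.
xls_kb, xml_kb = 50_995, 192_892
xls_read_s, xml_read_s = 2, 46
xls_write_s, xml_write_s = 7, 45

size_ratio = xml_kb / xls_kb
read_ratio = xml_read_s / xls_read_s
write_ratio = xml_write_s / xls_write_s

print(f"size: {size_ratio:.2f}x, read: {read_ratio:.1f}x, write: {write_ratio:.1f}x")
# prints: size: 3.78x, read: 23.0x, write: 6.4x
```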