Update your XML knowledge -- Part 4

Where do you store XML documents? Read on in our final part of the XML series ...
Written by Richard Goh, Contributor
Repositories
Since its release as a standard, Extensible Markup Language (XML) has gained popularity and has begun to be the format of choice for a variety of data types, especially documents. Because the text within a document can be tagged, XML makes searching simpler and more dynamic. Moreover, as XML is independent of presentation, it allows greater reuse of material: the same content can be converted into press releases, white papers, brochures and Web pages. XML has also been used in enterprise systems as the common language spoken between systems.

With such a wide variety of uses for manipulating data online and offline, a problem looms over the management and storage of these XML documents. One possible solution is to use databases to store, retrieve and manipulate XML documents. The idea is to place these documents in an environment where searching, analysis, updating and output can be done in a more manageable, systematic and well-defined manner. The database is a natural choice, as it is an environment that not only offers these features but is also well understood by the generations of programmers who have worked with databases. The challenge, however, lies in answering whether what databases offered previously is enough for XML documents.

There are some questions that need to be asked when thinking about XML and databases. The most important is, of course, why you would want to use a database in the first place. This matters because the decision to use a database could mean more complexity in your applications and, naturally, more cost. Is it enough to use a simple file system to store your XML documents? Do you really require complicated search facilities? All these are important questions. Once you have determined that you need to store XML in a database, the next step is to look at the types of XML documents to be stored. This determines the type of database your application should use.

Broadly, XML documents can be categorized into two types: data-centric XML documents and document-centric XML documents. Data-centric XML documents are generally designed for machine readability. The data they represent is fairly regularly structured and fine-grained, and has little or no mixed content. Document-centric XML documents, on the other hand, are designed for presentation to humans. They are characterized by less regular or irregular structure, larger-grained data and lots of mixed content. It is sometimes difficult to determine which type of XML documents you have, as most documents fall somewhere between the two categories. However, separating the two types will help in determining the type of XML repository or storage to choose, as we shall see later.
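One rough way to tell the two categories apart is to check for mixed content, that is, elements that contain both text and child elements. The following is a minimal sketch using Python's standard library; the function name `has_mixed_content` and both sample documents are made up for illustration, and a real classification would also weigh regularity of structure and granularity.

```python
import xml.etree.ElementTree as ET

def has_mixed_content(elem):
    """Return True if any element holds both child elements and real text."""
    for node in elem.iter():
        children = list(node)
        if not children:
            continue  # leaf elements cannot have mixed content
        # Text before the first child lives in node.text; text after
        # each child lives in that child's tail.
        own_text = (node.text or "").strip()
        tail_text = any((child.tail or "").strip() for child in children)
        if own_text or tail_text:
            return True
    return False

# Hypothetical data-centric sample: regular, fine-grained, no mixed content.
order_xml = "<order><id>1001</id><item><sku>A-7</sku><qty>2</qty></item></order>"
# Hypothetical document-centric sample: prose with inline markup.
para_xml = "<p>XML makes <b>searching</b> simpler and more <i>dynamic</i>.</p>"

print(has_mixed_content(ET.fromstring(order_xml)))  # False
print(has_mixed_content(ET.fromstring(para_xml)))   # True
```

A document that mixes both styles would still need a judgment call; this check is only a first hint.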

Generally, there are two types of XML databases: XML-enabled databases and native XML databases. XML-enabled databases are fundamentally RDBMSs, ODBMSs or other databases that provide an interface to convert data from the underlying format into XML. Native XML databases are built to store data internally as XML itself. We shall soon see how each can be used.
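To make the XML-enabled idea concrete, here is a small sketch of the kind of conversion such an interface performs: rows held in a relational table are published as XML. It uses Python's built-in `sqlite3` module as a stand-in database; the `products` table, the element names and the helper `rows_to_xml` are assumptions for illustration, not the interface of any particular product.

```python
import sqlite3
import xml.etree.ElementTree as ET

def rows_to_xml(conn, table, root_tag, row_tag):
    """Publish every row of a table as XML, one child element per column."""
    # The table name is interpolated directly, so it must come from
    # trusted code, never from user input.
    cur = conn.execute(f"SELECT * FROM {table}")
    columns = [desc[0] for desc in cur.description]
    root = ET.Element(root_tag)
    for row in cur:
        row_elem = ET.SubElement(root, row_tag)
        for name, value in zip(columns, row):
            ET.SubElement(row_elem, name).text = str(value)
    return ET.tostring(root, encoding="unicode")

# Hypothetical table standing in for existing relational data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES ('A-7', 9.5)")

products_xml = rows_to_xml(conn, "products", "products", "product")
print(products_xml)
```

The data stays relational underneath; XML is only the exchange format at the boundary.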

When would you use each of these databases for the XML documents you have? Suppose you have an e-commerce application that uses XML as a data transport. Most likely the data being transported is structured with high regularity. The data within these XML documents is probably also used by non-XML applications. Things like entities and the encoding used by the XML probably don't matter to you. This is a classic case of data-centric XML documents, and an XML-enabled database is the better choice. The XML-enabled database converts the data into the format in which it is stored (usually relational tables). This data can then be used by existing applications that share the same database, and you incur less overhead when dealing with the data.
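The data-centric case can be sketched as follows: an incoming XML order is broken up into rows of a relational table, after which ordinary SQL (and any non-XML application) can work with it. This is an illustration only, again using `sqlite3` as a stand-in; the order layout, the `order_items` table and the helper `shred_order` are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical order message as it might arrive over the wire.
order_xml = """
<order id="1001">
  <item><sku>A-7</sku><qty>2</qty></item>
  <item><sku>B-3</sku><qty>1</qty></item>
</order>
"""

def shred_order(conn, xml_text):
    """Store each <item> of an order as one relational row."""
    order = ET.fromstring(xml_text)
    order_id = int(order.get("id"))
    rows = [
        (order_id, item.findtext("sku"), int(item.findtext("qty")))
        for item in order.findall("item")
    ]
    conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
inserted = shred_order(conn, order_xml)

# Existing applications can now query the data with plain SQL.
total_qty = conn.execute(
    "SELECT SUM(qty) FROM order_items WHERE order_id = 1001"
).fetchone()[0]
print(inserted, total_qty)  # 2 3
```

Note what is lost along the way: entities, encoding and the original document layout do not survive the conversion, which is acceptable precisely because they do not matter for this kind of data.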

On the other hand, suppose you have a website built from a number of XML documents. Not only do you want to manage the content of the site, you would also like to offer search facilities. The XML documents you have most likely have irregular structures and contain large amounts of mixed content. This is a case of document-centric XML documents. Links to other entities become important, and you would like to query for whole documents or for fragments of them. A native XML database would be the better choice for the job.
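The kind of fragment query described here can be imitated in a few lines. The sketch below scans a handful of in-memory documents and returns matching fragments with their inline markup intact. It only illustrates the query style; it is not how a native XML database is implemented, and the page contents and the helper `find_fragments` are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical site content, keyed by document name.
pages = {
    "about.xml": "<page><title>About</title><body>"
                 "<p>We build <em>XML</em> tools.</p></body></page>",
    "news.xml": "<page><title>News</title><body><p>New release.</p>"
                "<p>See the <em>XML</em> series.</p></body></page>",
}

def find_fragments(documents, path, keyword):
    """Return (document name, fragment) pairs whose text contains keyword."""
    hits = []
    for name, text in documents.items():
        root = ET.fromstring(text)
        for elem in root.findall(path):
            # itertext() walks through mixed content, so text inside
            # inline elements such as <em> is searched as well.
            full_text = "".join(elem.itertext())
            if keyword in full_text:
                hits.append((name, ET.tostring(elem, encoding="unicode")))
    return hits

hits = find_fragments(pages, ".//p", "XML")
for name, fragment in hits:
    print(name, fragment)
```

Because the fragments come back as XML, the inline `<em>` markup is preserved, which is exactly what would be lost if the same paragraphs were flattened into relational columns.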

Most relational database vendors have begun to support queries in XML format and the insertion of XML documents into their relational structures. Vendors such as Oracle and Microsoft have begun supporting XML. These databases can thus be classified as XML-enabled.

Much interest has also been directed at native XML databases. One of the first such databases is Tamino from Software AG. Many other companies have started marketing their own flavors of native XML databases, including dbXML, eXcelon and X-Hive/DB. No standard yet exists for querying these databases, although the W3C has been working on standardizing XML queries. Until such standards are final and widely adopted, both XML-enabled and native XML databases implement their own query syntax.

As more and more applications begin to use XML as the lingua franca for various purposes, the need for XML document management becomes ever more important. Databases that support XML will become an integral part of these applications, and XML database management systems will rise in importance.
