
Remedial XML: Using XML Schema

One way of enforcing structural requirements for an XML document is by using a document type definition (DTD). In this article, we'll briefly touch on the shortcomings of DTDs and discuss the basics of a newer, more powerful standard: XML Schemas.
Written by Lamont Adams, Contributor
If you've been following this series on Builder.com, you already know that XML describes the structure of data and makes no assumptions about what the data it describes actually is or whether that structure is correct. One way of enforcing structural requirements for an XML document is by using a document type definition (DTD). That was the subject of the previous article in this series. In this article, we'll briefly touch on the shortcomings of DTDs and discuss the basics of a newer, more powerful standard: XML Schemas.

The trouble with DTDs
Although using a DTD makes it easy to specify which elements are allowed, which elements are required, and how elements should be organized for a given XML document, the main trouble comes when you try to enforce a data type for a particular element. The DTD specification strictly defines structure but only supports relatively weak content type specifications: There's no way to enforce that, for example, an element named Date must contain a valid date.
Enter XML Schema, which was elevated to Recommendation status by the W3C (World Wide Web Consortium, the standards body for XML) in 2001, meaning that it's recommended for general use. If you are interested, the official specification, along with some brief introductory documents, can be found on the W3C Web site. Note that other schema definitions exist, including the Japanese standard RELAX and Microsoft's XDR. However, XML Schema is the only one officially recognized by the W3C, so I'll concentrate on it in this article.
XML Schema not only lets you define the structure of an XML document but, unlike a DTD, also allows you to constrain the content of the document. Also, an XML Schema is itself an XML document, with a tag-based syntax that's a lot clearer than the special characters found in a DTD.
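To make the contrast concrete, here is a minimal sketch of a schema that constrains the Date element from the earlier example. The declaration is illustrative only and is not taken from this article's listings.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- The content must be a valid calendar date such as 2001-05-02.
       A DTD could only declare it as <!ELEMENT Date (#PCDATA)>,
       which accepts any character data. -->
  <xs:element name="Date" type="xs:date"/>
</xs:schema>
```

A validating parser checking a document against this schema would reject a Date element containing, say, the text "next Tuesday".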
Walking through a schema
XML Schemas are built using a set of predefined XML elements and attributes, which define the structure and content model of a document. An elaborate set of rules (which, interestingly enough, are expressed as a DTD) specifies the legal use of each schema element or attribute. Violate these rules, and a parser will refuse to parse your schema and any documents associated with it.
Let's take a look at the sample XML Schema in Listing A, which describes the same book catalog XML document (Listing B) we've been using all along in this series. Listing B makes one small change: The root catalog element now has two new attributes that associate it with the catalog schema in Listing A.
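For readers without the listings at hand, a catalog document tied to a schema might look roughly like the sketch below. The two attributes shown are the standard XML Schema instance attributes for a schema without a target namespace. The file name catalog.xsd and the book data are assumptions made up for illustration.

```xml
<?xml version="1.0"?>
<!-- xmlns:xsi declares the schema instance namespace;
     xsi:noNamespaceSchemaLocation points at the schema file
     (catalog.xsd is a hypothetical name) -->
<catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="catalog.xsd">
  <book>
    <author>Example Author</author>
    <title>Example Title</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An optional description of the book.</description>
  </book>
</catalog>
```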
Looking at the catalog schema, you'll immediately notice that it begins with the standard XML declaration, <?xml version = "1.0"?>, indicating that the schema is itself an XML document. The root element in any schema must be schema, which will have one or more attributes describing the schema. In this case, schema has a namespace declaration attribute (xmlns:xs) that binds the prefix xs to the XML Schema namespace; that prefix then qualifies every schema element in the document.

The next element in our sample schema is the annotation element, which holds documentation about its parent element. Annotation may contain either or both of two child elements: documentation and appinfo. The former is used for human-readable documentation, whereas the latter is meant to hold information intended for an application that processes the schema.
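As a rough sketch, an annotation at the top of a schema could look like this. The wording inside documentation and appinfo is invented for illustration and is not copied from Listing A.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <!-- Text meant for people reading the schema -->
    <xs:documentation>Schema for a simple book catalog.</xs:documentation>
    <!-- Text meant for software that processes the schema -->
    <xs:appinfo>Hypothetical hint for a catalog-loading tool.</xs:appinfo>
  </xs:annotation>
</xs:schema>
```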
Next, we define the two main elements (the root element catalog and its child element book) used in the book catalog document using two element elements. Both of these elements contain attributes that define the name and allowed contents of each element. In this case, the catalog element is defined to be of type catalogtype, and the book element is defined to be of type booktype; both types are defined later in the schema document.
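In schema syntax, the two declarations just described would plausibly read as follows. This is a fragment that belongs inside the schema root element, and it assumes the xs prefix and the type names catalogtype and booktype used in this article.

```xml
<!-- Fragment: goes inside <xs:schema>; both types are defined elsewhere -->
<xs:element name="catalog" type="catalogtype"/>
<xs:element name="book" type="booktype"/>
```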
Which type are you?
As I've said, XML Schema lets you declare an element in an XML document to be of a particular type, allowing the parser to validate the content of a document as well as its structure. XML Schema defines two main families of data type: predefined simple types and complex types. The distinction between these two data type groups boils down to the fact that complex types can contain other elements as well as data, whereas simple types can only contain data. Simple types give XML Schemas their low-level type validation abilities, allowing you to define an element as any one of the types found in Figure A.

Figure A: XML Schema predefined simple types

Simple type    Definition
string         String data.
boolean        Binary True or False.
date           A calendar date in the format CCYY-MM-DD.
dateTime       A calendar date and time.
time           A time in 24-hour format with an optional Coordinated Universal Time adjustment to indicate time zone.
decimal        A number with an arbitrary precision and number of decimal places.
integer        A subset of decimal that represents any integer numeric value.
float          A standard representation of a 32-bit floating point numeric value.

It's possible to define your own simple types as well. For an in-depth and technical discussion of the various XML Schema data types, see "XML Schema Part 2: DataTypes" on the W3C Web site.
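One common way to define your own simple type is to restrict a predefined one. The sketch below assumes a hypothetical genretype that limits a string to a fixed list of values; neither the type name nor the values come from this article's listings.

```xml
<!-- Fragment: goes inside <xs:schema>. genretype is a made-up name. -->
<xs:simpleType name="genretype">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Computer"/>
    <xs:enumeration value="Fantasy"/>
    <xs:enumeration value="Romance"/>
  </xs:restriction>
</xs:simpleType>
```

An element declared with type="genretype" would then be valid only if its content matched one of the listed values.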
Complex types are defined by a complexType element, which will usually have at least a name attribute that can be used to refer to the type when declaring other elements, unless it occurs inside an element element (see the next section). All complex types will contain one content definition element that basically defines what pattern of content the type is able to contain. Some of the available content patterns are shown in Figure B.

Figure B: Some of the content patterns allowed in XML Schema complex types

Content element    Definition
sequence           All elements defined within must appear in the order listed, subject to modification by attributes like minOccurs and maxOccurs.
choice             One and only one of the elements defined within must appear.
all                The elements defined within may appear in any order, each at most once.
simpleContent      The complex type may contain only data, with no nested elements. May extend a previously defined simple type by containing an extension element.
complexContent     The complex type may contain other elements. May extend a previously defined complex type by containing an extension element.
attribute          Declares a named attribute that elements of the complex type may carry.

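As an example of two of these patterns working together, the sketch below uses simpleContent and attribute to describe an element that holds a decimal value and carries one attribute. The type name pricetype and the currency attribute are hypothetical and do not appear in the sample schema.

```xml
<!-- Fragment: goes inside <xs:schema>. Describes content such as
     <price currency="USD">44.95</price> -->
<xs:complexType name="pricetype">
  <xs:simpleContent>
    <xs:extension base="xs:decimal">
      <xs:attribute name="currency" type="xs:string"/>
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>
```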
The first complexType element in our sample schema defines the booktype type, which, as you can see from the documentation comment element, models a single book in a catalog. Booktype contains a sequence element, which tells the parser that elements of that complex type must contain all the elements appearing inside the sequence tag in that exact order. In the case of booktype, the elements author, title, genre, price, and publish_date must all appear in any booktype element.
What about description? It's listed in the sequence, so isn't it required? No, it isn't. The description element has an attribute named minOccurs, which defines the minimum number of times an element may appear as part of a complex type. In this case, minOccurs is zero, so description is an optional element.
Something similar is going on with the author element. It has a maxOccurs attribute with a value of unbounded, meaning that any number of author elements may appear in the sequence: a book may have more than one author but will always have at least one, because minOccurs defaults to 1. An element with neither a minOccurs nor a maxOccurs attribute must appear once and only once in a sequence, so all the other elements in the booktype sequence are required and may appear only once.
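Putting the last three paragraphs together, a booktype definition along these lines would behave as described. The element names and occurrence attributes follow the text above; the simple types assigned to each element (string, decimal, date) are my assumptions, since they are not stated here.

```xml
<!-- Fragment: goes inside <xs:schema> -->
<xs:complexType name="booktype">
  <xs:annotation>
    <xs:documentation>Models a single book in a catalog.</xs:documentation>
  </xs:annotation>
  <xs:sequence>
    <!-- One or more authors: minOccurs defaults to 1 -->
    <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
    <!-- Exactly once each: both occurrence attributes default to 1 -->
    <xs:element name="title" type="xs:string"/>
    <xs:element name="genre" type="xs:string"/>
    <xs:element name="price" type="xs:decimal"/>
    <xs:element name="publish_date" type="xs:date"/>
    <!-- Optional: may be omitted entirely -->
    <xs:element name="description" type="xs:string" minOccurs="0"/>
  </xs:sequence>
</xs:complexType>
```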
The second and final complex type defined in our sample catalog schema is the catalogtype complex type. It, too, is a sequence containing one or more book elements, as you can see from the unbounded maxOccurs attribute.
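A catalogtype matching that description could be sketched as follows, again as a fragment that assumes the xs prefix and the booktype name used in this article.

```xml
<!-- Fragment: goes inside <xs:schema> -->
<xs:complexType name="catalogtype">
  <xs:sequence>
    <!-- At least one book, with no upper limit -->
    <xs:element name="book" type="booktype" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
```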
Is that really the way it's done?
Depending on your background, the structure of the sample schema I've used in this article will either seem perfectly natural or scare the bejabbers out of you. For the record, it's also legal to define the catalog schema without the formal complex type declarations for the book and catalog elements, as it appears in Listing C. Note that the complexType elements in Listing C are nested inside element elements, and the element child of catalog's sequence element has a ref attribute to tell the parser that it is a reference to the previously defined book element.
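In outline, the nested style looks something like the sketch below. It is a reconstruction based on the description above, not a copy of Listing C, and the simple types chosen for the child elements are assumptions.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- book carries its own unnamed complex type -->
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="genre" type="xs:string"/>
        <xs:element name="price" type="xs:decimal"/>
        <xs:element name="publish_date" type="xs:date"/>
        <xs:element name="description" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <!-- catalog refers back to the book element declared above -->
  <xs:element name="catalog">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```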
I can hear thousands of you now, asking: "Well, if that's the case, then why did you do things the other, longer way?" To illustrate an important point of XML Schema: It's extensible. By defining a type formally, you can reuse that type in multiple documents and even extend it in a different schema, much as you'd reuse or extend an abstract data type or object in an application.
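To give a feel for that extensibility, here is a minimal sketch of a new type derived from booktype. The names ebooktype and file_format are hypothetical, and the fragment assumes booktype is already available in the schema where it appears.

```xml
<!-- Fragment: goes inside <xs:schema>. ebooktype and file_format are
     made-up names used only for illustration. -->
<xs:complexType name="ebooktype">
  <xs:complexContent>
    <xs:extension base="booktype">
      <xs:sequence>
        <!-- Appended after all the elements inherited from booktype -->
        <xs:element name="file_format" type="xs:string"/>
      </xs:sequence>
    </xs:extension>
  </xs:complexContent>
</xs:complexType>
```

An element of type ebooktype would contain everything a booktype element does, followed by the extra file_format element.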
The tools of the trade
By now, you should realize that XML Schema's syntax is deceptively simple. Although it's certainly possible to create a schema by hand in a simple text editor (and I'm speaking from experience here), doing so can be a maddening experience. Better to use one of the several XML tools that have cropped up recently that offer graphical systems of creating XML Schemas. XML Spy and Cape Clear Studio are both full XML IDEs that feature XML Schema builders, and dtd2xs is a DTD-to-XML Schema conversion utility that's available as both a standalone application and a Java class. As is the case with DTDs, a large number of standardized XML Schema definitions are available, which you can adapt for use in your applications, should you feel the need.
Conclusion
XML Schemas, with their ability to enforce document content as well as structure, are an important and powerful new standard in the XML world. In this article, I have barely scratched the surface of the possibilities, but I hope I’ve given you a good grounding in the basics so that you can explore on your own. Until next time, when we'll explore the world of the DOM parser, keep your volleyballs clean and your documents well formed.



