Remedial XML: Using XML Schema
The trouble with DTDs
Although using a DTD makes
it easy to specify which elements are allowed, which elements are required, and
how elements should be organized for a given XML document, the main trouble
comes when you try to enforce a data type for a particular element. The DTD
specification strictly defines structure but only supports relatively weak
content type specifications: There's no way to enforce that, for example, an
element named Date must contain a valid date.
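To make that limitation concrete, here is a minimal sketch of a document with an internal DTD. The Event and Name elements are hypothetical, invented only to give Date a home; the point is that the DTD has no way to object to the text inside Date.

```xml
<?xml version="1.0"?>
<!-- Hypothetical document with an internal DTD subset. -->
<!DOCTYPE Event [
  <!ELEMENT Event (Name, Date)>
  <!ELEMENT Name (#PCDATA)>
  <!-- #PCDATA is the closest a DTD gets to a content type: any text passes. -->
  <!ELEMENT Date (#PCDATA)>
]>
<Event>
  <Name>Book fair</Name>
  <!-- Valid against the DTD even though this is not a date. -->
  <Date>sometime next week</Date>
</Event>
```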
Enter XML Schema,
which was elevated to Recommendation status by the W3C (World Wide Web Consortium, the standards
body for XML) in 2001, meaning that it's recommended for general use. If you are
interested, the official specification, along with some brief introductory
documents, can be found on the W3C Web
site. Note that other schema definitions exist, including the Japanese
standard RELAX and Microsoft's
XDR. However, XML Schema is the only one officially recognized by the W3C,
so I'll concentrate on it in this article.
XML Schema not only lets you
define the structure of an XML document but, unlike a DTD, also allows you to
constrain the content of the document. Also, an XML Schema is itself an XML
document, with a tag-based syntax that's a lot clearer than the special
characters found in a DTD.
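As a rough illustration of the difference, a schema can declare the Date element from the earlier example with the built-in date type. This is only a sketch, not part of the series' listings.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- The built-in xs:date type only accepts values such as 2001-05-02. -->
  <xs:element name="Date" type="xs:date"/>
</xs:schema>
```

A validating parser working from this declaration should reject a Date element whose content is ordinary text rather than a CCYY-MM-DD date.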
Walking through a schema
XML Schemas are built using a set of predefined XML elements
and attributes, which define the structure and content model of a document. An
elaborate set of rules (which, interestingly enough, are expressed as a DTD)
specifies the legal use of each schema element or attribute. Violate these
rules, and a parser will refuse to parse your schema and any documents
associated with it.
Let's take a look at the sample XML Schema in Listing A, which describes the same book catalog XML
document (Listing B) we've been using all along in this series.
Listing B makes one small change: The root catalog element now has two
new attributes that associate it with the catalog schema in Listing
A.
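If you don't have Listing B in front of you, the sketch below shows what such a root element typically looks like. I'm assuming the two attributes are the usual schema-instance pair (xmlns:xsi and xsi:noNamespaceSchemaLocation); the file name catalog.xsd and the book values are placeholders, not taken from the listing.

```xml
<?xml version="1.0"?>
<!-- Assumption: the two attributes are the usual schema-instance pair.
     The file name catalog.xsd is hypothetical. -->
<catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="catalog.xsd">
  <book>
    <!-- Placeholder values only. -->
    <author>Placeholder Author</author>
    <title>Placeholder Title</title>
    <genre>Reference</genre>
    <price>9.95</price>
    <publish_date>2001-05-02</publish_date>
  </book>
</catalog>
```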
Looking at the catalog schema, you'll immediately notice that it
contains the standard XML header, <?xml version = "1.0"?>,
indicating that the schema is itself an XML document. The root element in any
schema must be schema, which will have one or more attributes describing
the schema. In this case, schema has a namespace declaration attribute
(xmlns:xs) that binds the prefix xs to the XML Schema namespace; that prefix
then qualifies every schema element in the document.
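Stripped of everything else, the outer shell of a schema that uses the xs prefix looks something like the sketch below. The namespace URI is the one assigned to XML Schema by the W3C; the rest is just a skeleton.

```xml
<?xml version="1.0"?>
<!-- xmlns:xs binds the prefix xs to the XML Schema namespace. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- declarations such as xs:element and xs:complexType go here -->
</xs:schema>
```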
The next element in our sample schema is the annotation element, which is
used to represent some kind of documentation about its parent element.
An annotation may contain documentation elements, appinfo elements, or
both. The former hold human-readable documentation, whereas the latter hold
information meant for an application to process.
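Here is a small sketch of an annotation carrying both kinds of child. The generator element inside appinfo is hypothetical; XML Schema does not define what goes in there, so an application is free to look for whatever markup it understands.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:annotation>
    <!-- For people reading the schema. -->
    <xs:documentation>Describes a catalog of books.</xs:documentation>
    <xs:appinfo>
      <!-- Hypothetical hint for a tool; not defined by XML Schema. -->
      <generator name="example-tool"/>
    </xs:appinfo>
  </xs:annotation>
</xs:schema>
```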
Next, we define the two
main elements (the root element catalog and its child element
book) used in the book catalog document using two element
elements. Both of these elements contain attributes that define the name and
allowed contents of each element. In this case, the catalog element is
defined to be of type catalogtype, and the book element is defined
to be of type booktype; both types are defined later in the schema
document.
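In outline, the two declarations might look like the sketch below, using the type names catalogtype and booktype that the rest of this article describes. The two type definitions here are deliberately cut-down stubs, included only so the fragment stands on its own.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Global element declarations: a name plus the type governing the content. -->
  <xs:element name="catalog" type="catalogtype"/>
  <xs:element name="book" type="booktype"/>

  <!-- Stub type definitions so the fragment stands alone. -->
  <xs:complexType name="catalogtype">
    <xs:sequence>
      <xs:element ref="book" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="booktype">
    <xs:sequence>
      <xs:element name="title" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```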
Which type are you?
As I've
said, XML Schema lets you declare an element in an XML document to be of a
particular type, allowing the parser to validate the content of a document as
well as its structure. XML Schema defines two main families of data type:
predefined simple types and complex types. The distinction between these two
data type groups boils down to the fact that complex types can contain other
elements and attributes as well as data, whereas simple types can contain only data. Simple
types give XML Schemas their low-level type validation abilities, allowing you
to define an element as any one of the types found in Figure A.
Figure A
Simple type | Definition |
string | String data. |
boolean | A logical true or false value, written as true, false, 1, or 0. |
date | A calendar date in the format CCYY-MM-DD. |
dateTime | A calendar date and time. |
time | A time in 24-hour format with an optional Coordinated Universal Time adjustment to indicate time zone. |
decimal | A number with an arbitrary precision and number of decimal places. |
integer | A subset of decimal that represents any integer numeric value. |
float | A standard representation of a 32-bit floating point numeric value. |
It's possible to define your own simple types as well. For an in-depth and technical discussion of the various XML Schema data types, see "XML Schema Part 2: Datatypes" on the W3C Web site.
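One common way to define your own simple type is to restrict a built-in one. The sketch below is an assumption-laden example: the name pricetype is hypothetical, and it narrows decimal to non-negative values with at most two decimal places.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical type: a non-negative decimal with at most two decimal places. -->
  <xs:simpleType name="pricetype">
    <xs:restriction base="xs:decimal">
      <xs:minInclusive value="0"/>
      <xs:fractionDigits value="2"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- An element declared with the custom type. -->
  <xs:element name="price" type="pricetype"/>
</xs:schema>
```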
Complex types are defined by a complexType element. Unless it is nested inside an element element (a variation covered later in this article), a complexType will usually have a name attribute that other element declarations use to refer to the type. A complex type typically contains one content definition element that specifies what pattern of content the type may hold. Some of the available content patterns are shown in Figure B.
Figure B
Content element | Definition |
sequence | All elements defined within must appear in the order listed, subject to modification by attributes like minOccurs and maxOccurs. |
choice | One and only one of the elements defined within must appear. |
all | The elements defined within may appear in any order, each no more than once. |
simpleContent | The complex type may contain only data, with no nested elements. May extend a previously defined simple type by containing an extension element. |
complexContent | The complex type may contain other elements. May extend a previously defined complex type by containing an extension element. |
attribute | Declares a named attribute that elements of the complex type may carry. It accompanies the content definition rather than replacing it. |
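Two of these patterns don't show up in the catalog schema, so here is a brief sketch of each. The type names (idtype, moneytype) and the element and attribute names inside them are hypothetical, chosen only for illustration.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- choice: exactly one of isbn or sku must appear. -->
  <xs:complexType name="idtype">
    <xs:choice>
      <xs:element name="isbn" type="xs:string"/>
      <xs:element name="sku" type="xs:string"/>
    </xs:choice>
  </xs:complexType>

  <!-- simpleContent: decimal text plus a currency attribute, no child elements. -->
  <xs:complexType name="moneytype">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="xs:string" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:schema>
```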
The first complexType element in our sample schema defines the booktype type, which, as you can see from its documentation element, models a single book in a catalog. The booktype type contains a sequence element, which tells the parser that elements of that complex type must contain all the elements declared inside the sequence, in exactly that order. In the case of booktype, the elements author, title, genre, price, and publish_date must all appear in any booktype element.
What about description? It's listed in the sequence, so isn't it required? No, it isn't. The description element has an attribute named minOccurs, which defines the minimum number of times an element may appear as part of a complex type. In this case, minOccurs is zero, so description is an optional element.
Something similar is going on with the author element. It has a maxOccurs attribute with a value of unbounded, meaning that any number of author elements may appear in the sequence: a book may have more than one author but will always have at least one, because minOccurs defaults to 1. An element with neither a minOccurs nor a maxOccurs attribute must appear once and only once in a sequence, so the remaining elements in the booktype sequence (title, genre, price, and publish_date) are required and may appear only once.
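Putting the last few paragraphs together, a booktype definition along these lines would behave as described. Treat it as a sketch rather than a copy of Listing A: the element names and occurrence attributes come from the text above, but the simple types I've assigned to each child are assumptions.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="booktype">
    <xs:sequence>
      <!-- One or more authors: minOccurs defaults to 1. -->
      <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
      <!-- No occurrence attributes: exactly once each. -->
      <xs:element name="title" type="xs:string"/>
      <xs:element name="genre" type="xs:string"/>
      <xs:element name="price" type="xs:decimal"/>
      <xs:element name="publish_date" type="xs:date"/>
      <!-- Optional: may be omitted entirely. -->
      <xs:element name="description" type="xs:string" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```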
The second and final complex type defined in our sample catalog schema is the catalogtype complex type. It, too, is a sequence containing one or more book elements, as you can see from the unbounded maxOccurs attribute.
Is that really the way it's done?
Depending on your background, the structure of the sample schema I've used in this article will either seem perfectly natural or scare the bejabbers out of you. For the record, it's also legal to define the catalog schema without the formal complex type declarations for the book and catalog elements, as it appears in Listing C. Note that the complexType elements in Listing C are nested inside element elements, and the element child of catalog's sequence element has a ref attribute to tell the parser that it is a reference to the previously defined book element.
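For readers without Listing C to hand, the nested style looks roughly like the sketch below. I've abbreviated the book content to two children to keep it short, and the simple types are assumptions.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <!-- Anonymous type: the complexType has no name attribute. -->
    <xs:complexType>
      <xs:sequence>
        <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="catalog">
    <xs:complexType>
      <xs:sequence>
        <!-- ref points at the global book declaration above. -->
        <xs:element ref="book" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```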
I can hear thousands of you now, asking: "Well, if that's the case, then why did you do things the other, longer way?" To illustrate an important point of XML Schema: It's extensible. By defining a type formally, you can reuse that type in multiple documents and even extend it in a different schema, much as you'd reuse or extend an abstract data type or object in an application.
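As a sketch of what extending a named type can look like, the example below derives a hypothetical audiobooktype from a shortened booktype. The derived type and its narrator element are my own inventions for illustration.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Shortened base type, repeated here so the example stands alone. -->
  <xs:complexType name="booktype">
    <xs:sequence>
      <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
      <xs:element name="title" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Hypothetical derived type: everything in booktype, followed by a narrator. -->
  <xs:complexType name="audiobooktype">
    <xs:complexContent>
      <xs:extension base="booktype">
        <xs:sequence>
          <xs:element name="narrator" type="xs:string"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```

When the base type lives in a separate schema file, an include element naming that file is the usual way to make the type available to the extending schema.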
The tools of the trade
By now, you should realize that XML Schema's syntax is deceptively simple. Although it's certainly possible to create a schema by hand in a simple text editor (and I'm speaking from experience here), doing so can be a maddening experience. Better to use one of the XML tools that have cropped up recently and offer graphical interfaces for creating XML Schemas. XML Spy and Cape Clear Studio are both full XML IDEs that feature XML Schema builders, and dtd2xs is a DTD-to-XML Schema conversion utility that's available as both a standalone application and a Java class. As is the case with DTDs, a large number of standardized XML Schema definitions are available, which you can adapt for use in your applications, should you feel the need.
Conclusion
XML Schemas, with their ability to enforce document content as well as structure, are an important and powerful new standard in the XML world. In this article, I have barely scratched the surface of the possibilities, but I hope I’ve given you a good grounding in the basics so that you can explore on your own. Until next time, when we'll explore the world of the DOM parser, keep your volleyballs clean and your documents well formed.