Remedial XML: Enforcing document formats with DTDs

To enforce structure requirements for an XML document, you have to turn to one of XML's attendant technologies, the document type definition (DTD). And that's the subject of this article. I'll be referring to the same sample XML document as before, included for your reference in Listing A.
DTDs defined
DTDs are the oldest and simplest way of specifying the format for an XML
document. Like most things XML, the official definition of a DTD is maintained
by the World Wide Web Consortium, or W3C, and you can read about it as part of
W3C's current XML specification.
DTDs work by specifying the allowed tags, elements, and attributes, the number
of occurrences of each element, and the order in which elements may appear in
other elements for a given XML document. A DTD gives you a limited form of data
type enforcement, as well—in the sense that you can specify whether an element
should contain other elements or data or remain empty.
An XML document declares that it has an associated DTD by including a
!DOCTYPE declaration in the document prolog. If the SYSTEM keyword appears, as
in Listing A, the declaration gives the location of the DTD as a filename or
URL. If the PUBLIC keyword appears, it gives a public identifier for a
well-known DTD, followed by a URL where the DTD file can be found. The DTD may
also be included in the document itself as part of the !DOCTYPE declaration.
Such a document can be declared "standalone," which simply means that it
doesn't depend on any external declarations.
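As a rough sketch, the three forms of the !DOCTYPE declaration might look like the following. The file name catalog.dtd, the public identifier, and the example.com URL are placeholders invented for illustration, not values taken from Listing A. Only one !DOCTYPE declaration may appear in a real document, so treat these as alternatives.

```xml
<!-- Alternative 1: external DTD located by a system identifier
     (catalog.dtd is a hypothetical file name) -->
<!DOCTYPE catalog SYSTEM "catalog.dtd">

<!-- Alternative 2: external DTD named by a public identifier, followed
     by a system identifier (both values are hypothetical) -->
<!DOCTYPE catalog PUBLIC "-//Example//DTD Book Catalog 1.0//EN"
          "http://www.example.com/dtds/catalog.dtd">

<!-- Alternative 3: DTD carried inside the document (internal subset) -->
<!DOCTYPE catalog [
  <!ELEMENT catalog (book+)>
  <!ELEMENT book (#PCDATA)>
]>
```

In every form, the name after !DOCTYPE (catalog here) must match the document's root element.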
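When a validating parser opens an XML document with an
associated DTD, it checks the document against the DTD. If the
document violates the definition contained in the DTD, the parser reports a
validity error, and many parsers will refuse to process the document further.
A nonvalidating parser skips this check and verifies only that the document is
well formed.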
There are a number of available industry-standard DTDs that are specifically designed for various kinds of information. Because these DTDs are predefined and the XML parser does all the work of validation for you, they are easy to implement in an XML application. Using a standard DTD will definitely be worthwhile if you expect to exchange information with other applications or with other companies, and a standardized data format always makes life easier for application integrators. XML.org maintains a searchable catalog of standard DTDs grouped by industry, and I'm sure you'll find dozens of other such resources.
Defining structure for your data
The DTD for an XML document consists of a list of declarations. Each element declaration includes a content model, which defines the elements allowed inside that element and the order in which they should appear. The general format of an element declaration looks like this:
<!ELEMENT elementname (elementtype modifiers)>
The modifiers given to a DTD element define the allowed contents and number of
occurrences of an element, as shown in Table A.
Table A: Some frequently used DTD element modifiers
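The most common modifiers can be seen side by side in a short DTD fragment like the one below. Apart from catalog and book, the element names here are hypothetical and chosen only to illustrate each modifier; they don't come from Listing B.

```xml
<!-- + : one or more occurrences -->
<!ELEMENT catalog (book+)>

<!-- , : sequence, in the given order; ? : zero or one occurrence -->
<!ELEMENT heading (title, subtitle?)>

<!-- * : zero or more occurrences -->
<!ELEMENT notes (note*)>

<!-- | : choice, exactly one of the alternatives -->
<!ELEMENT contact (email | phone)>

<!-- EMPTY : the element may not have any content -->
<!ELEMENT separator EMPTY>

<!-- #PCDATA : the element holds text and no child elements -->
<!ELEMENT title (#PCDATA)>
```

An element name with no modifier after it, such as title in the heading declaration, must appear exactly once.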
Check out Listing B, which is a DTD that describes the structure of the book catalog found in Listing A. Let's walk through the DTD and examine it piece by piece.
First, we see the element definition for the root catalog element, which must contain one or more book elements (denoted by the + modifier appearing after the book element name):
<!ELEMENT catalog (book+)>
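To see what the + modifier means for validation, consider this small self-contained sketch. It carries a simplified DTD in its internal subset, in which book holds plain text (an assumption made to keep the example short, not the definition used in Listing B). A validating parser should reject this document, because the catalog element has no book children.

```xml
<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ELEMENT catalog (book+)>
  <!-- Simplified for this sketch: book holds text only -->
  <!ELEMENT book (#PCDATA)>
]>
<!-- Not valid: (book+) requires at least one book child -->
<catalog></catalog>
```

Adding a single book element inside catalog would make the document valid again. Changing the content model to (book*) would also allow the empty catalog.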
On the next several lines, you'll see the definition for the book element itself, which lists the child elements a book must contain and the order in which they must appear.
Next, the DTD defines the simple elements used in the book element: author, title, genre, price, publish_date, and description. All these elements are defined as parsed character data (#PCDATA), which basically means that they are data-carrying elements and may not hold other elements.
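Listing B isn't reproduced here, so the following is only a sketch of what these declarations could look like. It assumes each child element appears exactly once, in the order the elements are named above; the real listing may use different modifiers or a different order.

```xml
<!-- Assumed content model: each child exactly once, in this order -->
<!ELEMENT book (author, title, genre, price, publish_date, description)>

<!-- Simple data-carrying elements: text only, no child elements -->
<!ELEMENT author (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT genre (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT publish_date (#PCDATA)>
<!ELEMENT description (#PCDATA)>
```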
What's the rest of that stuff?
Although the DTD in Listing B is perfectly valid, I've broken with convention a
bit here. I didn't define things in the order they are used in the accompanying
document, and I saved one, the book element's id attribute, for last, so
that I could discuss it separately. In practice, you'll want to define elements
in the order they will appear in your XML document and define attributes
immediately after the elements they apply to.
The book element includes a single attribute, id, which exists only to ensure that each book in the catalog has a unique identifier. Attributes are declared with an ATTLIST statement, which has this general format:
<!ATTLIST associatedelement
    attributename1 attributetype1 modifiers
    attributename2 attributetype2 modifiers
>
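Associatedelement is the element for which the attribute or attributes are defined, and it's followed by a list of all the defined attributes. Attributename is the name of the attribute, followed by attributetype, which is the type of the attribute. Allowed types for an attribute include CDATA (plain character data); ID (a name that must be unique within the document); IDREF and IDREFS (one or more references to ID values elsewhere in the document); NMTOKEN and NMTOKENS (one or more name tokens); ENTITY and ENTITIES (names of unparsed entities); NOTATION; and an enumerated list of allowed values, such as (yes | no).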
The usefulness of some of these types will become evident only when you begin performing queries or transforms on a document: Don't worry; those topics will be coming soon to a development Web site near you.
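Attributes can also have a modifier, which gives the parser additional information about the attribute, usually its default value. Attribute modifiers include #REQUIRED (the attribute must always be supplied), #IMPLIED (the attribute is optional and has no default), #FIXED followed by a quoted value (the attribute always has that value), and a quoted value on its own (the default used when the attribute is omitted).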
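A few attribute declarations help tie the types and modifiers together. The first line is a plausible declaration for the book element's id attribute, assuming it is declared as a required ID; Listing B may differ. The binding, lang, and currency attributes are hypothetical and are included only to show the other modifiers.

```xml
<!-- Assumed form of the id declaration: required, unique in the document -->
<!ATTLIST book id ID #REQUIRED>

<!-- Hypothetical attributes illustrating the remaining modifiers -->
<!ATTLIST book
    binding (hardcover | paperback) "paperback"
    lang    NMTOKEN                 #IMPLIED
>
<!ATTLIST price currency CDATA #FIXED "USD">
```

Note that an ID value must be a legal XML name, so it can't begin with a digit: id="bk101" is acceptable, but id="101" is not.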
The bad news
DTDs have one major shortcoming: It's not possible to enforce any type rules for a particular element, beyond whether an element must contain other elements, remain empty, or contain data. A DTD can't require, for example, that a price element hold a number or that a publish_date element hold a valid date. So their usefulness is limited in situations where you want to check that more than just the structure of a document is correct. For those situations, you'll want to turn to XML schemas, which we'll look at in the next article in this series.
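The limitation is easy to demonstrate with a small self-contained sketch. The trimmed-down DTD and the sample values below are invented for illustration and aren't taken from Listing A or Listing B. The document is structurally correct, so it should pass DTD validation even though the price and date are nonsense.

```xml
<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ELEMENT catalog (book+)>
  <!ELEMENT book (title, price, publish_date)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ELEMENT publish_date (#PCDATA)>
  <!ATTLIST book id ID #REQUIRED>
]>
<catalog>
  <book id="bk101">
    <title>A Sample Title</title>
    <!-- Any text satisfies #PCDATA, so neither value is rejected -->
    <price>not a number</price>
    <publish_date>sometime last year</publish_date>
  </book>
</catalog>
```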