Remedial XML: Enforcing document formats with DTDs

To enforce structure requirements for an XML document, you have to turn to one of XML's attendant technologies: the document type definition (DTD). That's the subject of this article.
Written by Lamont Adams, Contributor
Welcome back to my series on the fundamentals of XML for programmers. The first article in this series introduced you to the basic structure of an XML document and the syntax involved in defining that structure. If you'll recall, an XML document by itself only defines the structure of the data it contains. It makes no assumptions about the type of data an element contains or the structure a document should have. This means that as long as the basic rules of XML syntax are met, you can basically stick anything you want into a given element; it's up to you to make sure it's relevant. While operating on the honor system like this may be a good method for distributing soda, it's probably not the best way to ensure that your application's data stays coherent.

To enforce structure requirements for an XML document, you have to turn to one of XML's attendant technologies: the document type definition (DTD). That's the subject of this article. I'll be referring to the same sample XML document as before, which is included for your reference in Listing A.

DTDs defined
DTDs are the oldest and simplest way of specifying the format for an XML document. Like most things XML, the official definition of a DTD is maintained by the World Wide Web Consortium, or W3C, and you can read about it as part of W3C's current XML specification.

DTDs work by specifying the elements and attributes allowed in a given XML document, how many times each element may occur, and the order in which elements may appear inside other elements. A DTD also gives you a limited form of data type enforcement, in the sense that you can specify whether an element should contain other elements, contain data, or remain empty.
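An XML document declares that it has an associated DTD by including a <!DOCTYPE> declaration in the document prolog, before the root element. If the SYSTEM keyword appears, as in Listing A, it is followed by a system identifier: a URI, often just a relative filename, that locates the DTD. If the PUBLIC keyword appears, it is followed by a public identifier, which is a well-known name for the DTD, and then by a system identifier to fall back on. The declarations may also be written directly in the document, between square brackets inside the <!DOCTYPE> declaration. A document that does not depend on any external declarations can declare itself "stand alone" by putting standalone="yes" in its XML declaration.
When a validating parser opens an XML document with an associated DTD, it checks the document against the DTD. If the document violates the definition contained in the DTD, the parser reports a validity error, and the application can reject the document. A non-validating parser checks only that the document is well formed.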
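
To make the DOCTYPE forms concrete, here is a minimal sketch of a document that carries its DTD inside the DOCTYPE declaration, with the two external forms shown in a comment for comparison. The file name, the public identifier, and the URL are placeholders of my own and are not taken from Listing A.

```xml
<?xml version="1.0" standalone="yes"?>
<!-- External DTD located by a system identifier (placeholder file name):
       <!DOCTYPE catalog SYSTEM "catalog.dtd">
     External DTD named by a public identifier, with a system identifier as fallback
     (both values are placeholders):
       <!DOCTYPE catalog PUBLIC "-//Example//DTD Catalog//EN" "http://www.example.com/catalog.dtd">
     The form used below keeps the declarations inside the document itself. -->
<!DOCTYPE catalog [
  <!ELEMENT catalog (#PCDATA)>
]>
<catalog>The declarations travel with the document.</catalog>
```

If you have the xmllint tool installed, running it with the --valid option on a file like this is one way to see a validating parser at work.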

There are a number of available industry-standard DTDs that are specifically designed for various kinds of information. Because these DTDs are predefined and the XML parser does all the work of validation for you, they are easy to implement in an XML application. Using a standard DTD will definitely be worthwhile if you expect to exchange information with other applications or with other companies, and a standardized data format always makes life easier for application integrators. XML.org maintains a searchable catalog of standard DTDs grouped by industry, and I'm sure you'll find dozens of other such resources.

Defining structure for your data
The DTD for an XML document consists of a list of declarations. Each element declaration names an element and gives its content model, which defines the child elements it allows and the order in which they must appear. The general format of an element declaration looks like this:

<!ELEMENT elementname (elementtype modifiers)>

The modifiers given to a DTD element define the allowed contents and number of occurrences of an element, as shown in Table A.

Modifier   Example                 Meaning
None       SomeElement (A)         Must contain exactly one occurrence of A
?          SomeElement (B?)        May contain zero or one occurrence of B
*          SomeElement (C*)        May contain zero or more occurrences of C
+          SomeElement (D+)        Must contain one or more occurrences of D
|          SomeElement (E|F)       Must contain either one occurrence of E or one occurrence of F
EMPTY      SomeElement EMPTY       Element is always empty and may not contain anything
#PCDATA    SomeElement (#PCDATA)   May contain only text (parsed character data), not child elements

Table A: Some frequently used DTD element modifiers
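
Here is a small sketch that puts the modifiers to work in one place. The order document and all of its element names are invented for illustration. Note the commas in the content model: a comma-separated list means the child elements must appear in exactly that order.

```xml
<?xml version="1.0"?>
<!DOCTYPE order [
  <!-- Hypothetical content model: exactly one customer, an optional note,
       any number of coupons, at least one item, then either cash or card. -->
  <!ELEMENT order    (customer, note?, coupon*, item+, (cash | card))>
  <!ELEMENT customer (#PCDATA)>
  <!ELEMENT note     (#PCDATA)>
  <!ELEMENT coupon   (#PCDATA)>
  <!ELEMENT item     (#PCDATA)>
  <!ELEMENT cash     EMPTY>
  <!ELEMENT card     EMPTY>
]>
<order>
  <customer>Pat Example</customer>
  <item>Widget</item>
  <item>Gadget</item>
  <cash/>
</order>
```

This instance is valid because the optional note and coupon elements may be left out. Removing both item elements, or including both cash and card, would make it invalid.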

Check out Listing B, which is a DTD that describes the structure of the book catalog found in Listing A. Let's walk through the DTD and examine it piece by piece.

First, we see the element definition for the root catalog element, which must contain one or more book elements (denoted by the + modifier appearing after the book element name):

<!ELEMENT catalog (book+)>

On the next several lines, you'll see the definition for the book element itself, which must contain, in order:

  • One or more author elements, since a book may have coauthors.
  • A single title element.
  • One or more genre elements, because a book may fit in multiple genres.
  • A single price element.
  • A single publish_date element.
  • Zero or one description element, because the description is optional.

Next, the DTD defines the simple elements used in the book element: author, title, genre, price, publish_date, and description. All these elements are defined as parsed character data (#PCDATA), which basically means that they are data-carrying elements and may not hold other elements.
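
The walk-through above can be sketched as a self-contained document. I've reconstructed the declarations from the description rather than copying Listing B, so the real listing may differ in detail, and the book data is made up. The id attribute is left out here because it is covered separately.

```xml
<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!-- Reconstructed from the prose description; not a copy of Listing B. -->
  <!ELEMENT catalog (book+)>
  <!ELEMENT book (author+, title, genre+, price, publish_date, description?)>
  <!ELEMENT author       (#PCDATA)>
  <!ELEMENT title        (#PCDATA)>
  <!ELEMENT genre        (#PCDATA)>
  <!ELEMENT price        (#PCDATA)>
  <!ELEMENT publish_date (#PCDATA)>
  <!ELEMENT description  (#PCDATA)>
]>
<catalog>
  <book>
    <author>Jane Example</author>
    <author>John Sample</author>
    <title>A Hypothetical Title</title>
    <genre>Computer</genre>
    <price>29.95</price>
    <publish_date>2001-01-01</publish_date>
  </book>
</catalog>
```

The book has two authors and no description, both of which the content model allows. Moving title ahead of the author elements would break the required order and fail validation.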

What's the rest of that stuff?
Although the DTD in Listing B is perfectly valid, I've broken with convention a bit here. I didn't define things in the order they are used in the accompanying document, and I saved one item, the book element's id attribute, for last so that I could discuss it separately. In practice, you'll want to define elements in the order they will appear in your XML document and define attributes immediately after the elements they apply to.

The book element includes a single attribute, which exists only to ensure that each book in the catalog has a unique identifier. Attributes are declared with an ATTLIST declaration, which has this general form:

<!ATTLIST associatedelement
   attributename1 attributetype1 modifiers
   attributename2 attributetype2 modifiers>

Associatedelement is the element for which the attribute or attributes are defined, and it's followed by a list of all the defined attributes. Each attributename is the name of an attribute and is followed by its attributetype, which is the type of the attribute. Allowed types for an attribute include:

  • CDATA: This is a simple string value, used when the attribute contains actual data instead of a reference to another element.
  • ID: This is a unique identifier whose value may not be repeated anywhere else in the document. Only one attribute may be defined as an ID for a particular element, and it must be either required or implied (see below).
  • IDREF: This is a reference to another element by its ID attribute.
  • NMTOKEN: This is a name token, a single word made up of letters, digits, periods, hyphens, underscores, and colons. The related NMTOKENS type holds a space-delimited list of such tokens.
  • Enumeration: The attribute must take one of the values listed in the declaration, for example (yes | no). Each listed value must be a name token.

The usefulness of some of these types will become evident only when you begin performing queries or transforms on a document. Don't worry; those topics will be coming soon to a development Web site near you.
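
As a rough illustration of the attribute types, here is a sketch with one attribute of each kind on a simplified book element. Only the id attribute comes from the article; the other attribute names and all of the values are assumptions of mine.

```xml
<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ELEMENT catalog (book+)>
  <!ELEMENT book (#PCDATA)>
  <!-- Every attribute except id is hypothetical. -->
  <!ATTLIST book
     id         ID                      #REQUIRED
     sequel_to  IDREF                   #IMPLIED
     shelf      NMTOKEN                 #IMPLIED
     notes      CDATA                   #IMPLIED
     format     (hardcover | paperback) #IMPLIED>
]>
<catalog>
  <book id="bk101" shelf="A-12" notes="Signed by the author" format="paperback">First volume</book>
  <book id="bk102" sequel_to="bk101">Second volume</book>
</catalog>
```

A validating parser would reject this document if two books shared an id, if sequel_to named an id that doesn't exist, or if format held a value outside the list.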

Attributes can also have any of the following modifiers, which give the parser additional information about the attribute, usually its default value:

  • #REQUIRED: The attribute is required and must always be defined and have a value.
  • #IMPLIED: The attribute has no default value and is optional.
  • #FIXED: The attribute always has a particular value. The value should follow the modifier. Example: <!ATTLIST State SalesTax CDATA #FIXED "0.06">.
  • A literal string that defines the default value of an attribute. If an element does not explicitly contain the attribute in the document being parsed, the literal string will be the attribute's value.
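
The four kinds of modifier can be seen side by side in a sketch like this one. State and SalesTax come from the example above; the Code, Nickname, and Country attributes are hypothetical additions.

```xml
<?xml version="1.0"?>
<!DOCTYPE State [
  <!ELEMENT State (#PCDATA)>
  <!-- Code, Nickname, and Country are invented for this illustration. -->
  <!ATTLIST State
     Code      CDATA #REQUIRED
     Nickname  CDATA #IMPLIED
     SalesTax  CDATA #FIXED "0.06"
     Country   CDATA "US">
]>
<State Code="KY">Kentucky</State>
```

Only Code is written out in the document. The parser supplies SalesTax="0.06" and Country="US" on its own, and Nickname is simply absent. Writing SalesTax="0.07" in the document would be a validity error because the value is fixed.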

The bad news
DTDs have one shortcoming: It's not possible to enforce any type rules for a particular element, beyond whether an element must contain other elements, remain empty, or contain data. So, their usefulness is somewhat limited in situations where you want to check that more than just the structure of a document is correct. For those situations, you'll want to turn to XML schemas, which we'll look at in the next article in this series.
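
To see the limitation in practice, consider this sketch, which uses a cut-down book element of my own devising. The document is perfectly valid as far as the DTD is concerned, even though the price is nonsense.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book  (title, price)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
]>
<book>
  <title>A Hypothetical Title</title>
  <!-- Valid: the DTD can only say that price holds text, not that it holds a number. -->
  <price>not a number</price>
</book>
```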
