Welcome back to my series on the fundamentals of XML for programmers. The first
article in this series introduced you to the basic structure of an XML
document and the syntax involved in defining that structure. If you'll recall,
an XML document by itself only defines the structure of the data it contains. It
makes no assumptions about the type of data an element contains or the structure
a document should have. This means that as long as the basic rules of XML syntax
are met, you can basically stick anything you want into a given element; it's up
to you to make sure it's relevant. While operating on the honor system like this
may be a good method for distributing soda, it's probably not the best way to
ensure that your application's data stays coherent.
To enforce structure requirements for an XML document, you have to turn to one
of XML's attendant technologies, the document type definition (DTD). And that's
the subject of this article. I'll be referring to the same sample XML document
as before, included for your reference in Listing A.
DTDs are the oldest and simplest way of specifying the format for an XML
document. Like most things XML, the official definition of a DTD is maintained
by the World Wide Web Consortium, or W3C, and you can read about it as part of
W3C's current XML specification.
DTDs work by specifying the allowed elements and attributes, the number
of occurrences of each element, and the order in which elements may appear in
other elements for a given XML document. A DTD also gives you a limited form of
data type enforcement, in the sense that you can specify whether an element
should contain other elements or data or remain empty.
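An XML document
declares that it has an associated DTD by including a !DOCTYPE declaration in
the document prolog. If the SYSTEM keyword appears, as in Listing A, the
declaration gives a system identifier: the filename or URL of the DTD. If the
PUBLIC keyword appears, the declaration gives a public identifier (a well-known
name for the DTD), followed by a system identifier that says where to find it.
The DTD may also be included in the document itself, as an internal subset
inside the !DOCTYPE declaration. A document can additionally be declared
"standalone" in its XML declaration, which tells the parser that no
declarations outside the document affect its content.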
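To make the three forms concrete, here is a minimal Python sketch that reads the !DOCTYPE declaration back out of a document using the standard library's minidom. The filename catalog.dtd, the public identifier, and the example.com URL are placeholders I made up for illustration; they are not taken from Listing A.

```python
from xml.dom import minidom

# Hypothetical documents: "catalog.dtd", the public identifier, and the URL
# are placeholder values invented for this sketch.
SYSTEM_DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE catalog SYSTEM "catalog.dtd">\n'
    '<catalog/>'
)
PUBLIC_DOC = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE catalog PUBLIC "-//Example//DTD Catalog//EN" '
    '"http://example.com/catalog.dtd">\n'
    '<catalog/>'
)
INTERNAL_DOC = (
    '<?xml version="1.0" standalone="yes"?>\n'
    '<!DOCTYPE catalog [<!ELEMENT catalog EMPTY>]>\n'
    '<catalog/>'
)


def describe_doctype(xml_text):
    """Return (root element name, public identifier, system identifier)."""
    doctype = minidom.parseString(xml_text).doctype
    return (doctype.name, doctype.publicId, doctype.systemId)


for sample in (SYSTEM_DOC, PUBLIC_DOC, INTERNAL_DOC):
    print(describe_doctype(sample))
```

Note that minidom only records the identifiers. It does not fetch the external DTD or validate the document against it.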
When a validating parser opens an XML document with an
associated DTD, it checks the document against the DTD. If the
document violates the definition contained in the DTD, the parser reports a
validity error, and your application can reject the document. A nonvalidating
parser checks only that the document is well formed.
There are a number of available industry-standard DTDs that are specifically
designed for various kinds of information. Because these DTDs are predefined and
the XML parser does all the work of validation for you, they are easy to
implement in an XML application. Using a standard DTD will definitely be
worthwhile if you expect to exchange information with other applications or with
other companies, and a standardized data format always makes life easier for
application integrators. XML.org maintains a searchable
catalog of standard DTDs grouped by industry, and I'm sure you'll find dozens of
other such resources.
Defining structure for your data
The DTD for an XML document consists of a list of declarations defining the allowed elements and the order in which they should appear in a document. The parenthesized part of an element declaration, which describes what the element may contain, is referred to as its content model. The general format of an element declaration looks like this:
<!ELEMENT elementname (elementtype modifiers)>
The modifiers given to a DTD element define the allowed contents and number of
occurrences of an element, as shown in Table A.
|Modifier ||Example ||Meaning|
|(none) ||(A) ||May contain one and only one occurrence of A|
|? ||(B?) ||May contain zero or one occurrence of B|
|* ||(C*) ||May contain zero, one, or more occurrences of C|
|+ ||(D+) ||May contain one or more occurrences of D; at least one occurrence is always required|
|vertical bar ||(E|F) ||May contain either one occurrence of E or one occurrence of F|
|EMPTY ||SomeElement EMPTY ||Element is always empty and may not contain anything|
|#PCDATA ||SomeElement (#PCDATA) ||May contain any form of non-element data|
Table A: Frequently used DTD element modifiers
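If you already know regular expressions, the modifiers will look familiar, and the resemblance is close enough to build on. The following Python sketch translates a parenthesized content model into a regular expression and checks a list of child element names against it. This is only an illustration of what the modifiers mean, not how a real validating parser is implemented; the function names are my own, and the sketch handles element content only (not EMPTY or #PCDATA).

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content model such as "(author+, title)" into a regex.

    Each element name becomes a group that matches the name plus one trailing
    space, so the ?, * and + modifiers and the | choice carry over unchanged.
    """
    pattern = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[()?*+|,]", model):
        if token == ",":
            continue  # a comma-separated sequence is plain concatenation
        if token == "(":
            pattern.append("(?:")
        elif token in ")?*+|":
            pattern.append(token)
        else:
            pattern.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(pattern))


def children_match(model, child_names):
    """Check a list of child element names against a content model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


print(children_match("(D+)", ["D", "D"]))    # one or more D: allowed
print(children_match("(D+)", []))            # zero D: not allowed
print(children_match("(E|F)", ["E", "F"]))   # either E or F, not both
```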
Check out Listing B, which is a DTD that describes the structure of
the book catalog found in Listing A. Let's walk through the DTD and examine it
piece by piece.
First, we see the element definition for the root catalog element, which
must contain one or more book elements (denoted by the + modifier appearing
after the book element name):
<!ELEMENT catalog (book+)>
On the next several lines, you'll see the definition for the book element itself, which must contain, in order:
One or more author elements, since a book may have coauthors.
A single title element.
One or more genre elements, because a book may fit in multiple genres.
A single price element.
A single publish_date element.
Zero or one description element, because the description is optional.
Next, the DTD defines the simple elements used in the book element:
author, title, genre, price, publish_date,
and description. All these elements are defined as Parsed Character Data
(PCDATA), which basically means that they are data-carrying
elements and may not hold other elements.
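Putting the pieces together, the element declarations described so far might look like the internal subset in the sketch below. I've reconstructed the declarations from the descriptions above rather than copying them from Listing B, and the sample book data is invented. The Python code uses the standard library's expat bindings, which read the declarations but do not validate the document against them.

```python
from xml.parsers import expat

# Declarations reconstructed from the prose (an assumption, not Listing B);
# the book data is made up for illustration.
CATALOG_XML = """<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ELEMENT catalog (book+)>
  <!ELEMENT book (author+, title, genre+, price, publish_date, description?)>
  <!ELEMENT author (#PCDATA)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT genre (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ELEMENT publish_date (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>
<catalog>
  <book>
    <author>A. Writer</author>
    <title>An Example Title</title>
    <genre>Reference</genre>
    <price>9.95</price>
    <publish_date>2001-01-01</publish_date>
  </book>
</catalog>
"""


def declared_elements(xml_text):
    """Return the element names declared in a document's internal DTD subset."""
    names = []
    parser = expat.ParserCreate()
    # expat calls this handler once for each <!ELEMENT ...> declaration.
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    parser.Parse(xml_text, True)
    return names


print(declared_elements(CATALOG_XML))
```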
What's the rest of that stuff?
Although the DTD in Listing B is perfectly valid, I've broken with convention a
bit here. I didn't define things in the order they are used in the accompanying
document, and I saved one, the book element's id attribute, for last, so
that I could discuss it separately. In practice, you'll want to define elements
in the order they will appear in your XML document and define attributes
immediately after the elements they apply to.
The book element includes a single attribute, which exists only to ensure
that each book in the catalog has a unique identifier. The declaration for that
identifying attribute follows the general format for attribute declarations:
<!ATTLIST associatedelement
attributename1 attributetype1 modifiers
attributename2 attributetype2 modifiers>
Associatedelement is the element for which the attribute or attributes
are defined, and it's followed by a list of all the defined attributes.
Attributename is the name of the attribute, followed by
attributetype, which is the type of the attribute. Allowed types for an
attribute include:
CDATA: This is a simple string value, used when the attribute contains actual data instead of a reference to another element.
ID: This is a unique identifier. Only one attribute may be defined as an ID for a particular element, and it must be either required or implied (see below).
IDREF: This is a reference to another element by its ID attribute.
NMTOKEN: This is a single name token, a value made up only of name characters (letters, digits, periods, hyphens, underscores, and colons). NMTOKENS allows a space-delimited list of such tokens.
Enumeration: The value must be one of the name tokens given in a declared list of allowed values.
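To show what the ID and IDREF types buy you, here is a rough Python sketch of the two checks a validating parser performs for them: every ID value must be unique, and every IDREF must point at an existing ID. Python's standard library does not validate against DTDs, so the sketch takes the attribute names as parameters instead of reading them from an ATTLIST declaration. The sequel_to attribute and the book identifiers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the identifiers and the sequel_to attribute are
# invented for this sketch.
SAMPLE = """<catalog>
  <book id="bk1"/>
  <book id="bk2" sequel_to="bk1"/>
  <book id="bk2" sequel_to="bk9"/>
</catalog>"""


def check_ids(xml_text, id_attr="id", idref_attr=None):
    """Return the ID/IDREF problems found, as a list of messages."""
    problems = []
    seen = set()
    root = ET.fromstring(xml_text)
    # First pass: every value of the ID attribute must be unique.
    for element in root.iter():
        value = element.get(id_attr)
        if value is None:
            continue
        if value in seen:
            problems.append("duplicate ID: " + value)
        seen.add(value)
    # Second pass: every IDREF must name an ID that exists in the document.
    if idref_attr is not None:
        for element in root.iter():
            ref = element.get(idref_attr)
            if ref is not None and ref not in seen:
                problems.append("dangling IDREF: " + ref)
    return problems


print(check_ids(SAMPLE, idref_attr="sequel_to"))
```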
The usefulness of some of these types will become evident only when you begin performing queries or transforms on a document: Don't worry; those topics will be coming soon to a development Web site near you.
Attributes can also have any of the following modifiers, which give the parser additional information about the attribute, usually its default value. Attribute modifiers include:
#REQUIRED: The attribute is required and must always be defined and have a value.
#IMPLIED: The attribute has no default value and is optional.
#FIXED: The attribute always has a particular value. The value should follow the modifier. Example: <!ATTLIST State SalesTax CDATA #FIXED "0.06">.
A literal string: This defines the default value of an attribute. If an element does not explicitly contain the attribute in the document being parsed, the literal string will be the attribute's value.
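You can watch a parser apply these modifiers with a few lines of Python. In this sketch, the State element and its SalesTax attribute come from the #FIXED example above; the region, name, and motto attributes are ones I invented to cover the other modifiers. The expat parser in Python's standard library is nonvalidating, so it won't complain about a missing #REQUIRED attribute, but it does fill in default and fixed values declared in the internal subset.

```python
from xml.parsers import expat

# Hypothetical document: only State and SalesTax come from the article; the
# other attributes are invented to illustrate each modifier.
STATE_XML = """<?xml version="1.0"?>
<!DOCTYPE State [
  <!ELEMENT State EMPTY>
  <!ATTLIST State
    SalesTax CDATA #FIXED "0.06"
    region   CDATA "unknown"
    name     CDATA #REQUIRED
    motto    CDATA #IMPLIED>
]>
<State name="Example"/>
"""


def read_attributes(xml_text):
    """Return {element name: attributes} as reported by the parser."""
    seen = {}
    parser = expat.ParserCreate()

    def start_element(name, attrs):
        # attrs includes values the parser supplied from the ATTLIST defaults.
        seen[name] = dict(attrs)

    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return seen


print(read_attributes(STATE_XML))
```

The document supplies only the name attribute, yet the parser also reports SalesTax and region with their declared values. The #IMPLIED motto attribute has no default, so it simply does not appear.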
The bad news
DTDs have one shortcoming: It's not possible to enforce any type rules for a particular element, beyond whether an element must contain other elements, remain empty, or contain data. So, their usefulness is somewhat limited in situations where you want to check that more than just the structure of a document is correct. For those situations, you'll want to turn to XML schemas, which we'll look at in the next article in this series.