Document Type Definitions

In order for a computer to process XML documents automatically, there needs to be something like a schema for the documents. That is, we need to be told what tags can appear in a collection of documents and how tags can be nested. The description of the schema is given by a grammar-like set of rules, called a document type definition, or DTD. It is intended that companies or communities wishing to share data will each create a DTD that describes the form(s) of the documents they share and establishes a shared view of the semantics of their tags. For example, there could be a DTD for describing protein structures, a DTD for describing the purchase and sale of auto parts, and so on.

The gross structure of a DTD is:
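As a sketch of that general shape (root-tag, element-name, and components are placeholders to be replaced, not literal text):

```xml
<!DOCTYPE root-tag [
    <!ELEMENT element-name (components)>
    <!-- more element declarations, one per kind of element -->
]>
```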

The root-tag is used (with its matching end tag) to surround a document that conforms to the rules of this DTD. An element is described by its name, which is the tag used to surround portions of the document that represent that element, and a parenthesized list of components. The latter are tags that may or must appear within the tags for the element being described. The exact requirements on each component are indicated in a manner we shall see shortly.

There is, however, an important special case: (#PCDATA) after an element name means that the element has a value that is text and has no tags nested within it.

Example (a): In Figure (a) we see a DTD for stars. The name of the DTD and of the surrounding root tag is STARS. (XML, unlike HTML, is case-sensitive, so the root tag must be spelled exactly as it is declared.) The first element definition says that inside the matching pair of tags <STARS>...</STARS> we will find zero or more STAR tags, each representing a single star. It is the * in (STAR*) that says "zero or more", i.e., "any number of".
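Written out, the declarations that this example goes on to describe would look roughly like the sketch below. It is reconstructed from the prose rather than copied from the figure. It assumes that STREET, CITY, TITLE, and YEAR are all declared as simple text, and that the DOCTYPE name is spelled STARS to match the root tag.

```xml
<!DOCTYPE STARS [
    <!ELEMENT STARS (STAR*)>
    <!ELEMENT STAR (NAME, ADDRESS+, MOVIES)>
    <!ELEMENT NAME (#PCDATA)>
    <!ELEMENT ADDRESS (STREET, CITY)>
    <!ELEMENT STREET (#PCDATA)>   <!-- assumed to be simple text -->
    <!ELEMENT CITY (#PCDATA)>     <!-- assumed to be simple text -->
    <!ELEMENT MOVIES (MOVIE*)>
    <!ELEMENT MOVIE (TITLE, YEAR)>
    <!ELEMENT TITLE (#PCDATA)>
    <!ELEMENT YEAR (#PCDATA)>
]>
```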

A DTD for movie stars

The second element, STAR, is declared to consist of three kinds of subelements: NAME, ADDRESS, and MOVIES. They must appear in this order, and each must be present. However, the + following ADDRESS says "one or more"; that is, there can be any number of addresses listed for a star, but there must be at least one. The NAME element is then declared to be #PCDATA, i.e., simple text. The fourth element says that an address element consists of fields for a street and a city, in that order.

Then, the MOVIES element is declared to have zero or more elements of type MOVIE within it; again, the * says "any number of". A MOVIE element is declared to consist of title and year fields, each of which is simple text. Figure (b) is an example of a document that conforms to the DTD of Figure (a).
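A small document of the kind this DTD permits might look like the following sketch. The addresses and the second star's name are invented placeholder values, chosen only to show two ADDRESS elements for one star and an empty MOVIES element for another.

```xml
<STARS>
    <STAR>
        <NAME>Carrie Fisher</NAME>
        <!-- ADDRESS+ : at least one address, here two (placeholder values) -->
        <ADDRESS><STREET>1 Placeholder St.</STREET><CITY>Hollywood</CITY></ADDRESS>
        <ADDRESS><STREET>2 Placeholder Ln.</STREET><CITY>Malibu</CITY></ADDRESS>
        <MOVIES>
            <MOVIE><TITLE>Star Wars</TITLE><YEAR>1977</YEAR></MOVIE>
            <MOVIE><TITLE>The Empire Strikes Back</TITLE><YEAR>1980</YEAR></MOVIE>
        </MOVIES>
    </STAR>
    <STAR>
        <NAME>A. N. Other</NAME>
        <ADDRESS><STREET>3 Placeholder Rd.</STREET><CITY>Brentwood</CITY></ADDRESS>
        <!-- MOVIE* : MOVIES must be present, but may contain no MOVIE at all -->
        <MOVIES></MOVIES>
    </STAR>
</STARS>
```

Each STAR lists its subelements in the declared order: NAME first, then the addresses, then MOVIES.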

Example of a document following the DTD of Figure (a)

The components of an element E are usually other elements. They must appear between the tags <E> and </E> in the order listed. However, there are several operators that control the number of times elements appear.

1. A * following an element means that the element may occur any number of times, including zero times.

2. A + following an element means that the element may occur one or more times.

3. A ? following an element means that the element may occur either zero times or one time, but no more.

4. The symbol | may appear between elements, or between parenthesized groups of elements, to indicate "or"; that is, either the element(s) on the left appear or the element(s) on the right appear, but not both. For instance, the expression (#PCDATA | (STREET, CITY)) as components for element ADDRESS would mean that an address could be either simple text, or consist of tagged street and city components.
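The four operators can be seen side by side in a sketch like the one below. It is a hypothetical variant of the STAR declarations, not part of the DTD discussed above; NICKNAME and POBOX are element names invented for illustration.

```xml
<!-- ? : at most one NICKNAME;  + : one or more ADDRESS -->
<!ELEMENT STAR (NAME, NICKNAME?, ADDRESS+, MOVIES)>

<!-- * : any number of MOVIE elements, including none -->
<!ELEMENT MOVIES (MOVIE*)>

<!-- | : either a street/city pair or a P.O. box, but not both -->
<!ELEMENT ADDRESS ((STREET, CITY) | POBOX)>

<!-- the invented elements need declarations of their own -->
<!ELEMENT NICKNAME (#PCDATA)>
<!ELEMENT POBOX (#PCDATA)>
```

The alternative here is written between two element groups rather than with #PCDATA, because validators that follow XML 1.0 accept #PCDATA next to | only in the repeated mixed-content form (#PCDATA | A | B)*.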

Using a DTD

If a document is intended to conform to a certain DTD, we can either:

a) Include the DTD itself as a preamble to the document, or

b) In the opening line, refer to the DTD, which must be stored separately in a file system accessible to the application that is processing the document.
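Option (a) might look like the following sketch, in which the DTD sits in the DOCTYPE preamble and the document body follows it. The star data is an invented placeholder, and the standalone parameter is simply left out here.

```xml
<?xml version="1.0"?>
<!DOCTYPE STARS [
    <!ELEMENT STARS (STAR*)>
    <!ELEMENT STAR (NAME, ADDRESS+, MOVIES)>
    <!ELEMENT NAME (#PCDATA)>
    <!ELEMENT ADDRESS (STREET, CITY)>
    <!ELEMENT STREET (#PCDATA)>
    <!ELEMENT CITY (#PCDATA)>
    <!ELEMENT MOVIES (MOVIE*)>
    <!ELEMENT MOVIE (TITLE, YEAR)>
    <!ELEMENT TITLE (#PCDATA)>
    <!ELEMENT YEAR (#PCDATA)>
]>
<STARS>
    <STAR>
        <NAME>A. N. Other</NAME>
        <ADDRESS><STREET>3 Placeholder Rd.</STREET><CITY>Brentwood</CITY></ADDRESS>
        <MOVIES></MOVIES>
    </STAR>
</STARS>
```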

Example (b): Here is how we might introduce the document of Figure (b) to assert that it is intended to conform to the DTD of Figure (a).
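A sketch of such an opening, assuming the DTD is stored in a file named star.dtd (a hypothetical name) alongside the document. XML 1.0 spells the declaration keywords xml, version, and standalone in lowercase, while DOCTYPE and SYSTEM are uppercase.

```xml
<?xml version="1.0" standalone="no"?>
<!DOCTYPE STARS SYSTEM "star.dtd">
<STARS>
    <!-- STAR elements go here, as in the document of Figure (b) -->
</STARS>
```

In this arrangement the separate file would hold only the <!ELEMENT ...> declarations, without the surrounding <!DOCTYPE ... [ ]> wrapper.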

The parameter standalone = "no" says that a DTD is being used. Recall that we set this parameter to "yes" when we did not wish to specify a DTD for the document. The location from which the DTD can be obtained is given in the !DOCTYPE clause, where the keyword SYSTEM followed by a file name gives this location.