XML and Its Data Model

XML (Extensible Markup Language) is a tag-based notation for "marking" documents, much like the familiar HTML or the less familiar SGML. A document is nothing more nor less than a file of characters. However, while HTML's tags talk about the presentation of the information contained in documents (for example, which portion is to be displayed in italics or what the entries of a list are), XML tags talk about the meaning of substrings within the document.

In this section we shall introduce the rudiments of XML. We shall see that it captures, in a linear form, the same structure as do the graphs of semistructured data introduced in "Semistructured Data". In particular, tags play the same role as the labels on the arcs of a semistructured-data graph. We then introduce the DTD ("document type definition"), which is a flexible form of schema that we can place on certain documents with XML tags.

Semantic Tags

Tags in XML are text surrounded by angle brackets, i.e., <...>, as in HTML. Also as in HTML, tags usually come in matching pairs, with a beginning tag like <FOO> and a matching ending tag that is the same word preceded by a slash, like </FOO>. In HTML there is an option to have tags with no matching ender, like <P> for paragraphs, but such tags are not permitted in XML. When tags come in matching begin-end pairs, there is a requirement that the pairs be nested. That is, between a matching pair <FOO> and </FOO>, there can be any number of other matching pairs, but if the beginning of a pair is in this range then the ending of the pair must also be in the range.
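The nesting requirement can be checked mechanically. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree parser; the helper is_well_nested and the tag names FOO, BAR, and P are placeholders chosen for illustration, not part of any standard.

```python
import xml.etree.ElementTree as ET


def is_well_nested(text: str) -> bool:
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# BAR opens and closes inside the FOO pair: properly nested.
nested = "<FOO><BAR>text</BAR></FOO>"

# BAR opens inside FOO but closes after </FOO>: the pairs overlap.
overlapping = "<FOO><BAR>text</FOO></BAR>"

# An opening tag with no matching ender, as HTML tolerates for <P>.
unclosed = "<FOO><P>text</FOO>"

print(is_well_nested(nested))       # True
print(is_well_nested(overlapping))  # False
print(is_well_nested(unclosed))     # False
```

The parser rejects both failing cases for the same reason: when it reaches </FOO>, the most recently opened tag is not FOO.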

XML is designed to be used in two somewhat different modes:

1. Well-formed XML allows you to invent your own tags, much like the arc-labels in semistructured data. This mode corresponds quite closely to semistructured data, in that there is no schema, and each document is free to use whatever tags the author of the document wishes.

2. Valid XML involves a Document Type Definition that specifies the allowable tags and gives a grammar for how they may be nested. This form of XML is intermediate between the strict-schema models such as the relational or ODL models, and the totally schemaless world of semistructured data. As we shall see in "Document Type Definitions", DTD's usually allow more flexibility in the data than does a conventional schema; DTD's often allow optional fields or missing fields, for example.

Well-Formed XML

The minimal requirement for well-formed XML is that the document begin with a declaration that it is XML and that it have a root tag surrounding the entire body of the text. Thus, a well-formed XML document has an outer structure consisting of an opening declaration, followed by a beginning root tag, the body, and the matching ending root tag.

The first line of such a document shows that the file is an XML document. The parameter STANDALONE = "yes" in that declaration shows that there is no DTD for this document; i.e., it is well-formed XML. Notice that this initial declaration is delimited by the special markers <?...?>.
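As a rough illustration of this outer structure, the sketch below parses a small document with Python's standard xml.etree.ElementTree module. Two things here are assumptions rather than part of the text: the root tag BODY and the inner NOTE tag are placeholders, and the declaration is written in the lowercase form of the XML 1.0 specification, which is the spelling the parser accepts, rather than the uppercase spelling used in the text.

```python
import xml.etree.ElementTree as ET

# Declaration first, then one root element around the whole body.
# BODY and NOTE are placeholder tags, not reserved names.
document = (
    '<?xml version="1.0" standalone="yes"?>\n'
    "<BODY>\n"
    "  <NOTE>any nested content goes here</NOTE>\n"
    "</BODY>\n"
)

root = ET.fromstring(document)
print(root.tag)  # BODY

# Two top-level elements means there is no single root tag,
# so the document is not well-formed.
two_roots = '<?xml version="1.0" standalone="yes"?>\n<A></A>\n<B></B>\n'
try:
    ET.fromstring(two_roots)
    has_single_root = True
except ET.ParseError:
    has_single_root = False
print(has_single_root)  # False
```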

Figure (a): An XML document about stars and movies

Example: Figure (a) shows an XML document that corresponds roughly to the data in "Semistructured Data Representation" Figure (a). The root tag is STAR-MOVIE-DATA. We see two sections, each surrounded by the tag <STAR> and its matching </STAR>. Within each section is a subsection giving the name of the star. One section, for Carrie Fisher, has two further subsections, each giving the address of one of her homes. These subsections are surrounded by an <ADDRESS> tag and its ender. The section for Mark Hamill has entries for only one street and one city, and does not use an <ADDRESS> tag to group them. This distinction appeared as well in "Semistructured Data Representation" Figure (a).
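The shape of the document described in this example can be sketched and inspected in Python. This is an approximation, not a reproduction of the figure: the tags STAR-MOVIE-DATA, STAR, and ADDRESS come from the description, while the tag names NAME, STREET, and CITY are assumed, and the street and city values are placeholders. Only the star sections described above are sketched.

```python
import xml.etree.ElementTree as ET

# Street and city values are placeholders, not data from the figure.
document = """<?xml version="1.0" standalone="yes"?>
<STAR-MOVIE-DATA>
  <STAR>
    <NAME>Carrie Fisher</NAME>
    <ADDRESS><STREET>street 1</STREET><CITY>city 1</CITY></ADDRESS>
    <ADDRESS><STREET>street 2</STREET><CITY>city 2</CITY></ADDRESS>
  </STAR>
  <STAR>
    <NAME>Mark Hamill</NAME>
    <STREET>street 3</STREET>
    <CITY>city 3</CITY>
  </STAR>
</STAR-MOVIE-DATA>
"""

root = ET.fromstring(document)
print(root.tag)  # STAR-MOVIE-DATA

# For each star, record how many ADDRESS groups it has and whether
# a STREET appears directly under STAR instead.
summary = {}
for star in root.findall("STAR"):
    name = star.findtext("NAME")
    summary[name] = (
        len(star.findall("ADDRESS")),
        star.find("STREET") is not None,
    )

print(summary)
# {'Carrie Fisher': (2, False), 'Mark Hamill': (0, True)}
```

Both star sections are accepted even though they are structured differently, which is the schemaless flexibility that well-formed XML shares with semistructured data.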

Notice that the document of Figure (a) does not represent the relationship "stars-in" between stars and movies. We could store information about each movie of a star within the section devoted to that star, for example as a subsection giving the movie's title and year.

On the other hand, that approach leads to redundancy, since all information about the movie is repeated for each of its stars. (We have shown no information except a movie's key, its title and year, and repeating the key alone does not in fact constitute redundancy.) We shall see in "Attribute Lists" how XML handles the problem that tags inherently form a tree structure.
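To make the repetition concrete, here is a small Python sketch of the nested approach, again using the standard xml.etree.ElementTree module. The tag names MOVIE, TITLE, and YEAR and the choice of Star Wars (1977) as the shared movie are illustrative assumptions. The sketch counts how many times the same movie key is written out; any further movie information stored in the same subsection would be repeated the same number of times.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Each star's section carries its own copy of the movie subsection.
document = """<?xml version="1.0" standalone="yes"?>
<STAR-MOVIE-DATA>
  <STAR>
    <NAME>Carrie Fisher</NAME>
    <MOVIE><TITLE>Star Wars</TITLE><YEAR>1977</YEAR></MOVIE>
  </STAR>
  <STAR>
    <NAME>Mark Hamill</NAME>
    <MOVIE><TITLE>Star Wars</TITLE><YEAR>1977</YEAR></MOVIE>
  </STAR>
</STAR-MOVIE-DATA>
"""

root = ET.fromstring(document)

# Count how many times each (title, year) key is written in the document.
copies = Counter(
    (movie.findtext("TITLE"), movie.findtext("YEAR"))
    for movie in root.iter("MOVIE")
)
print(copies[("Star Wars", "1977")])  # 2
```

Because each MOVIE subsection has exactly one parent STAR, a movie with several stars cannot be written once and shared within a pure tree of tags.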