Attribute Lists

There is a strong relationship between XML documents and semistructured data. Assume that for some pair of matching tags <T> and </T> in a document we make a node n. Then, if <S> and </S> are matching tags nested directly within the pair <T> and </T> (i.e., there are no matched pairs surrounding the S-pair but surrounded by the T-pair), we draw an arc labeled S from node n to the node for the S-pair. Then the result will be an example of semistructured data that has basically the same structure as the document.
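The construction just described can be sketched in a few lines of Python. This is only an illustration: the function name `document_to_arcs`, the integer node numbering, and the sample document are assumptions made for the example and do not come from the text.

```python
import xml.etree.ElementTree as ET


def document_to_arcs(xml_text):
    """Return (parent_node, label, child_node) arcs for an XML document.

    Each matched tag pair becomes one node (numbered in document order),
    and each directly nested pair contributes an arc labeled with its tag.
    """
    root = ET.fromstring(xml_text)
    arcs = []
    next_id = 0

    def visit(elem):
        nonlocal next_id
        node = next_id
        next_id += 1
        for child in elem:
            child_node = next_id  # the number the child is about to receive
            arcs.append((node, child.tag, child_node))
            visit(child)

    visit(root)
    return arcs


# Hypothetical document: two S-pairs nested directly in T, one U-pair in the second S.
SAMPLE = "<T><S>one</S><S><U>two</U></S></T>"
arcs = document_to_arcs(SAMPLE)
```

Because every tag pair has exactly one enclosing pair, each node produced this way has at most one incoming arc, so the result is always a tree.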

Unfortunately, the relationship does not go the other way with the limited subset of XML we have explained so far. We need a way to express in XML the idea that a node of the semistructured data (an instance of an element) may have more than one arc leading to it. Clearly, we cannot nest a tag-pair directly within more than one tag-pair, so nesting alone is not enough to represent multiple predecessors of a node. The additional features that allow us to represent all semistructured data in XML are attributes within tags, identifiers (ID's), and identifier references (IDREF's).

Opening tags can have attributes that appear within the tag, in analogy to constructs like <A HREF = . . . > in HTML. In a DTD, the keyword <!ATTLIST introduces a list of attributes and their types for a given element. One common use of attributes is to associate single, labeled values with a tag. This usage is an alternative to subtags that are simple text (i.e., declared as #PCDATA).
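As a small illustration of the two alternatives, the sketch below stores the same labeled value once as an attribute and once as a text-only subelement, then reads both back. The attribute name `year` and the subelement `YEAR` are hypothetical choices for this example.

```python
import xml.etree.ElementTree as ET

# The value attached to the opening tag as an attribute (hypothetical name "year").
as_attribute = ET.fromstring('<MOVIE year="1977"><TITLE>Star Wars</TITLE></MOVIE>')

# The same value carried by a subelement that contains only text.
as_subelement = ET.fromstring(
    "<MOVIE><TITLE>Star Wars</TITLE><YEAR>1977</YEAR></MOVIE>"
)

year_from_attribute = as_attribute.get("year")
year_from_subelement = as_subelement.findtext("YEAR")
```

Either form carries the same information; the attribute form is limited to a single string value per attribute name, whereas a subelement may be repeated or given further structure.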

Another important purpose of attributes is to represent semistructured data that does not have a tree form. An attribute for elements of type E that is declared to be an ID will be given values that uniquely identify each portion of the document that is surrounded by an <E> tag and its matching </E> tag. In terms of semistructured data, an ID provides a unique name for a node.

Other attributes may be declared to be IDREF's. Their values are the ID's associated with other tags. By giving one tag instance (i.e., a node in semistructured data) an ID with a value v and another tag instance an IDREF with value v, the latter is effectively given an arc or link to the former. The following example illustrates both the syntax for declaring ID's and IDREF's and the significance of using them in data.
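The linking mechanism can be sketched as follows. Note that Python's `xml.etree.ElementTree` does not validate against a DTD, so the attribute names `id` and `target` are merely assumed here to play the roles of an ID and an IDREF; the element names are also hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<DOC>
  <E id="n1"><NAME>first</NAME></E>
  <F target="n1"><NAME>second</NAME></F>
</DOC>"""

root = ET.fromstring(DOC)

# Index every element carrying the assumed ID attribute; ID values must be unique.
by_id = {}
for elem in root.iter():
    key = elem.get("id")
    if key is not None:
        if key in by_id:
            raise ValueError(f"duplicate ID {key!r}")
        by_id[key] = elem

# Follow the assumed IDREF attribute from the F element to the node it names.
source = root.find("F")
linked = by_id[source.get("target")]
```

Here the F element acquires an arc to the E element even though neither is nested within the other, which is exactly what nesting alone cannot express.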

A DTD for stars and movies

Example: Figure (a) shows a revised DTD, in which stars and movies are given equal status, and ID-IDREF correspondence is used to describe the many-many relationship between movies and stars. Analogously, the arcs between nodes representing stars and movies describe the same many-many relationship in the semistructured data of "Semistructured Data Representation" Figure (a). The name of the root tag for this DTD has been changed to STARS-MOVIES, and its elements are a sequence of stars followed by a sequence of movies.

A star no longer has a set of movies as subelements, as was the case for the DTD of "Document Type Definitions" Figure (a). Rather, its only subelements are a name and an address, and in the opening <STAR> tag we find an attribute starredIn whose value is a list of ID's for the movies of the star. Note that the attribute starredIn is declared to be of type IDREFS, rather than IDREF. The additional "S" allows the value of starredIn to be a list of ID's for movies, rather than the ID of a single movie, as would be the case if the type IDREF were used.

A <STAR> tag also has an attribute starId. Since it is declared to be of type ID, the value of starId may be referenced by <MOVIE> tags to indicate the stars of the movie. That is, when we look at the attribute list for MOVIE in Figure (a), we see that it has an attribute movieId of type ID; these are the ID's that will appear on lists that are the values of starredIn attributes. Symmetrically, the attribute starsOf of MOVIE is a list of ID's for stars.

Figure (b) is an instance of a document that conforms to the DTD of Figure (a). It is quite similar to the semistructured data of "Semistructured Data Representation" Figure (a), but it contains more data: three movies instead of only one. The only structural difference is that here, all stars have an ADDRESS subelement, even if they have only one address, while in "Semistructured Data Representation" Figure (a) we went directly from the Mark-Hamill node to street and city nodes.
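To make the ID/IDREFS correspondence concrete, here is a minimal sketch of a document in the spirit of Figure (b), together with code that follows the references in both directions. It is not a reproduction of the figure: the ID values, the second star, the addresses, and the TITLE and YEAR subelements of MOVIE are assumptions chosen for illustration. In XML, an IDREFS value is written as a whitespace-separated list of ID's, which is why the code splits on whitespace. No DTD validation is performed.

```python
import xml.etree.ElementTree as ET

# Illustrative instance; ID values, addresses, and MOVIE subelements are assumed.
STARS_MOVIES = """<STARS-MOVIES>
  <STAR starId="cf" starredIn="sw esb rj">
    <NAME>Carrie Fisher</NAME>
    <ADDRESS><STREET>123 Maple St.</STREET><CITY>Hollywood</CITY></ADDRESS>
  </STAR>
  <STAR starId="mh" starredIn="sw esb rj">
    <NAME>Mark Hamill</NAME>
    <ADDRESS><STREET>456 Oak Rd.</STREET><CITY>Brentwood</CITY></ADDRESS>
  </STAR>
  <MOVIE movieId="sw" starsOf="cf mh"><TITLE>Star Wars</TITLE><YEAR>1977</YEAR></MOVIE>
  <MOVIE movieId="esb" starsOf="cf mh"><TITLE>The Empire Strikes Back</TITLE><YEAR>1980</YEAR></MOVIE>
  <MOVIE movieId="rj" starsOf="cf mh"><TITLE>Return of the Jedi</TITLE><YEAR>1983</YEAR></MOVIE>
</STARS-MOVIES>"""

root = ET.fromstring(STARS_MOVIES)

# Index stars and movies by their ID attributes.
stars = {s.get("starId"): s for s in root.findall("STAR")}
movies = {m.get("movieId"): m for m in root.findall("MOVIE")}


def movies_of(star_id):
    """Titles of the movies named by a star's starredIn (IDREFS) attribute."""
    refs = stars[star_id].get("starredIn", "").split()
    return [movies[r].findtext("TITLE") for r in refs]


def stars_of(movie_id):
    """Names of the stars named by a movie's starsOf (IDREFS) attribute."""
    refs = movies[movie_id].get("starsOf", "").split()
    return [stars[r].findtext("NAME") for r in refs]


def is_symmetric():
    """Check that starredIn and starsOf describe the same star-movie pairs."""
    forward = {(s, m) for s, e in stars.items()
               for m in e.get("starredIn", "").split()}
    backward = {(s, m) for m, e in movies.items()
                for s in e.get("starsOf", "").split()}
    return forward == backward
```

The symmetry check is an extra consistency test added for this sketch; the DTD itself only requires that each referenced ID exist somewhere in the document, not that the two attribute lists agree.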

Example of a document following the DTD