Semistructured Data

The semistructured-data model plays an important role in database systems:

1. It serves as a model appropriate for integration of databases, that is, for describing the data held in two or more databases that contain similar data under different schemas.

2. It serves as a document model in notations such as XML, to be taken up in "XML and Its Data Model", that are being used to share information on the Web.

In this section, we shall introduce the basic ideas behind "semistructured data" and show how it can represent information more flexibly than the other models we have met previously.

Motivation for the Semistructured-Data Model

Let us begin by recalling the E/R model and its two basic kinds of data: the entity set and the relationship. Keep in mind also that the relational model has only one kind of data, the relation, yet we saw in "From E/R Diagrams to Relational Designs" how both entity sets and relationships could be represented by relations. There is an advantage to having two concepts: we could tailor an E/R design to the real-world situation we were modeling, using whichever of entity sets or relationships most closely matched the concept being modeled. There is also some advantage to replacing two concepts by one: the notation in which we express schemas is thereby simplified, and implementation techniques that make querying of the database more efficient can be applied to all sorts of data. We shall begin to appreciate these advantages of the relational model when we study implementation of the DBMS, starting in "Data Storage".

Now, let us examine the object-oriented model we introduced in "Introduction to ODL". There are two principal concepts, the class (or its extent) and the relationship. Similarly, the object-relational model has two analogous concepts: the attribute type (which includes classes) and the relation.

We may see the semistructured-data model as blending the two concepts, class-and-relationship or class-and-relation, much as the relational model blends entity sets and relationships. However, the motivation for the blending appears to be different in each case. While, as we mentioned, the relational model owes some of its success to the fact that it supports efficient implementation, interest in the semistructured-data model appears motivated mainly by its flexibility. While the other models seen so far each start from a notion of a schema (E/R diagrams, relation schemas, or ODL declarations, for example), semistructured data is "schemaless". More accurately, the data itself carries information about what its schema is, and that schema can vary arbitrarily, both over time and within a single database.
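
As a rough illustration of what "the data itself carries its schema" can mean, the following Python sketch stores two objects of the same kind side by side, each with its own set of labels. The names, labels, and values are invented for this example, and nested dictionaries are only one of several possible encodings of semistructured data, not the notation used in this text.

```python
# Hypothetical data: two "star" objects in one collection.
# Neither conforms to a schema declared in advance; each object
# carries its own labels, and the two label sets differ.
stars = [
    {
        "name": "Example Star One",
        # The address is itself a nested, labeled object.
        "address": {"street": "Maple", "city": "Hollywood"},
    },
    {
        "name": "Example Star Two",
        # Here street and city are attached directly, and an extra
        # label appears that the first object lacks.
        "street": "Oak",
        "city": "Brentwood",
        "birthdate": "unknown",
    },
]


def labels(obj):
    """Return the sorted labels an object carries.

    Since no schema is declared anywhere, the object is the only
    place from which this information can be read.
    """
    return sorted(obj.keys())


# The "schema" is recovered per object, and it varies within the collection.
schemas = [labels(star) for star in stars]
```

A relational design would force both objects into one fixed set of attributes; here each object can be inspected for the labels it happens to have, which is the flexibility the paragraph above describes.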