Information Integration Via Semistructured Data

Unlike the other models we have discussed, data in the semistructured model is self-describing; the schema is attached to the data itself. That is, each node (except the root) has an arc or arcs entering it, and the labels on these arcs tell what role the node is playing with respect to the node at the tail of the arc. In all the other models, data has a fixed schema, separate from the data, and the roles played by data items are implicit in the schema.

One might naturally wonder whether there is an advantage to creating a database without a schema, where one could enter data at will and attach to the data whatever schema information one felt was suitable for that data. There are in fact some small-scale information systems, such as Lotus Notes, that take the self-describing-data approach. On the other hand, when people design databases to hold large amounts of data, it is usually accepted that the advantages of fixing the schema far outweigh the flexibility that comes from attaching the schema to the data. For example, fixing the schema allows the data to be organized with data structures that support efficient answering of queries, as we shall discuss beginning in "Index Structures".

Yet the flexibility of semistructured data has made it important in two applications. We shall discuss its use in documents in "XML and Its Data Model", but here we shall examine its use as a tool for information integration. As databases have proliferated, it has become a common requirement that data in two or more of them be accessible as if they were one database. For example, companies may combine; each has its own personnel database, its own database of sales, inventory, product designs and perhaps many other matters. If corresponding databases had the same schemas, then combining them would be simple; for example, we could take the union of the tuples in two relations that had the same schema and played the same roles in the two databases.

Nevertheless, life is rarely that simple. Independently developed databases are unlikely to share a schema, even if they talk about the same things, such as personnel. For example, one employee database may record spouse-name, another not. One may have a way to represent several addresses, phones or emails for an employee, while another may allow only one of each. One database might be relational, another object-oriented.

To make matters more complex, databases tend over time to be used in so many different applications that it is impossible to shut them down and copy or translate their data into another database, even if we could figure out an efficient way to transform the data from one schema to another. This situation is frequently referred to as the legacy-database problem: once a database has been in existence for a while, it becomes impossible to disentangle it from the applications that grow up around it, so the database can never be decommissioned.

A possible solution to the legacy-database problem is suggested in Figure (a). We show two legacy databases with an interface; there could be many legacy systems involved. The legacy systems are each unchanged, so they can support their usual applications.

Integrating two legacy databases through an interface that supports semistructured data

For flexibility in integration, the interface supports semistructured data, and the user is allowed to query the interface using a query language that is suitable for such data. The semistructured data may be constructed by translating the data at the sources, using components called wrappers (or "adapters") that are each designed for the purpose of translating one source to semistructured data.
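The wrapper idea can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the Node class, the two wrapper functions, the shapes of the two sources (one relational tuple of name, spouse name and phone; one object-like record with a list of phones), and the sample values. Each wrapper translates one source into labeled arcs under a common root, so differences such as an optional spouse-name or several phones simply appear as different arcs.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple


@dataclass
class Node:
    """A node of a semistructured graph; leaf nodes carry a value."""
    value: Any = None
    arcs: List[Tuple[str, "Node"]] = field(default_factory=list)

    def add(self, label: str, child: "Node") -> "Node":
        """Attach child with a labeled arc and return the child."""
        self.arcs.append((label, child))
        return child

    def children(self, label: str) -> List["Node"]:
        """All nodes reached from this node by arcs with the given label."""
        return [child for (lbl, child) in self.arcs if lbl == label]


def wrap_relational(row, root):
    """Hypothetical wrapper for a source with tuples (name, spouse_name, phone)."""
    name, spouse_name, phone = row
    emp = root.add("employee", Node())
    emp.add("name", Node(name))
    if spouse_name is not None:
        emp.add("spouse-name", Node(spouse_name))
    emp.add("phone", Node(phone))
    return emp


def wrap_object(obj, root):
    """Hypothetical wrapper for an object-like source that allows many phones."""
    emp = root.add("employee", Node())
    emp.add("name", Node(obj["name"]))
    for number in obj["phones"]:
        emp.add("phone", Node(number))
    return emp


# Made-up sample data from the two assumed sources.
root = Node()
wrap_relational(("Ann", "Lee", "555-0100"), root)
wrap_object({"name": "Bob", "phones": ["555-0101", "555-0102"]}, root)

# One traversal works for both sources, whatever arcs each employee has.
for emp in root.children("employee"):
    name = emp.children("name")[0].value
    phones = [p.value for p in emp.children("phone")]
    print(name, phones)
```

Neither source is modified; only the wrappers know the source schemas, which is what lets the legacy systems keep serving their existing applications.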

Alternatively, the semistructured data at the interface may not exist at all. Rather, the user queries the interface as if there were semistructured data, while the interface answers the query by posing queries to the sources, each referring to the schema found at that source.
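A rough sketch of this second approach follows, using Python's standard sqlite3 module to stand in for two legacy sources. The table names, column names, sample rows and the function phones_of are all hypothetical. The point is only that no integrated copy of the data is built: a single question posed to the interface is answered by a separate query per source, each written against that source's own schema.

```python
import sqlite3


def make_source_a():
    """Hypothetical legacy source A with schema Employees(name, phone)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Employees (name TEXT, phone TEXT)")
    db.execute("INSERT INTO Employees VALUES ('Ann', '555-0100')")
    return db


def make_source_b():
    """Hypothetical legacy source B: Staff(id, full_name), Phones(staff_id, number)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE Staff (id INTEGER, full_name TEXT)")
    db.execute("CREATE TABLE Phones (staff_id INTEGER, number TEXT)")
    db.execute("INSERT INTO Staff VALUES (1, 'Bob')")
    db.executemany(
        "INSERT INTO Phones VALUES (?, ?)",
        [(1, "555-0101"), (1, "555-0102")],
    )
    return db


def phones_of(name, source_a, source_b):
    """Answer 'phones of this employee' by querying each source in its own schema."""
    rows_a = source_a.execute(
        "SELECT phone FROM Employees WHERE name = ?", (name,)
    ).fetchall()
    rows_b = source_b.execute(
        "SELECT number FROM Phones JOIN Staff ON Phones.staff_id = Staff.id "
        "WHERE Staff.full_name = ? ORDER BY number",
        (name,),
    ).fetchall()
    # Combine the per-source answers into one result for the user.
    return [phone for (phone,) in rows_a] + [number for (number,) in rows_b]


source_a = make_source_a()
source_b = make_source_b()
print(phones_of("Ann", source_a, source_b))
print(phones_of("Bob", source_a, source_b))
```

A real interface would derive the per-source queries from a query over the semistructured view; here they are written by hand to keep the sketch short.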

Example (a): We can see in "Semistructured Data Representation" Figure (a) a possible effect of information about stars being gathered from many sources. Notice that the address information for Carrie Fisher has an address concept, and the address is then broken into street and city. That situation corresponds roughly to data that had a nested-relation schema like Stars (name, address (street, city) ).

However, the address information for Mark Hamill has no address concept at all, just street and city. This information may have come from a schema such as Stars (name, street, city) that only has the ability to represent one address for a star. Some of the other variations in schema that are not reflected in the tiny example of "Semistructured Data Representation" Figure (a), but that could be present if movie information were acquired from various sources, include: optional film-type information, a director, a producer or producers, the owning studio, revenue and information on where the movie is currently playing.
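To make the two shapes concrete, here is a small sketch that represents both stars as nested Python dictionaries. The street and city values are placeholders rather than values taken from the figure, giving Carrie Fisher two addresses is an assumption, and the helper cities_of is hypothetical. It shows how a query over integrated data has to tolerate both the nested form and the flat form.

```python
# Placeholder data: one star with an explicit address concept, one without.
stars = [
    {
        "name": "Carrie Fisher",
        # As if from a nested-relation schema Stars(name, address(street, city)).
        "address": [
            {"street": "1 First St.", "city": "City A"},
            {"street": "2 Second St.", "city": "City B"},
        ],
    },
    {
        "name": "Mark Hamill",
        # As if from a flat schema Stars(name, street, city).
        "street": "3 Third St.",
        "city": "City C",
    },
]


def cities_of(star):
    """Return the cities for a star, whichever shape its data has."""
    if "address" in star:
        return [addr["city"] for addr in star["address"]]
    if "city" in star:
        return [star["city"]]
    return []


for star in stars:
    print(star["name"], cities_of(star))
```

Because the labels travel with the data, the query can inspect each star to see which arcs it has instead of relying on a single fixed schema.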