Semistructured Data Representation

A database of semistructured data is a collection of nodes. Each node is either a leaf or interior. Leaf nodes have associated data; the type of this data can be any atomic type, such as numbers and strings. Interior nodes have one or more arcs out. Each arc has a label, which shows how the node at the head of the arc relates to the node at the tail. One interior node, called the root, has no arcs entering and represents the entire database. Every node must be reachable from the root, although the graph structure is not necessarily a tree.
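The definition above can be made concrete with a small sketch. The class name Node, the helper reachable_from, and the arc label "star" on the arc leaving the root are all assumptions chosen for illustration; they are not part of the model itself. The sketch shows one possible way to hold leaves, interior nodes and labeled arcs, and to check the rule that every node must be reachable from the root.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple, Union

Atomic = Union[int, float, str]  # atomic types allowed at leaves


@dataclass(eq=False)  # identity-based equality, so nodes can be stored in sets
class Node:
    value: Optional[Atomic] = None  # data held by a leaf; None for interior nodes
    arcs: List[Tuple[str, "Node"]] = field(default_factory=list)  # (label, target)

    def is_leaf(self) -> bool:
        # A leaf has no arcs out; an interior node has one or more.
        return not self.arcs

    def add_arc(self, label: str, target: "Node") -> "Node":
        self.arcs.append((label, target))
        return target


def reachable_from(root: Node) -> Set[Node]:
    """Collect every node reachable from root; tolerates shared nodes and cycles."""
    seen: Set[Node] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(target for _, target in node.arcs)
    return seen


# A tiny database: root -> star -> name leaf (labels are illustrative).
root = Node()
star = root.add_arc("star", Node())
name = star.add_arc("name", Node(value="Carrie Fisher"))

# A node that was never attached to the graph violates the reachability rule.
orphan = Node(value=1977)
print(orphan in reachable_from(root))  # False
```

Because the traversal records visited nodes, it also works when the graph is not a tree, which the definition explicitly allows.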

Example (a): Figure (a) is an example of a semistructured database about stars and movies. We see a node at the top labeled Root; this node is the entry point to the data and may be thought of as representing all the information in the database. The central objects or entities (stars and movies in this case) are represented by nodes that are children of the root.

Figure (a): Semistructured data representing a movie and stars

We also see many leaf nodes. At the far left is a leaf labeled Carrie Fisher, and at the far right is a leaf labeled 1977, for instance. There are also many interior nodes. We have labeled three particular nodes cf, mh and sw, standing for "Carrie Fisher", "Mark Hamill" and "Star Wars" respectively. These labels are not part of the model; we placed them on these nodes only so we would have a way of referring to the nodes, which otherwise would be nameless. We may think of node sw, for instance, as representing the concept "Star Wars": the title and year of this movie, its stars (two of which are shown), and other information not shown, such as its length.

The labels on arcs play two roles, and therefore combine the information contained in class definitions and relationships. Assume we have an arc labeled L from node N to node M.

1. It may be possible to think of N as representing an object or struct, while M represents one of the attributes of the object or fields of the struct. Then, L represents the name of the attribute or field, respectively.

2. We may be able to think of N and M as objects, and L as the name of a relationship from N to M.

Example (b): Look at Figure (a) again. The node indicated by cf may be thought of as representing the Star object for Carrie Fisher. Leaving this node, we see an arc labeled name, which represents the attribute name and leads to a leaf node holding the correct name. We also see two arcs, each labeled address. These arcs lead to unnamed nodes, which we may think of as representing the two addresses of Carrie Fisher. Together, these arcs represent the set-valued attribute address, as in Figure (a) of "Representing Set-Valued Attributes".
Each of these addresses is a struct with fields street and city. Notice in Figure (a) that both address nodes have out-arcs labeled street and city, and that each of these arcs leads to a leaf node with the appropriate atomic value.
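The attribute role of arcs in this example can be sketched as follows. The helper names leaf, interior, follow and address are hypothetical, and the street and city strings are placeholder values, since the text does not state what the leaves in the figure hold. The point of the sketch is that a set-valued attribute is nothing more than several arcs with the same label leaving one node.

```python
def leaf(value):
    # A leaf carries an atomic value and has no arcs out.
    return {"value": value, "arcs": []}


def interior(*arcs):
    # An interior node carries no value, only (label, target) arcs.
    return {"value": None, "arcs": list(arcs)}


def follow(node, label):
    # All nodes reached from `node` along arcs with the given label.
    return [target for arc_label, target in node["arcs"] if arc_label == label]


def address(street, city):
    # An address is a struct: an interior node with two field arcs.
    return interior(("street", leaf(street)), ("city", leaf(city)))


# Node cf: one name arc and two address arcs (placeholder address values).
cf = interior(
    ("name", leaf("Carrie Fisher")),
    ("address", address("Maple", "Hollywood")),
    ("address", address("Locust", "Malibu")),
)

# Following the repeated label yields the whole set of addresses.
cities = [follow(a, "city")[0]["value"] for a in follow(cf, "address")]
print(cities)  # ['Hollywood', 'Malibu']
```

Here follow returns a list rather than a single node precisely because nothing in the model limits how many arcs with a given label may leave a node.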

The other kind of arc also appears in Figure (a). For example, the node cf has an out-arc leading to the node sw and labeled starsIn. The node mh (for Mark Hamill) has a similar arc, and the node sw has arcs labeled starOf to both nodes cf and mh. These arcs represent the stars-in relationship between stars and movies.
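The relationship arcs just described can be written down as an edge list. The names cf, mh and sw are used as node identifiers only for convenience, as in the figure, and the helpers targets and are_inverses are hypothetical. The check that starsIn and starOf mirror each other is an illustration, not something the text says the model enforces.

```python
# Each arc is (source node, label, target node).
arcs = [
    ("cf", "starsIn", "sw"),
    ("mh", "starsIn", "sw"),
    ("sw", "starOf", "cf"),
    ("sw", "starOf", "mh"),
]


def targets(source, label):
    # Nodes reached from `source` along arcs with the given label.
    return [t for s, l, t in arcs if s == source and l == label]


def are_inverses(label, inverse_label):
    # True when every arc N -> M with `label` is matched by an arc
    # M -> N with `inverse_label`, and vice versa.
    forward = {(s, t) for s, l, t in arcs if l == label}
    backward = {(t, s) for s, l, t in arcs if l == inverse_label}
    return forward == backward


print(targets("sw", "starOf"))            # ['cf', 'mh']
print(are_inverses("starsIn", "starOf"))  # True
```

Note that these four arcs form cycles (cf to sw and back again), which is one reason the graph of a semistructured database is not necessarily a tree.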