Introduction to Selection of Indexes

Introduction to Selection of Indexes

Selection of indexes requires a trade-off by the database designer, and in fact, this choice is one of the principal factors that influence whether a database design is acceptable. Two important factors to examine are:

●  The existence of an index on an attribute greatly speeds up queries in which a value for that attribute is specified, and in some cases can speed up joins involving that attribute as well.

●  However, every index built for an attribute of some relation makes insertions, deletions, and updates to that relation more difficult and time-consuming.

Index selection is one of the most difficult parts of database design, since it requires estimating what the typical mix of queries and other operations on the database will be. If a relation is queried much more frequently than it is modified, then indexes on the attributes that are most frequently specified in queries make sense. Indexes are useful for attributes that tend to be compared with constants in WHERE clauses of queries, but indexes also are useful for attributes that appear frequently in join conditions.

Example 1 : Remember "Queries Involving More Than One Relation" Figure 1 where we suggested an exhaustive pairing of tuples to compute a join. An index on Movie.title would help us find the Movie tuple for Star Wars quickly and then after finding its producer-certificate-number, an index on MovieExec.cert# would help us quickly find that person in the MovieExec relation.

If modifications are the predominant action, then we should be very conservative about creating indexes. Even then, it may be an efficiency gain to create an index on a frequently used attribute. In fact, since some modification commands involve querying the database (e.g.. an INSERT with a select-from-where subquery or a DELETE with a condition) one must be very careful how one estimates the relative frequency of modifications and queries.

We do not yet have the details - how data is typically stored and how indexes are implemented - that are required to see the complete picture. On the other hand, we can see part of the problem in the following example. We should be aware that the typical relation is stored over many disk blocks, and the principal cost of a query or modification is often the number of disk blocks that need to be brought to main memory. Therefore, indexes that let us find a tuple without examining the entire relation can save a lot of time. On the other hand, the indexes themselves have to be stored, at least partially, on disk, so accessing and modifying the indexes themselves cost disk accesses. In fact, modification, since it requires one disk access to read a block and another disk access to write the changed block, is about twice as expensive as accessing the index or the data in a query.

Example 2 : Let us look at the relation

StarsIn(movieTitle, movieYear, starName)

Assume that there are three database operations that we sometimes perform on this relation:

Q1: We look for the title and year of movies in which a given star appeared. That is, we execute a query of the form:

SELECT movieTitle, movieYear
FROM StarsIn
WHERE starName = s ;

for some constant s.

Q2: We look for the stars that appeared in a given movie. That is, we execute a query of the form:

SELECT starName
FROM StarsIn
WHERE movieTitle = t AND movieYear = y;

for constants t and y.

I: We insert a new tuple into StarsIn. That is, we execute an insertion of the form:

INSERT INTO StarsIn VALUES(t, y, s) ;

for constants t, y, and s.

Let us make the following assumptions about the data:

1. StarsIn is stored in 10 disk blocks, so if we need to examine the entire relation the cost is 10.

2. On the average, a star has appeared in 3 movies and a movie has 3 stars.

3. Since the tuples for a given star or a given movie are likely to be spread over the 10 disk blocks of StarsIn, even if we have an index on starName or on the combination of movieTitle and movieYear, it will take 3 disk accesses to find the (average of) 3 tuples for a star or movie. If we have no index on the star or movie, respectively, then 10 disk accesses are required.

4. One disk access is required to read a block of the index every time we use that index to locate tuples with a given value for the indexed attribute(s). If the index block must be modified (in the case of an insertion), then another disk access is required to write back the modified block.

5. Similarly, in the case of an insertion, one disk access is required to read a block on which the new tuple will be placed, and another disk access is required to write back this block. We assume that, even without an index, we can find some block on which an additional tuple will fit without scanning the entire relation.

Costs associated with the three actions as a function of which indexes are selected

Figure 1 gives the costs of each of the three operations: Q1 (query given a star), Q2 (query given a movie), and I (insertion). If there is no index, then we must scan the entire relation for Q1 or Q2 (cost 10), while an insertion requires merely that we access a block with free space and rewrite it with the new tuple (cost of 2, since we assume that block can be found without an index). These observations explain the column labeled "No Index".

If there is an index on stars only, then Q2 still requires a scan of the entire relation (cost 10). On the other hand, Q1 can be answered by accessing one index block to find the three tuples for a given star and then making three more accesses to find those tuples. Insertion I requires that we read and write both a disk block for the index and a disk block for the data, for a total of 4 disk accesses.

The case where there is an index on movies only is symmetric to the case for stars only. Lastly, if there are indexes on both stars and movies, then it takes 4 disk accesses to answer either Q1 or Q2. On the other hand, insertion I requires that we read and write two index blocks as well as a data block, for a total of 6 disk accesses. That observation explains the last column in Figure 1.

The final row in Figure 1 gives the average cost of an action, on the assumption that the fraction of the time we do Q1 is p1 and the fraction of the time we do Q2 is p2: therefore, the fraction of the time we do I is 1 - p1 - p2.

Depending on p1 and p2, any of the four choices of index / no index can yield the best average cost for the three actions. For instance, if p1 = p2 = 0.1, then the expression 2 + 8p1 + 8p2 is the smallest, so we would prefer not to create any indexes. That is, if we are doing mostly insertion, and very few queries, then we don't want an index. However, if p1 = p2 = 0.4, then the formula 6 - 2p1 - 2p2 turns out to be the smallest, so we would prefer indexes on both starName and on the (movieTitle, movieYear) combination. Intuitively, if we are doing a lot of queries, and the number of queries specifying movies and stars are roughly equally frequent, then both indexes are desired.

If we have p1 = 0.5 and p2 = 0.1, then it turns out that an index on stars only gives the best average value, because 4 + 6p2 is the formula with the smallest value. Similarly, p1 = 0.1 and p2 = 0.5 tells us to create an index on only movies. The intuition is that if only one type of query is frequent, create only the index that helps that type of query.