*Introduction to Selection of Indexes*

Selection of indexes requires a trade-off by the database designer, and in fact, this choice is one of the principal factors that influence whether a database design is acceptable. Two important factors to examine are:

● The existence of an index on an attribute greatly speeds up queries in which a value for that attribute is specified, and in some cases can speed up joins involving that attribute as well.

● However, every index built for an attribute of some relation makes insertions, deletions, and updates to that relation more difficult and time-consuming.

Index selection is one of the most difficult parts of database design, since it requires estimating what the typical mix of queries and other operations on the database will be. If a relation is queried much more frequently than it is modified, then indexes on the attributes that are most frequently specified in queries make sense. Indexes are useful for attributes that tend to be compared with constants in WHERE clauses of queries, but indexes also are useful for attributes that appear frequently in join conditions.**Example 1 :** Remember "Queries Involving More Than One Relation" Figure 1 where we suggested an exhaustive pairing of tuples to compute a join. An index on Movie.title would help us find the Movie tuple for Star Wars quickly and then after finding its producer-certificate-number, an index on MovieExec.cert# would help us quickly find that person in the MovieExec relation.

If modifications are the predominant action, then we should be very conservative about creating indexes. Even then, it may be an efficiency gain to create an index on a frequently used attribute. In fact, since some modification commands involve querying the database (e.g.. an INSERT with a select-from-where subquery or a DELETE with a condition) one must be very careful how one estimates the relative frequency of modifications and queries.

We do not yet have the details - how data is typically stored and how indexes are implemented - that are required to see the complete picture. On the other hand, we can see part of the problem in the following example. We should be aware that the typical relation is stored over many disk blocks, and the principal cost of a query or modification is often the number of disk blocks that need to be brought to main memory. Therefore, indexes that let us find a tuple without examining the entire relation can save a lot of time. On the other hand, the indexes themselves have to be stored, at least partially, on disk, so accessing and modifying the indexes themselves cost disk accesses. In fact, modification, since it requires one disk access to read a block and another disk access to write the changed block, is about twice as expensive as accessing the index or the data in a query.**Example 2 :** Let us look at the relation**StarsIn(movieTitle, movieYear, starName)**

Assume that there are three database operations that we sometimes perform on this relation:

Q_{1}: We look for the title and year of movies in which a given star appeared. That is, we execute a query of the form:**SELECT movieTitle, movieYearFROM StarsInWHERE starName = s ;**

for some constant s.

Q

_{2}: We look for the stars that appeared in a given movie. That is, we execute a query of the form:

**SELECT starName**

FROM StarsIn

WHERE movieTitle = t AND movieYear = y;

FROM StarsIn

WHERE movieTitle = t AND movieYear = y;

for constants t and y.

I: We insert a new tuple into StarsIn. That is, we execute an insertion of the form:

**INSERT INTO StarsIn VALUES(t, y, s) ;**

for constants t, y, and s.

Let us make the following assumptions about the data:

1. StarsIn is stored in 10 disk blocks, so if we need to examine the entire relation the cost is 10.

2. On the average, a star has appeared in 3 movies and a movie has 3 stars.

3. Since the tuples for a given star or a given movie are likely to be spread over the 10 disk blocks of StarsIn, even if we have an index on starName or on the combination of movieTitle and movieYear, it will take 3 disk accesses to find the (average of) 3 tuples for a star or movie. If we have no index on the star or movie, respectively, then 10 disk accesses are required.

4. One disk access is required to read a block of the index every time we use that index to locate tuples with a given value for the indexed attribute(s). If the index block must be modified (in the case of an insertion), then another disk access is required to write back the modified block.

5. Similarly, in the case of an insertion, one disk access is required to read a block on which the new tuple will be placed, and another disk access is required to write back this block. We assume that, even without an index, we can find some block on which an additional tuple will fit without scanning the entire relation.

Figure 1 gives the costs of each of the three operations: Q

_{1}(query given a star), Q

_{2}(query given a movie), and I (insertion). If there is no index, then we must scan the entire relation for Q

_{1}or Q

_{2}(cost 10), while an insertion requires merely that we access a block with free space and rewrite it with the new tuple (cost of 2, since we assume that block can be found without an index). These observations explain the column labeled "No Index".

If there is an index on stars only, then Q

_{2}still requires a scan of the entire relation (cost 10). On the other hand, Q

_{1}can be answered by accessing one index block to find the three tuples for a given star and then making three more accesses to find those tuples. Insertion I requires that we read and write both a disk block for the index and a disk block for the data, for a total of 4 disk accesses.

The case where there is an index on movies only is symmetric to the case for stars only. Lastly, if there are indexes on both stars and movies, then it takes 4 disk accesses to answer either Q

_{1}or Q

_{2}. On the other hand, insertion I requires that we read and write two index blocks as well as a data block, for a total of 6 disk accesses. That observation explains the last column in Figure 1.

The final row in Figure 1 gives the average cost of an action, on the assumption that the fraction of the time we do Q

_{1}is p

_{1}and the fraction of the time we do Q

_{2}is p

_{2}: therefore, the fraction of the time we do I is 1 - p

_{1}- p

_{2}.

Depending on p

_{1}and p

_{2}, any of the four choices of index / no index can yield the best average cost for the three actions. For instance, if p

_{1}= p

_{2}= 0.1, then the expression 2 + 8p

_{1}+ 8p

_{2}is the smallest, so we would prefer not to create any indexes. That is, if we are doing mostly insertion, and very few queries, then we don't want an index. However, if p

_{1}= p

_{2}= 0.4, then the formula 6 - 2p

_{1}- 2p

_{2}turns out to be the smallest, so we would prefer indexes on both starName and on the (movieTitle, movieYear) combination. Intuitively, if we are doing a lot of queries, and the number of queries specifying movies and stars are roughly equally frequent, then both indexes are desired.

If we have p

_{1}= 0.5 and p

_{2}= 0.1, then it turns out that an index on stars only gives the best average value, because 4 + 6p

_{2}is the formula with the smallest value. Similarly, p

_{1}= 0.1 and p

_{2}= 0.5 tells us to create an index on only movies. The intuition is that if only one type of query is frequent, create only the index that helps that type of query.

### Tags

- attributes
- database
- tuples
- index
- Scrolling Cursors
- Protecting Against Concurrent Updates
- Modifications by Cursor
- Cursors
- Instead-Of Triggers
- Triggers in SQL
- Schema-Level Constraints and Triggers
- Tuple-Based CHECK Constraints
- Constraints on Attributes and Tuples
- Declaring Foreign-Key Constraints
- Keys Declared With UNIQUE
- Constraints and Triggers
- Modifying Views
- View Definitions
- Default Values / Indexes
- Simple Table Declarations
- Defining a Relation Schema in SQL
- Deletion / Updates
- Database Modifications
- Grouping / HAVING Clauses
- Full-Relation Operations
- Natural Joins / Outerjoins
- Subqueries in FROM Clauses
- Conditions Involving Tuples
- Subqueries
- Union, Intersection, and Difference of Queries
- Interpreting Multirelation Queries
- Tuple Variables
- Disambiguating Attributes
- Queries Involving More Than One Relation
- Null Values and Comparisons Involving NULL
- Selection in SQL
- Projection in SQL
- The Database Language SQL
- Additional Constraint Examples
- Extending the Projection Operator
- Grouping
- Extended Operators of Relational Algebra
- Selection on Bags / Product of Bags / Joins of Bags
- Union, Intersection, and Difference of Bags
- Relational Operations on Bags
- A Linear Notation for Algebraic Expressions
- Dependent and Independent Operations
- Renaming
- Combining Operations to Form Queries
- Selection / Cartesian Product
- Set Operations on Relations
- An Algebra of Relational Operations
- Relational Algebra
- Attribute Lists
- Information Integration Via Semistructured Data
- Semistructured Data Representation
- Object-Oriented Versus Object-Relational
- Nested Relations
- The Object-Relational Model
- What If There Is No Key
- Representing ODL Relationships
- Representing Other Type Constructors
- Representing Set-Valued Attributes
- Nonatomic Attributes in Classes
- Declaring Keys in ODL
- Extents
- Subclasses in ODL / Multiple Inheritance in ODL
- Types in ODL
- Methods in ODL
- Multiplicity of Relationships
- Relationships in ODL / Inverse Relationships
- Attributes in ODL
- Introduction to ODL
- The Type System
- Review of Object-Oriented Concepts
- Decomposition into Fourth Normal Form
- Reasoning About Multivalued Dependencies
- Definition of Multivalued Dependencies
- Multivalued Dependencies
- Third Normal Form
- Boyce-Codd Normal Form
- Decomposing Relations
- Projecting Functional Dependencies
- Closing Sets of Functional Dependencies
- The Transitive Rule
- Why the Closure Algorithm Works
- Computing the Closure of Attributes
- Trivial Functional Dependencies
- The Splitting/Combining Rule
- Rules About Functional Dependencies
- Keys of Relations
- Using Null Values to Combine Relations - Comparison of Approaches
- An Object-Oriented Approach
- Converting Subclass Structures to Relations
- Handling Weak Entity Sets
- Combining Relations
- From E/R Relationships to Relations
- From Entity Sets to Relations
- From E/R Diagrams to Relational Designs
- Relation Instances
- Equivalent Representations of a Relation
- Tuples / Domains
- Attributes / Schemas
- The Relational Data Model
- Summary of The Entity-Relationship Data Model
- Weak Entity Set Notation
- Requirements for Weak Entity Sets
- Representing Keys in the E/R Model
- Keys in the E/R Model
- The Modeling of Constraints
- Picking the Right Kind of Element
- Choosing the Right Relationships
- Design Principles
- Subclasses in the E/R Model
- Converting Multiway Relationships to Binary
- Attributes on Relationships
- Multiway Relationships
- Elements of the E/R Model
- Information Integration Overview
- Database System Implementation
- Database Design
- Storage and Buffer Management
- Data-Definition Language Commands
- Multimedia Data
- Smaller and Smaller Systems
- Relational Database Systems
- Corporate Records
- Early Database Management Systems
- The Evolution of Database Systems
- The Worlds of Database Systems