Database System Implementation

Database System Implementation

The subject of database system implementation can be divided roughly into three sections:

1. Storage management: how secondary storage is used effectively to hold data and allow it to be accessed quickly.

2. Query processing: how queries expressed in a very high-level language such as SQL can be executed efficiently.

3. Transaction management: how to support transactions with the ACID properties discussed in "Transaction Processing"

Each of these topics is covered by several articles of this blog.

Storage-Management Overview

"Data Storage" introduces the memory hierarchy. However, since secondary storage, especially disk, is so central to the way a DBMS manages data, we observe in the greatest detail the way data is stored and accessed on disk. The "block model" for disk-based data is introduced; it influences the way almost everything is done in a database system.

"Representing Data Elements" relates the storage of data elements - relations, tuples, attribute-values, and their equivalents in other data models - to the requirements of the block model of data. Then we look at the important data structures that are used for the construction of indexes. Recall that an index is a data structure that supports efficient access to data. "Index Structures" covers the important one-dimensional index structures - indexed-sequential files, B-trees, and hash tables. These indexes are commonly used in a DBMS to support queries in which a value for an attribute is given and the tuples with that value are desired. B-trees also are used for access to a relation sorted by a given attribute. "Multidimensional and Bitmap Indexes" discusses multidimensional indexes, which are data structures for specialized applications such as geographic databases, where queries usually ask for the contents of some region. These index structures can also support complex SQL queries that limit the values of two or more attributes, and some of these structures are beginning to appear in commercial DBMS's.

Query-Processing Overview

"Query Execution" covers the basics of query execution. We learn a number of algorithms for efficient execution of the operations of relational algebra. These algorithms are designed to be efficient when data is stored on disk and are in some cases rather different from analogous main-memory algorithms.

In "The Query Compiler" we consider the architecture of the query compiler and optimizer. We begin with the parsing of queries and their semantic checking. Next, we consider the conversion of queries from SQL to relational algebra and the selection of a logical query plan, that is, an algebraic expression that represents the particular operations to be performed on data and the necessary constraints regarding order of operations. Finally, we discover the selection of a physical query plan, in which the particular order of operations and the algorithm used to implement each operation have been specified.

Transaction-Processing Overview

In "Coping With System Failures" we see how a DBMS supports durability of transactions. The central idea is that a log of all changes to the database is made. Anything that is in main-memory but not on disk can be lost in a crash (say, if the power supply is interrupted). Therefore we have to be careful to move from buffer to disk, in the proper order, both the database changes themselves and the log of what changes were made. There are several log strategies available, but each limits our freedom of action in some ways.

Then, we take up the matter of concurrency control - assuring atomicity and isolation - in "Concurrency Control". We view transactions as sequences of operations that read or write database elements. The major topic of this portion is how to manage locks on database elements: the different types of locks that may be used, and the ways that transactions may be allowed to obtain locks and release their locks on elements. Also studied are a number of ways to assure atomicity and isolation without using locks.

"More About Transaction Management" concludes our study of transaction processing. We consider the interaction between the requirements of logging, as discussed in "Coping With System Failures", and the requirements of concurrency that were discussed in "Concurrency Control". Handling of deadlocks, another important function of the transaction manager is covered here as well. The extension of concurrency control to a distributed environment is also considered in "More About Transaction Management". Finally, we introduce the possibility that transactions are "long", taking hours or days rather than milliseconds. A long transaction cannot lock data without causing chaos among other potential users of that data, which forces us to rethink concurrency control for applications that involve long transactions.