Full-Relation Operations

Full-Relation Operations

In this section we shall look at some operations that act on relations as a whole, rather than on tuples individually or in small numbers (as do joins of various relations, for example). First, we deal with the fact that SQL uses relations that are bags rather than sets, and a tuple can appear more than once in a relation. We shall see how to force the result of an operation to be a set in "Eliminating Duplicates", and in "Duplicates in Unions, Intersections, and Differences" we shall see that it is also possible to prevent the elimination of duplicates in circumstances where SQL systems would usually eliminate them.

Then, we discuss how SQL supports the grouping and aggregation operator  that we introduced in "Grouping". SQL has aggregation operators and a GROUP-BY clause. There is also a "HAVING" clause that allows selection of certain groups in a way that depends on the group as a whole, rather than on individual tuples.

Eliminating Duplicates

As mentioned in "Correlated Subqueries", SQL's notion of relations differs from the abstract notion of relations presented in "The Relational Data Model". A relation, being a set, cannot have more than one copy of any given tuple. When an SQL query creates a new relation, the SQL system does not ordinarily eliminate duplicates. Thus, the SQL response to a query may list the same tuple many times.

Recall from "Interpreting Multirelation Queries" that one of many equivalent definitions of the meaning of an SQL select-from-where query is that we begin with the Cartesian product of the relations referred to in the FROM clause. Each tuple of the product is tested by the condition in the WHERE clause, and the ones that pass the test are given to the output for projection according to the SELECT clause. This projection may cause the same tuple to result from different tuples of the product, and if so, each copy of the resulting tuple is printed in its turn. In addition, since there is nothing wrong with an SQL relation having duplicates, the relations from which the Cartesian product is formed may have duplicates, and each identical copy is paired with the tuples from the other relations, yielding a proliferation of duplicates in the product.

If we do not wish duplicates in the result, then we may follow the keyword SELECT by the keyword DISTINCT. That word tells SQL to produce only one copy of any tuple and is the SQL analog of applying the  operator of "Extended Operators of Relational Algebra" to the result of the query.

Example 1 : Let us reconsider the query of "Conditions Involving Tuples" Figure 3, where we asked for the producers of Harrison Ford's movies using no  subqueries. As written, George Lucas will appear several times in the output. If we want only to see each producer once, we may change line (1) of the query to

1)    SELECT DISTINCT name

The Cost of Duplicate Elimination

Then, the list of producers will have duplicate occurrences of names eliminated before printing.

Incidentally, the query of "Conditions Involving Tuples" Figure 1, where we used subqueries, does not necessarily suffer from the problem of duplicate answers. True, the subquery at line (4) of "Conditions Involving Tuples" Figure 1 will produce the certificate number of George Lucas several times. However, in the "main" query of line (1), we look at each tuple of MovieExec once. Most probably, there is only one tuple for George Lucas in that relation, and if so, it is only this tuple that satisfies the WHERE clause of line (3). In this way, George Lucas is printed only once.

Duplicates in Unions, Intersections, and Differences

Unlike the SELECT statement, which preserves duplicates as a default and only eliminates them when instructed to by the DISTINCT keyword, the union. inter-section, and difference operations, which we introduced in "Union, Intersection, and Difference of Queries", usually eliminate duplicates. That is, bags are converted to sets, and the set version of the operation is applied. In order to prevent the elimination of duplicates, we must follow the operator UNION, INTERSECT, or EXCEPT by the keyword ALL. If we do, then we get the bag semantics of these operators as was discussed in "Union, Intersection, and Difference of Bags".

Example 2 : Consider again the union expression from "Union, Intersection, and Difference of Queries" Example 3, but now add the keyword ALL, as:

(SELECT title, year FROM Movie)
             UNION ALL
(SELECT movieTitle AS title, movieYear AS year FROM StarsIn);

Now, a title and year will appear as many times in the result as it appears in each of the relations Movie and StarsIn put together. For example, if a movie appeared once in the Movie relation and there were three stars for that movie listed in StarsIn  (so the movie appeared in three different tuples of StarsIn), then that movie's title and year would appear four times in the result of the union.

As for union, the operators INTERSECT ALL and EXCEPT ALL are intersection and difference of bags. Thus, if R and S are relations, then the result of expression.

R INTERSECT ALL S

is the relation in which the number of times a tuple t appears is the minimum of the number of times it appears in R and the number of times it appears in S.

The result of expression

R EXCEPT ALL S

has tuple t as many times as the difference of the number of times it appears in R minus the number of times it appears in S,  provided the difference is positive. Each of these definitions is what we discussed for bags in "Union, Intersection, and Difference of Bags".

Grouping and Aggregation in SQL

In "Grouping", we introduced the grouping-and-aggregation operator y for our extended relational algebra. Remember that this operator allows us to partition the tuples of a relation into "groups", based on the values of tuples in one or more attributes, as discussed in "Grouping". We are then able to aggregate certain other columns of the relation by applying "aggregation" operators to those columns. If there are groups, then the aggregation is done separately for each group. SQL provides all the capability of the operator through the use of aggregation operators in SELECT clauses and a special GROUP BY clause.

Aggregation Operators

SQL uses the five aggregation operators SUM, AVG, MIN, MAX and COUNT that we met in "Extended Operators of Relational Algebra". These operators are used by applying them to a scalar-valued expression, usually a column name, in a SELECT clause. One exception is the expression COUNT(*), which counts all the tuples in the relation that is constructed from the FROM clause and WHERE clause of the query.

Further, we have the option of eliminating duplicates from the column before applying the aggregation operator by using the keyword DISTINCT. That is, an expression such as COUNT(DISTINCT x) counts the number of distinct values in column x. We could use any of the other operators in place of COUNT here, but expressions such as SUM(DISTINCT x) rarely make sense,  since it asks us to sum the different values in column x.

Example 3 : The following query finds the average net worth of all movie executives:

SELECT AVG(netWorth)
FROM MovieExec;

Note that there is no WHERE clause at all, so the keyword WHERE is properly omitted. This query examines the netWorth column of the relation

MovieExec(name, address, cert#, netWorth)

sums the values found there, one value for each tuple (even if the tuple is a duplicate of some other tuple), and divides the sum by the number of tuples. If there are no duplicate tuples, then this query gives the average net worth as we expect. If there were  duplicate tuples, then a movie executive whose tuple appeared n times would have his or her net worth counted n times in the average.

Example 4 : The following query:

SELECT COUNT (*)
FROM StarsIn;

counts the number of tuples in the StarsIn relation. The similar query:

SELECT COUNT(starName)
FROM StarsIn;

counts the number of values in the starName column of the relation. Since, duplicate values are not eliminated when we project onto the starName column in SQL, this count should be the same as the count produced by the query with COUNT (*).

If we want to be certain that we do not count duplicate values more than once, we can use the keyword DISTINCT before the aggregated attribute, as:

SELECT COUNT(DISTINCT starName)
FROM StarsIn;

Now each star is counted once, no matter in how many movies they appeared.

Tags