Grouping / HAVING Clauses

Grouping / HAVING Clauses

To group tuples, we use a GROUP BY clause, following the WHERE clause. The keywords GROUP BY are followed by a list of grouping attributes. In the simplest situation, there is only one relation reference in the FROM clause, and this relation has its tuples grouped according to their values in the grouping attributes. Whatever aggregation operators are used in the SELECT clause are applied only within groups.

Example 1 : The problem of finding, from the relation

Movie(title, year, length, inColor, studioName, producerC#)

the sum of the lengths of all movies for each studio is expressed by

SELECT studioName, SUM(length)
FROM Movie
GROUP BY studioName;

We may imagine that the tuples of relation Movie are reorganized and grouped so that all the tuples for Disney studios are together, all those for MGM are together, and so on, as was suggested in "Grouping" Figure 1. The sums of the length components of all the tuples in each group are calculated, and for each group, the studio name is printed along with that sum.

Observe in Example 1 how the SELECT clause has two kinds of terms.

1. Aggregations, where an aggregate operator is applied to an attribute or expression involving attributes. As mentioned, these terms are evaluated on a per-group basis.

2. Attributes, such as studioName in this example, that appear in the GROUP BY clause. In a SELECT clause that has aggregations, only those attributes that are mentioned in the GROUP BY clause may appear unaggregated in the SELECT clause.

While queries involving GROUP BY usually have both grouping attributes and aggregations in the SELECT clause, it is  technically not necessary to have both. For instance, we could write

SELECT studioName
FROM Movie
GROUP BY studioName;

This query would group the tuples of Movie according to their studio name and then print the studio name for each group, no  matter how many tuples there are with a given studio name. Thus, the above query has the same effect as

FROM Movie;

It is also possible to use a GROUP BY clause in a query about various relations. Such a query is interpreted by the following sequence of steps:

1. Evaluate the relation R expressed by the FROM and WHERE clauses. That is, relation R is the Cartesian product of the relations mentioned in the FROM clause, to which the selection of the WHERE clause is applied.

2. Group the tuples of R according to the attributes in the GROUP BY clause.

3. Produce as a result the attributes and aggregations of the SELECT clause, as if the query were about a stored relation R.

Example 2 : Assume we wish to print a table listing each producer's total length of film produced. We need to get information from the two relations

Movie(title, year, length, inColor, studioName, producerC#)
MovieExec(name, address, cert#, netWorth)

so we begin by taking their theta-join, equating the certificate numbers from the two relations. That step gives us a relation in which each MovieExec tuple is paired with the Movie tuples for all the movies of that producer. Note that an executive who is not a producer will not be paired with any movies, and therefore will not appear in the relation. Now, we can group the selected tuples of this relation according to the name of the producer. Eventually, we sum the lengths of the movies in each group. The query is shown in Figure 1.

Computing the length of movies for each producer

HAVING Clauses
Assume that we did not wish to include all of the producers in our table of Example 2. We could restrict the tuples prior to grouping in a way that would make undesired groups empty. For example, if we only wanted the total length of movies for producers with a net worth of more than $10,000,000, we could change the third line of Figure 1 to

WHERE producerC# = cert# AND networth > 1OOOOOOO

On the other hand, sometimes we want to choose our groups based on some aggregate property of the group itself. Then we follow the GROUP BY clause with a HAVING clause. The latter clause consists of the keyword HAVING followed by a condition about the group.

Example 3 : Assume we want to print the total film length for only those producers who made at least one film prior to 1930. We may append to Figure 1 the clause

Grouping, Aggregation, and Nulls

HAVING MIN(year) < 1930

The resulting query, shown in Figure 2, would remove from the grouped relation all those groups in which every tuple had a year component 1930 or higher.

Computing the total length of film for early producers

There are many rules we must remember about HAVING clauses:

●  An aggregation in a HAVING clause applies only to the tuples of the group being tested.

●  Any attribute of relations in the FROM clause may be aggregated in the HAVING clause, but only those attributes that are in  the GROUP BY list may appear unaggregated in the HAVING clause (the same rule as for the SELECT clause).

Order of Clauses in SQL Queries