Grouping

Grouping

Sometimes we do not want simply the average or some other aggregation of an entire column. Rather, we need to examine the tuples of a relation in groups, equivalent to the value of one or more other columns, and we aggregate only within each group. As an example, assume we wanted to compute the total number of minutes of movies produced by each studio, i.e.. a relation such as:


Starting with the relation

Movie(title, year, length, inColor, studioName, producerC#)

from our example database schema of "Relational Algebra", we must group the tuples according to their value for attribute studioName. We must then sum the length column within each group. That is, we imagine that the tuples of Movie are grouped as suggested in Figure 1, and we apply the aggregation SUM(length) to each group separately.

A relation with imaginary division into groups

The Grouping Operator         

Lets now introduce an operator that allows us to group a relation and/or aggregate some columns. If there is grouping, then the aggregation is within groups.

The subscript used with the γ operator is a list L of elements, each of which is either:

a)  An attribute of the relation R to which the γ is applied; this attribute is one of the attributes by which R will be grouped. This element is said to be a grouping attribute.

b)  An aggregation operator applied to an attribute of the relation. To provide a name for the attribute corresponding to this aggregation in the result, an arrow and new name are appended to the aggregation. The underlying attribute is said to be an aggregated attribute.

The relation returned by the expression γL(R) is constructed as follows;

1. Partition the tuples of R into groups. Each group consists of all tuples having one particular assignment of values to the grouping attributes in the list L. If there are no grouping attributes, the entire relation R is one group.

2. For each group, produce one tuple consisting of:


i. The grouping attributes values for that group and
ii. The aggregations, over all tuples of that group, for the aggregated attributes on list L.

Example 1 : Assume we have the relation

StarsIn(title, year, starName)

and we wish to find, for each star who has appeared in at least three movies, the earliest year in which they appeared. The first step is to group: using starName as a grouping attribute. We clearly must compute for each group the MIN(year) aggregate. On the other hand, in order to decide which groups satisfy the condition that the star appears in at least three movies, we must also compute the COUNT(title) aggregate for each group.

We begin with the grouping expression

The first two columns of the result of this expression are required for the query result. The third column is an auxiliary attribute, which we have named ctTitle: it is required to find out whether a star has appeared in at least three movies. That is, we continue the algebraic expression for the query by selecting for ctTitle >= 3 and then projecting onto the first two columns. An expression tree for the query is shown in Figure 2.
Algebraic expression tree for the SQL query of Example 1


Tags