SQL Tip: Windowing Functions

,

A simple SELECT returns all relevant rows from the database. GROUP BY applies aggregate functions, condensing SELECTed data into one summary row per grouping. What if mixing the two approaches is desired—what if data rows need to be combined with aggregate totals in the same result set?

T-SQL’s OVER clause allows several kinds of functions to be applied to non-GROUPed BY rows. Using OVER, a simple SELECT returning data rows can be expanded to include columns containing summary statistics. Traditionally, aggregate functions can only be used in conjunction with GROUP BY or to produce a single row summary result set; when used with OVER, these limitations are removed.

Suppose you’ve been asked to help your child’s high school athletic department by pulling relay race statistics. In the data table you’ve been given, each row records a runner’s time in a particular race. Also included is the team name and the position (in relay race terms, the leg) the runner ran. In the simplistic world of this data set, each runner’s name is guaranteed to be unique across all teams.

Row-Relative Aggregates

The first request you’ve been given is to show the runner’s overall average time next to each per-race time in the result set. This will allow athletic staff to compare how a runner ran in a particular race with how that runner usually performs. “SELECT Runner, AVG(Time) FROM RaceTime GROUP BY Runner” produces the desired average statistics but doesn’t output them side-by-side with per-race data.

Instead of using this traditional approach, let’s try an aggregate function combined with OVER:

SELECT *
	, RunnerOverallAverage = AVG(Time) OVER (PARTITION BY Runner)
FROM RaceTime

OVER defines the set of rows—known as a window—to which the associated function will be applied. When the average function is computed for each row, the function “looks” through the window defined by OVER. Only the rows that the window allows it to “see” are included in its computation.

PARTITION BY Runner divides the result set into groups based on the runner’s name and then sets OVER’s window so that only those rows in the current partition—only those rows having the same runner as the current row—can be seen. The average is then computed using these rows. The end result is the current runner’s overall race time average outputted as a column appended to the current row in the result set.

This windowing only applies to the function associated with the OVER clause; it does not limit, group or change the data returned by other parts of the query. Multiple functions can be used with different OVER clauses to compute statistics using different windows. For example:

SELECT *
	, RunnerOverallAverage = AVG(Time) OVER (PARTITION BY Runner)
	, TeamRaceTotal = SUM(Time) OVER (PARTITION BY Race, Team)
FROM RaceTime
ORDER BY Race, Team

Running Totals

Another statistical request you’ve been given is to provide cumulative elapsed time for each runner for each team-race combination. Elapsed time for a runner is the time from the race’s start to when the given runner finishes. For example, elapsed time for a team’s second runner is first runner’s time plus second runner’s time.

To calculate cumulative time, we somehow need to take the window used for team-race total time (shown in the SQL example above) and shrink it so that only rows up through the current runner’s Leg are seen by the aggregate function. That is, when the total for the first runner (the runner with Leg = 1) is computed, sum should only see the first runner’s row. Then, when the total for the second runner is computed, sum’s window should “slide open” to include the Leg = 2 row in addition to the previous runner’s row, and so forth.

Adding ORDER BY Leg to OVER’s clause accomplishes this:

SELECT *
	, CulumativeTeamRaceTime = SUM(TIME) OVER (PARTITION BY Race, Team ORDER BY Leg)
FROM RaceTime
ORDER BY Race, Team, Leg

OVER‘s ORDER BY sorts the current partition and changes OVER’s window from a fixed window containing all rows in the partition to a sliding window containing all rows in the partition ordered prior to or the same as to the current row. This ordering is based only on OVER‘s ORDER BY clause; the overall query’s ORDER BY has no effect.

In this case, the window size changes based on the current row’s value of Leg: the window is opened only wide enough for sum to see those rows where Leg is less than or equal to the current row’s Leg. Sum for the runner with Leg = 1 will see all rows in the current partition with a Leg of 1 or less—in this case, just the first runner; sum for the runner with Leg = 2 will see all rows in the current partition with a Leg of 2 or less—in this case, the first and second runners—and so forth.

Moving Averages

For even more elaborate window control, some functions allow OVER’s ORDER BY to be combined with a ROWS or RANGE clause that modifies the number of rows before or after the current row’s position included in OVER’s window.

RunnerOverallAverage (from our first example) can be adapted to compute a two race moving average with the addition of a ROWS modifier:

SELECT *
	, RunnerOverallAverage = AVG(Time) OVER (PARTITION BY Runner)
	, RunnerTwoRaceAverage = AVG(Time) OVER (PARTITION BY Runner ORDER BY Race ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
FROM RaceTime
ORDER BY Runner, Race

Notes

T-SQL’s OVER clause can be used with aggregate, analytic and ranking functions and NEXT VALUE FOR. Most of these functions allow optional use of PARTITION BY. When omitted, the entire row set is treated as a single partition. Some functions require that OVER contain an ORDER BY clause; some do not allow ROWS or RANGE to modify this clause.

Don’t confuse OVER’s partitioning with partitioned tables and indexes—the two concepts are unrelated. OVER is a feature built into T-SQL; partitioned tables and indexes are a Microsoft SQL Server Enterprise Edition feature.

Leave a Reply

Your email address will not be published. Required fields are marked *