SQL Training with Recursive Queries and Windowing Functions

INTRODUCTION

In this article I'm going to show some useful SQL using "Recursive Queries" and "Windowing Functions". So, the article is meant to intermediate level SQL programmers or those who want to know a little more about this language. Detailed explanations about syntax can be obtained in Microsoft documentation (Windowing | Recursivity). It's not the goal of this article to explain all that again. So, have a look there to go deeper into it.

BACKGROUND

I will use Northwind Database to build query examples. Click previous link to download or obtain more "how to" information about installation. Examples have been tested in Microsoft SQL Server 2012.

RECURSIVE QUERIES

One of the best ways to obtain recursivity with SQL is by using Common Table Expressions (CTEs). I already explained in more detail this kind of objects in Recursive SQL with CTEs post. The most important thing to take into account is to know that these expressions can be self-referencing and consequently recursivity can be obtained. That being said, the next relevant issue is to know how to build these kind of queries.

The most used basic structure to write Recursive SQL with CTEs is by means of two SELECT statement joined by a UNION ALL operator. The former SELECT is known as the "anchor member" whilst the latter is known as the "recursive member". Thus, you can guess that second SELECT query reference previous output results to add new rows to the next output and so on. Then, "recursive member" will be executed as many times as new rows can be obtained. At last, "recursive query" will be finished when no results are returned. Now, let's see an example using Northwind database. I'll build a query to show all "descendant" employees from a given employee to obtain all "children" list.

Our CTE will start by 1) Selecting Employees with EmployeeId equals to value of previously-defined variable @EmployeeID and 2) Selecting direct children who reports to @EmployeeId. This will be the FIRST iteration. The next one will look up for employees whose ManagerId has already been selected in prior result sets and so on until new results can't be obtained. On the other hand, Level field will show the hierarchical level of every employee in the company being "0" the highest one.  Every time Recursive Member is executed Level field will increment its value by one.

Copy Code
declare @EmployeeId int
set @EmployeeId=2;
with CTE as (
/* Anchor member */
select e.EmployeeID, e.FirstName + ' ' + e.LastName as Name, 
	ReportsTo as ManagerId, 
	0 as Level
from Employees e
where E.EmployeeID=@EmployeeId
UNION ALL
/* Recursive Member */
select e.EmployeeID, e.FirstName + ' ' + e.LastName as Name, 
	ReportsTo as ManagerId, 
	Level+1 as Level
from Employees e
join CTE on
	e.ReportsTo=CTE.EmployeeID)
select * from CTE
option(MAXRECURSION 500)

Here are the results:

image

In the example above, "Andrew Fuller" has no ManagerId, so he takes top 1 in the hierarchy with a 0 level value. Below him are "Nancy", "Janet", "Margaret", "Steven" and "Laura" who are direct "children" with "1" level value. At last, Michael, Robert and Annency take level "2" and have no others employees below them.

In addition to this, I would like to point out that number of iterations is limited to 100 by default. This setting prevents query from never-ending loops in case of not correct queries. If this limit is reached, an exception will be thrown. So, it's important to know that you can change this value using MAXRECURSION query hint and setting a value between 0 and 32767. Setting MAXRECURSION to 0 implies no limit in iterations. In the example I have set a value of 500 to this option. As I said before, this might be dangerous when dealing with incorrectly built iterative CTEs. So, take care about.

QUERIES USING WINDOWING FUNCTIONS

Windowing functions were introduced by Microsoft in SQL Server 2005. Briefly explained, these functions let you open a "window frame" for each particular row returned by the query "over" the final result set. This "window frame" is partitioned according to a group of fields as well as ordered within each partition in accordance with another group of fields. Then, it is possible to apply aggregatedranking or analytic functions over each partition. For each returned row in results, each partition is selected based on current row values for fields included in the configuration of that previously created partition.

Regarding basic syntax, OVER clause is used to open a "window frame", PARTITION BY clause is used to create partitions in the "window frame" and ORDER BY clause to order values within each partition. If PARTITION BY is not specified windowing functions will be applied to all rows returned by the query. Likewise, if ORDER BY is not specified "window frame" will not be ordered.

In addition to this, for more advanced programmers, ROWS or RANGE clauses can be optionally used to restrict even more the group of rows within the partition. ROWS clause filters rows based on fixed position forward and backward from current row being processed whilst RANGE function can filter rows in partitions according to logical range values. In this post we'll see a basic example with ROWS function, leaving detailed explanations for others articles. These clauses demand use of ORDER BY clause to order rows in partitions.

Here is the basic syntax for the OVER clause:

OVER (PARTITION BY [your fields]
      ORDER BY [your fields]
      ROWS|RANGE [your expression])

With windowing functions you can obtain information for your application very easily. In my case, these functions have been very handy when dealing with reports. They provide you a great choice for obtaining a vast range of different calculated fields. Hence, you should bear in mind these functions when outlining your queries.

The best way to learn is by means of examples, so let's get straight to the point.

Simple aggregations using SUM function

To begin with, I'll build a query to show the Orders list for each employee, showing destination Country and Freight Value. To apply "Windowing" functionality I'll show two new fields in the final result set, one of them to calculate the sum of Freight sold by every employee, and the other to show the sum of Freight sold by every Employee in every Country.

Here is the code:

Copy Code
/* Query example using WINDOWING 'sum' operator */
select OrderId, EmployeeID, ShipCountry, Freight,
	   SUM(Freight) OVER (PARTITION BY EmployeeID) as FreightByEmployee,
	   SUM(Freight) OVER (PARTITION BY EmployeeID, ShipCountry)  as FreightByEmployeeAndCountry
FROM Orders
order by EmployeeID, ShipCountry

Explanation:

For each current row in results, two new fields are obtained: FreightByEmployee and FreightByEmployeeAndCountry. The first one is calculated by opening a window frame OVER result set, creating PARTITION BY EmployeeIdand finally aggregating Freight field by using SUM function. In a similar way, FreightByEmployeeAndCountryfield is calculated by opening a new window frameOVER the result set, making PARTITION BY EmployeeId and ShipCountry and then applying Sum aggregated function for Freight field. As I said before, each partition is selected based on current row values for fields used to create partitions.

You can note that, there's no sense to use ORDER BY within the partition for the purposes of this example.

Here are the results:

image

For example, employee with EmployeeID=1 has sold products containing 577.09 freight units in all the countries. These values are shown in FreightByEmployee field. And, more in detail, he has sold products containing 250.93 freight units in Germany, 38.62 in USA, etc. These values are shown in FreightByEmployeeAndCountry field.

Calculating accumulated values with ROWS clause

In this example we are going to obtain a result set to see the Sales Amount for each employee per year, cumulative amount per year and total amount taking into account all the years. So, the result set to obtain would be:

image

It's important to note the information showed by CumulativeAmount field. It sums Amount field value from first row until current row for each customer. So, partition is even more restricted, from the first row to the current one. Thus, cumulative values can be calculated.

For example, "Antonio Moreno Taqueria" has 403.20$ in 1996, 6363.97$ accumulated between 1996 and 1997, and 7023.97$ taking into account from 1996 to 1998.

That being said, this is the code to obtain this result:

Copy Code
;with cte as (
select c.CompanyName,
	Year(orderDate) as [Period],
	cast(sum(round((1.00 - od.Discount) * od.UnitPrice * od.Quantity,2)) as money) as Amount
from [Orders] O
join Customers c on c.CustomerID=o.CustomerID
join [Order Details] od on od.OrderID=o.OrderID
group by CompanyName, Year(orderDate))
select CompanyName, [Period], Amount,
	   SUM(Amount) OVER (PARTITION BY CompanyName 
						   ORDER BY Period ASC 
						   ROWS UNBOUNDED PRECEDING) as CumulativeAmount,
	   SUM(Amount) OVER (PARTITION BY CompanyName) as TotalAmount
from cte
order by CompanyName asc, Period asc

As you can see above, the code is using ROWS clause with UNBOUNDED PRECEDING. This means that partition is restricted to take rows from first row to current one previously ordered by Period field in ascending order. I would like to point out that ORDER BY clause is mandatory when using ROWS or RANGE clauses. If not, there's no sense using these clauses.

Calculating ranking with RANK function

In this example, I will show how to obtain a ranking of employees based on 1) The sum of Freight sold and 2) The count of orders managed. This case, our query make use of RANKfunction to obtain results.

Here is the code:

Copy Code
/* Query example using WINDOWING 'rank' operator */
select EmployeeId, 
	   sum(Freight) as FreightTotal,
	   COUNT(*) as NumOrders,
	   RANK() OVER (ORDER BY SUM(Freight) DESC) as RankingBySumFreight,
	   RANK() OVER (ORDER BY COUNT(*) DESC) as RankingByNumOrders
FROM Orders
group by EmployeeID

Here are the results:

image

Explanation:

The query above lists information for all rows in Orders table showing EmployeeIDfield and  sum of Freight and count of orders for each employee using GROUP BY clause. Besides, two new fields are added using ranking functions. One of them based on SUM of Freight and the other according to COUNT of orders for employees. To calculate these values is very important to note the use of RANK function.

For each row, a "window frame" is opened OVER the total result set (there's no PARTITION BY clause), rows within the "window frame" are ORDERED BY FreightTotal for calculating RankingBySumFreight and NumOrders for RankingByNumOrders field. Finally values of ranking for the EmployeeID being processed in the current row are returned from the RANK function.

For example, employee with EmployeeId=6 takes 8th place according to RankingBySumFreight but 7th place in accordance with RankingByNumOrders.

Another point to mention would be that if two or more equal values are encountered while calculating rankings, same ranking value is assigned. Then, by using RANK function, there will be so many "gaps" for ranking values as number of identical values have been found. For diving into this, we'll have a quick look to differences between RANK and DENSE_RANK functions in the next example.

Comparing rankings with RANK and DENSE_RANK

As I said before, this example shows differences between RANK and DENSE_RANK windowing functions. In a nutshell, DENSE_RANK doesn't produce "gaps" while RANK does. Let's see it with an example. I will build a query to show the rank of employees according to count of orders for "Switzerland".  Thus, the difference will appear clearly.

Here is the code:

Copy Code
/* Query example using WINDOWING 'Rank' and 'Dense_rank' operator */
select EmployeeId, 
	   COUNT(*) as NumOrders,
	   RANK() OVER (ORDER BY COUNT(*) DESC) as RankingByNumOrdersWithRank,
	   DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) as RankingByNumOrdersWithDenseRank
FROM Orders
where Orders.ShipCountry='Switzerland'
group by EmployeeId

And here are the results:

image

Explanation:

As you can see above, employee with EmployeeID=4 has four orders being the person who has sold the most. For this, he is assigned a top one in ranking. Then, employees with EmployeeID in "3" and "7", who have three orders, are assigned a top two in ranking. So far, any differenceBut, what is happening from this point? Well, Employees with EmployeeId in "6", "9" and "1" have two orders but RANK sets a value of "4" in ranking whereas DENSE_RANK sets "3". Why? This is because RANK function is omitting value "3" in ranking whereas DENSE_RANK is not. Finally, It should be noted that for each repeated value in ranking, RANK generates as many "gaps" as same values are obtained minus one. This is the reason for value "7" for employees with EmployeeID in "5" and "8". It sets "7" in ranking because there are three employees with ranking "4". So, "5" and "6" ranking values are omitted.

Generating sequencies with ROW_NUMBER function

And what about ROW_NUMBER function? Which way is it related to all this? Let's see with another example based on previous one.

Here is the new code:

Copy Code
select EmployeeId, 
	   COUNT(*) as NumOrders,
	   RANK() over (ORDER BY COUNT(*) DESC) as RankingByNumOrdersWithRank,
	   DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) as RankingByNumOrdersWithDenseRank,
	   ROW_NUMBER() OVER (ORDER BY COUNT(*) DESC) as Row_Number
FROM Orders
where Orders.ShipCountry='Switzerland'
group by EmployeeId

And here are the results:

image

As you can see ROW_NUMBER function orders records according to ORDER BY clause within the previously-defined partitions and assign a UNIQUE sequential number for each row starting from 1.

Creating ranking groups with NTILE function

But, any other interesting ranking function? Yes. There is another relevant function you should know. This is NTILEfunction.

Let's suppose that we want to divide our employees in four different groups based on the number of issued orders, so as the employees are categorised from more to less orders, and then placed in a group. We'll call these groups, with "Very High", "High", "Normal" and "Low". How to get this quickly and efficiently? Again, It's easy. You only have to use NTILE function.

NTILE function divides rows of an ordered partition into a specified number of groups. This number is passed as an input param to the function. It must be and int or bigint positive number. Then, for each row, NTILE returns the group number to which the row belongs according to the ordered partition.

Let's see an example:

Copy Code
 /* Query example using WINDOWING 'NTILE' function */ 
 select EmployeeId, COUNT(*) as NumOrders, 
        NTILE(4) OVER (ORDER BY COUNT(*) DESC) [Group], 
        case NTILE(4) OVER (ORDER BY COUNT(*) DESC) 
         when 1 then 'Very High' 
         when 2 then 'High' 
         when 3 then 'Normal' 
         else 'Low'   end as [GroupDescription] 
FROM Orders 
group by EmployeeId

And here are the results:

image

As you can see above, employees have been divided into four different groups. I have used two different fields to show new calculated groups using integer and text values. Group field is calculated directly using NTILE function with "4" as an input param value, and GroupDescription only use a CASE statement to obtain descriptions based on values returned by NTILE function.

As you can note, there are nine employees, so it's impossible to create four different groups with the same number of employees. This is the reason for Group "1" having one more component.


I hope this article is useful for you.

Comments (2) -

  • I wish the images were present in this article ...
    • Thanks for your warning. I'll fix the issues with the images as soon as I can.

Add comment