| O
racle® Database Data Warehousing Guide 10g Release 1 (10.1) Part Number B10736-01 |
|
|
View PDF |
The following topics provide information about how to improve analytical SQL queries in a data warehou se:
Oracle has enhanced SQL's analytical processing capa bilities by introducing a new family of analytic SQL functions. These analytic functions enable you to calculate:
Rankings and percentiles
Moving window calculations
Lag/lead anal ysis
First/last analysis
Linear regression statistics
Rank ing functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cu mulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-t o-period changes. First/last analysis enables you to find the first or last value in an ordered group.
Other enhancements to S
QL include the CASE expression. CASE expressions provide if-then logic useful in many situations.
In Oracle Database 10g, the SQL reporting capability was further enhanced by the introduction of partitioned outer join. Parti tioned outer join is an extension to ANSI outer join syntax that allows users to selectively densify certain dimensions while keeping others sparse. This allows reporting tools to selectively densify dimensions, for example, the ones that appear in their cross-tabul ar reports while keeping others sparse.
To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing da tabase performance, scalability, and simplicity.
Analytic functions are classified as described in Table 2 1-1.
| Used For | |
|---|---|
| Ranking< /td> | Calculating ranks, percentiles, and n-tiles of the values in a result set. |
| Windowing | Calculating cumulative and moving aggregates. Works with these functions: SUM, AVG, MIN, <
code>MAX, COUNT, VARIANCE, STDDEV, FIRST_VALUE, LAST_VALUE, a
nd new statistical functions |
| Reporting | Calculating shares, for example, market share. Works with these functions: SUM, AVG, MIN, MAX, COUNT (with/without DISTINCT), VARIANCE, |
| LAG/LEAD | Finding a value in a row a specifie d number of rows from a current row. |
| FIRST/L AST | First or last value in an ordered group. |
| Linear Regression | Calculating linear regression and other statistics (slope, intercept, and so on). |
| Inverse Percentile | The value in a data set that corresponds to a specified percentile. |
| Hypothetical Rank and Distribution | The rank or percentile that a row would have if inserted into a specifie d data set. |
To perform these operations, the analytic fun ctions add several new elements to SQL processing. These elements build on existing SQL to allow flexible and powerful calculation ex pressions. With just a few exceptions, the analytic functions have these new elements. The processing flow is represented in Figure 21-1.
The essential concepts used in analytic functions are:
Processing order
Query processing using analytic
functions takes place in three stages. First, all joins, WHERE, GROUP BY and HAVING clauses are performed. Second, the result set is made available to the analytic functions, and all their calculations take place.
Third, if the query has an ORDER BY clause at its end, the ORDER BY is processed
to allow for precise output ordering. The processing order is shown in Figure 21-1.
Result set partitions
The analytic functions allow users to divide query result sets into groups of rows called partitions
. Note that the term partitions used with analytic functions is unrelated to the table partitions feature. Throughou
t this chapter, the term partitions refers to only the meaning related to analytic functions. Partitions are created after the groups
defined with GROUP BY clauses, so they are available to any aggregate results such as sums and averages. P
artition divisions may be based upon any desired columns or expressions. A query result set may be partitioned into just one partitio
n holding all the rows, a few large partitions, or many small partitions holding just a few rows each.
W indow
For each row in a partition, you can define a sliding window of data. This window determines the range of rows used to p erform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time. The window has a starting row and an ending row. Depending on its definition, the window may move at one or both ends. For instance, a window defined for a cumulative sum function would have its starting row fixed at the first row of its partition, and its ending row would slide from the starting point all the way to the last row of the partition. In contrast, a window defined for a mov ing average would have both its starting and end points slide so that they maintain a constant physical or logical range.
A wi ndow can be set as large as all the rows in a partition or just a sliding window of one row within a partition. When a window is near a border, the function returns results for only the available rows, rather than warning you that the results are not what you want.< /p>
When using window functions, the current row is included during calculations, so you should only specify (n-1) when y ou are dealing with n items.
Current row
Each calculation performed with an analytic fun ction is based on a current row within a partition. The current row serves as the reference point determining the start and end of th e window. For instance, a centered moving average calculation could be defined with a window that holds the current row, the six prec eding rows, and the following six rows. This would create a sliding window of 13 rows, as shown in Figure 21-2.
< font face="arial, helvetica, sans-serif">Figure 21-2 Sliding Window Example

A ranking function computes the rank of a record compared to other records in the data set based on the values of a set of measures. The types of ranking function are:
The RANK and DENSE_RANK functions allow you to rank items in a group, for example, finding th
e top three products sold in California last year. There are two functions that perform ranking, as shown by the following syntax:
RANK ( ) OVER ( [query_partition_clause] order_by_clause ) DENSE_RANK ( ) OVER ( [query_partition_clause] order_by_clause )
The difference between RANK and DENSE_RANK is that DENSE_RANK l
eaves no gaps in ranking sequence when there are ties. That is, if you were ranking a competition using DENSE_RANK and h
ad three people tie for second place, you would say that all three were in second place and that the next person came in third. The <
code>RANK function would also give three people in second place, but the next person would be in fifth place.
The follo
wing are some relevant points about RANK:
Ascending is the default sort order, which you ma y want to change to descending.
The expressions in the optional PARTITION BY c
lause divide the query result set into groups within which the RANK function operates. That is, RANK gets r
eset whenever the group changes. In effect, the value expressions of the PARTITION BY clause define the res
et boundaries.
If the PARTITION BY clause is missing, then ranks are computed
over the entire query result set.
The ORDER BY clause specifies the measures (
<value expression>) on which ranking is done and defines the order in which rows are sorted in each g
roup (or partition). Once the data is sorted within each partition, ranks are given to each row starting from 1.
The NULLS FIRST | NULLS LAST clause indicates the position of NULL
s in the ordered sequence, either first or last in the sequence. The order of the sequence would make NULLs compa
re either high or low with respect to non-NULL values. If the sequence were in ascending order, then NULLS
FIRST implies that NULLs are smaller than all other non-NULL values and NULLS NULL values. It is the opposite for descending order. See the example in
"Treatment of NULLs".
If the NULLS FIRST | NULL
S LAST clause is omitted, then the ordering of the null values depends on the ASC or DESC arguments. Null values are considered larger than any other values. If the ordering sequence is ASC, then nulls will
appear last; nulls will appear first otherwise. Nulls are considered equal to other nulls and, therefore, the order in which nulls ar
e presented is non-deterministic.
The following example shows how the [ASC | DESC] opt
ion changes the ranking order.
Example 21-1 Ranking Order
SELECT channel_desc, TO_CHAR(SUM
(amount_sold), '9,999,999,999') SALES$,
RANK() OVER (ORDER BY SUM(amount_sold)) AS default_rank,
RANK() OVER (ORDER BY SUM(amou
nt_sold) DESC NULLS LAST) AS custom_rank
FROM sales, products, customers, times, channels, countries
WHERE sales.prod_id=products.pro
d_id AND sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
times.calen
dar_month_desc IN ('2000-09', '2000-10') AND country_iso_code='US'
GROUP BY channel_desc;
CHANNEL_DESC SALES$ DEFAU
LT_RANK CUSTOM_RANK
-------------------- -------------- ------------ -----------
Direct Sales 2,443,392 3
1
Partners 1,365,963 2 2
Internet 467,478 1 3
<
/pre>
While the data in this result is ordered on the measure SALES$, in general, it is not guaranteed by the
RANK function that the data will be sorted on the measures. If you want the data to be sorted on SALES$ in your r
esult, you must specify it explicitly with an ORDER BY clause, at the end of the SELECT statem
ent.
Ranking functions need to
resolve ties between values in the set. If the first expression cannot resolve ties, the second expression is used to resolve ties a
nd so on. For example, here is a query ranking three of the sales channels over two months based on their dollar sales, breaking ties
with the unit sales. (Note that the TRUNC function is used here only to create tie values for this query.)
Example 21-2 Ranking On Multiple Expressions
SELECT channel_desc, calendar_month_desc, TO_CHAR(TRUNC(S
UM(amount_sold),-5),
'9,999,999,999') SALES$, TO_CHAR(SUM(quantity_sold), '9,999,999,999')
SALES_Count, RANK() OVER (ORDER BY TR
UNC(SUM(amount_sold), -5) DESC, SUM(quantity_sold) DESC) AS col_rank
FROM sales, products, customers, times, channels
WHERE sales.pro
d_id=products.prod_id AND sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
AND
times.calendar_month_desc IN ('2000-09', '2000-10') AND
channels.channel_desc<>'Tele Sales'
GROUP BY channel_desc,
calendar_month_desc;
CHANNEL_DESC CALENDAR SALES$ SALES_COUNT COL_RANK
-------------------- -------- -----------
--- -------------- ---------
Direct Sales 2000-10 1,200,000 12,584 1
Direct Sales 2000-09
1,200,000 11,995 2
Partners 2000-10 600,000 7,508 3
Partners 200
0-09 600,000 6,165 4
Internet 2000-09 200,000 1,887 5
Internet
2000-10 200,000 1,450 6
The sales_count column breaks the ties for three pai
rs of values.
The difference be
tween RANK and DENSE_RANK functions is illustrated as follows:
Example 21-3 RANK and DENSE_RANK
SELECT channel_desc, calendar_month_desc,
TO_CHAR(TRUNC(SUM(amount_sold),-4), '9,999,999,999') SAL
ES$,
RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-4) DESC) AS RANK,
DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-4) DES
C) AS DENSE_RANK
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cu
st_id AND
sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
times.calendar_month_desc IN ('2000-09', '200
0-10') AND
channels.channel_desc<>'Tele Sales'
GROUP BY channel_desc, calendar_month_desc;
CHANNEL_DESC CALENDAR
SALES$ RANK DENSE_RANK
-------------------- -------- -------------- --------- ----------
Direct Sales 2000-09
1,200,000 1 1
Direct Sales 2000-10 1,200,000 1 1
Partners 2000-09
600,000 3 2
Partners 2000-10 600,000 3 2
Internet 2000-09
200,000 5 3
Internet 2000-10 200,000 5 3
Note that, in t
he case of DENSE_RANK, the largest rank value gives the number of distinct values in the data set.
The RANK function can be made to operate within grou
ps, that is, the rank gets reset whenever the group changes. This is accomplished with the PARTITION BY cla
use. The group expressions in the PARTITION BY subclause divide the data set into groups within which RANK operates. For example, to rank products within each channel by their dollar sales, you could issue the following stateme
nt.
Exam ple 21-4 Per Group Ranking Example 1
SELECT channel_desc, calendar_month_desc, TO_
CHAR(SUM(amount_sold),
'9,999,999,999') SALES$, RANK() OVER (PARTITION BY channel_desc
ORDER BY SUM(amount_sold) DESC) AS RANK_BY_
CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
times.calendar_month_desc IN ('2000-08', '2000-
09', '2000-10', '2000-11')
AND channels.channel_desc IN ('Direct Sales', 'Internet')
GROUP BY channel_desc, calendar_month_desc
;
A single query block can contain more than one ranking function, each partitioning the data into different groups (that
is, reset on different boundaries). The groups can be mutually exclusive. The following query ranks products based on their dollar s
ales within each month (rank_of_product_per_region) and within each channel (rank_of_product_total).
Example 21-5 Per Group Ranking Example 2
SELECT channel_desc, cal
endar_month_desc, TO_CHAR(SUM(amount_sold),
'9,999,999,999') SALES$, RANK() OVER (PARTITION BY calendar_month_desc
ORDER BY SUM(a
mount_sold) DESC) AS RANK_WITHIN_MONTH, RANK() OVER (PARTITION
BY channel_desc ORDER BY SUM(amount_sold) DESC) AS RANK_WITHIN_CHANNE
L
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
times.calendar_month_desc IN ('2000-08', '2000-09',
'2000-10', '2000-11')
AND channels.channel_desc IN ('Direct Sales', 'Internet')
GROUP BY channel_desc, calendar_month_desc;
CHANNEL_DESC CALENDAR SALES$ RANK_WITHIN_MONTH RANK_WITHIN_CHANNEL
---------------
----- -------- -------------- ----------------- -------------------
Direct Sales 2000-08 1,236,104
1 1
Internet 2000-08 215,107 2 4
Direct Sales
2000-09 1,217,808 1 3
Internet 2000-09 228,241 2
3
Direct Sales 2000-10 1,225,584 1 2
Internet 2000
-10 239,236 2 2
Direct Sales 2000-11 1,115,239 1
4
Internet 2000-11 284,742 2 1
Analytic functions, RANK for example, ca
n be reset based on the groupings provided by a CUBE, ROLLUP, or GROUPING SETS op
erator. It is useful to assign ranks to the groups created by CUBE, ROLLUP, and GROUPING SETS queries. See Chapter 20, " SQL for Aggregation in Data Warehouses" for further informa
tion about the GROUPING function.
A sample CUBE and ROLLUP query is the following:
SELECT channel_desc, country_iso_code,
TO_CHAR(SUM(amount_sold), '9,999,999,999')
SALES$, RANK() OVER (
PARTITION BY GROUPING_ID(channel_desc, country_iso_code)
ORDER BY SUM(amount_sold) DESC) AS RANK_PER_GROUP
FROM sales, customers,
times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND
sales.channel_id = channels.
channel_id AND channels.channel_desc
IN ('Direct Sales', 'Internet') AND times.calendar_month_desc='2000-09'
AND country_iso_
code IN ('GB', 'US', 'JP')
GROUP BY CUBE(channel_desc, country_iso_code);
CHANNEL_DESC CO SALES$ RANK_PER_GROUP
----
---------------- -- -------------- --------------
Direct Sales GB 1,217,808 1
Direct Sales JP
1,217,808 1
Direct Sales US 1,217,808 1
Internet GB 228,241 4
Internet JP 228,241 4
Internet US 228,241 4
Direct Sales
3,653,423 1
Internet 684,724 2
GB 1,446,049
1
JP 1,446,049 1
US 1,446,049 1
4,338,147 1
NULLs are treated like normal values. Also, for rank computation, a NULL value is assumed to be equal to a
nother NULL value. Depending on the ASC | DESC options provided for measures and the NUL
LS FIRST | NULLS LAST clause, NULLs will either sort low or high and hence
, are given ranks appropriately. The following example shows how NULLs are ranked in different cases:
SELECT times.time_id time, sold,
RANK() OVER (ORDER BY (sold) DESC NULLS LAST) AS NLAST_DESC,
RANK() OVER (ORDER BY
(sold) DESC NULLS FIRST) AS NFIRST_DESC,
RANK() OVER (ORDER BY (sold) ASC NULLS FIRST) AS NFIRST,
RANK() OVER (ORDER BY (sold) A
SC NULLS LAST) AS NLAST
FROM
(
SELECT time_id , sum(sales.amount_sold) sold
FROM sales, products, customers, countries
WHERE sales.prod_id=products.prod_id AND
sales.cust_id=customers.cust_id AND prod_name IN ('Envoy Ambassador',
'Mouse Pad
') AND country_iso_code ='GB'
GROUP BY time_id)
v, times
WHERE v.time_id (+) =times.time_id AND calendar_year=1999
AND cal
endar_month_number=1
ORDER BY sold DESC NULLS LAST;
TIME SOLD NLAST_DESC NFIRST_DESC NFIRST NLAST
--------- ---
------- ---------- ----------- ---------- ----------
14-JAN-99 25241.48 1 13 31 19
21-JAN-99 24
365.05 2 14 30 18
10-JAN-99 22901.24 3 15 29 17
20-JAN-99 165
78.19 4 16 28 16
16-JAN-99 15881.12 5 17 27 15
30-JAN-99 1563
7.49 6 18 26 14
17-JAN-99 13262.87 7 19 25 13
25-JAN-99 13227
.08 8 20 24 12
03-JAN-99 9885.74 9 21 23 11
28-JAN-99 4471.
08 10 22 22 10
27-JAN-99 3453.66 11 23 21 9
23-JAN-99 925.4
5 12 24 20 8
07-JAN-99 756.87 13 25 19 7
08-JAN-99 571.8
14 26 18 6
13-JAN-99 569.21 15 27 17 5
02-JAN-99 316.87
16 28 16 4
12-JAN-99 195.54 17 29 15 3
26-JAN-99 92.96
18 30 14 2
19-JAN-99 86.04 19 31 13 1
05-JAN-99
20 1 1 20
01-JAN-99 20 1 1 20
31-JAN-99
20 1 1 20
11-JAN-99 20 1 1 20
06-JAN-99
20 1 1 20
18-JAN-99 20 1 1 20
09-JAN-99
20 1 1 20
29-JAN-99 20 1 1 20
22-JAN-99
20 1 1 20
04-JAN-99 20 1 1 20
24-JAN-99
20 1 1 20
15-JAN-99 20 1 1 20
Bottom N is similar to top N except for the ordering sequence within the
rank expression. Using the previous example, you can order SUM(s_amount) ascending instead of descending.
The CUME_DIST function (defined as the inverse of percentile in some statistical books)
computes the position of a specified value relative to a set of values. The order can be ascending or descending. Ascending is the d
efault. The range of values for CUME_DIST is from greater than 0 to 1. To compute the CUME_DIST of a value
x in a set S of size N, you use the formula:
CUME_DIST(x) = number of values in S coming before an d including x in the specified order/ N
Its syntax is:
CUME_DIST ( ) OVER ( [query_partition _clause] order_by_clause )
The semantics of various options in the CUME_DIST function are similar to those i
n the RANK function. The default order is ascending, implying that the lowest value gets the lowest CUME_DIST (as all other values come later than this value in the order). NULLs are treated the same as they are in the RA
NK function. They are counted toward both the numerator and the denominator as they are treated like non-NULL val
ues. The following example finds cumulative distribution of sales by channel within each month:
SELECT
calendar_month_desc AS MONTH, channel_desc,
TO_CHAR(SUM(amount_sold) , '9,999,999,999') SALES$,
CUME_DIST() OVER (PARTITION
BY calendar_month_desc ORDER BY
SUM(amount_sold) ) AS CUME_DIST_BY_CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND sales.channel_id=
channels.channel_id AND
times.calendar_month_desc IN ('2000-09', '2000-07','2000-08')
GROUP BY calendar_month_desc, channel_de
sc;
MONTH CHANNEL_DESC SALES$ CUME_DIST_BY_CHANNEL
-------- -------------------- -------------- -----------------
---
2000-07 Internet 140,423 .333333333
2000-07 Partners 611,064 .6666666
67
2000-07 Direct Sales 1,145,275 1
2000-08 Internet 215,107 .33333333
3
2000-08 Partners 661,045 .666666667
2000-08 Direct Sales 1,236,104 1
2000-09 Internet 228,241 .333333333
2000-09 Partners 666,172 .666666667
2000-09 Direct Sales 1,217,808 1
PERCENT_
RANK is similar to CUME_DIST, but it uses rank values rather than row counts in its numerator. Therefore, it retu
rns the percent rank of a value relative to a group of values. The function is available in many popular spreadsheets. PERCENT_
RANK of a row is calculated as:
(rank of row in its partition - 1) / (number of rows in the part ition - 1)
PERCENT_RANK returns values in the range zero to one. The row(s) with a rank of 1 will have a
PERCENT_RANK () OVER ([query_partition_clause] order_by_ clause)
NTILE allows easy calculation of tertiles, quartiles, deciles and
other common summary statistics. This function divides an ordered partition into a specified number of groups called buckets<
/strong> and assigns a bucket number to each row in the partition. NTILE is a very useful calculation because it lets us
ers divide a data set into fourths, thirds, and other groupings.
The buckets are calculated so that each bucket has exactly th
e same number of rows assigned to it or at most 1 row more than the others. For instance, if you have 100 rows in a partition and ask
for an NTILE function with four buckets, 25 rows will be assigned a value of 1, 25 rows will have value 2, and so on. T
hese buckets are referred to as equiheight buckets.
If the number of rows in the partition does not divide evenly (without a r
emainder) into the number of buckets, then the number of rows assigned for each bucket will differ by one at most. The extra rows wil
l be distributed one for each bucket starting from the lowest bucket number. For instance, if there are 103 rows in a partition which
has an NTILE(5) function, the first 21 rows will be in the first bucket, the next 21 in the second bucket, the next 21
in the third bucket, the next 20 in the fourth bucket and the final 20 in the fifth bucket.
The NTILE function ha
s the following syntax:
NTILE (expr) OVER ([query_partition_clause] order_by_clause)
In thi
s, the N in NTILE(N) can be a constant (for example, 5) or an expression.
This function, like RANK a
nd CUME_DIST, has a PARTITION BY clause for per group computation, an ORDER BY clause for specifying the measures and their sort order, and NULLS FIRST | NULLS NULLs. For example, the following is an example assigning each mont
h's sales total into one of 4 buckets:
SELECT calendar_month_desc AS MONTH , TO_CHAR(SUM(amount_sold),
'9,999,999,999')
SALES$, NTILE(4) OVER (ORDER BY SUM(amount_sold)) AS TILE4
FROM sales, products, customers, times, channels
WHERE
sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
sales.time_id=times.time_id AND sales.channel_id=channel
s.channel_id AND
times.calendar_year=2000 AND prod_category= 'Electronics'
GROUP BY calendar_month_desc;
MONTH SALES$
TILE4
-------- -------------- ----------
2000-02 242,416 1
2000-01 257,286 1
2000-03
280,011 1
2000-06 315,951 2
2000-05 316,824 2
2000-04 318,106 2
2000-07
433,824 3
2000-08 477,833 3
2000-12 553,534 3
2000-10 652,225 4
2000-11 661,147 4
2000-09 691,449 4
NTILE ORDER BY statements must be fully specified to yield reproducible results. Equal values can get distributed across adjacent buckets. To ens
ure deterministic results, you must order on a unique key.
The ROW_NUMBER
function assigns a unique number (sequentially, starting from 1, as defined by ORDER BY) to each row withi
n the partition. It has the following syntax:
ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_c lause )
Example 21-6 ROW_NUMBER
SELECT channel_desc, calendar_month_desc,
TO_CHAR(TR
UNC(SUM(amount_sold), -5), '9,999,999,999') SALES$,
ROW_NUMBER() OVER (ORDER BY TRUNC(SUM(amount_sold), -6) DESC) AS ROW_NUMBER
F
ROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
sa
les.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
times.calendar_month_desc IN ('2001-09', '2001-10')
GROU
P BY channel_desc, calendar_month_desc;
CHANNEL_DESC CALENDAR SALES$ ROW_NUMBER
-------------------- -------- ------
-------- ----------
Direct Sales 2001-09 1,100,000 1
Direct Sales 2001-10 1,000,000 2
I
nternet 2001-09 500,000 3
Internet 2001-10 700,000 4
Partners 2
001-09 600,000 5
Partners 2001-10 600,000 6
Note that there are three pairs
of tie values in these results. Like NTILE, ROW_NUMBER is a non-deterministic function, so each tied value
could have its row number switched. To ensure deterministic results, you must order on a unique key. Inmost cases, that will require
adding a new tie breaker column to the query and using it in the ORDER BY specification.
Windowing functions can be used to compute cumulative, moving, and centered aggregates. They return a value for each row in t
he table, which depends on other rows in the corresponding window. These functions include moving sum, moving average, moving min/max
, cumulative sum, as well as statistical functions. They can be used only in the SELECT and ORDER BY<
/code> clauses of the query. Two other functions are available: FIRST_VALUE, which returns the first value in the window
; and LAST_VALUE, which returns the last value in the window. These functions provide access to more than one row of a t
able without a self-join. The syntax of the windowing functions is:
{SUM|AVG|MAX|MIN|COUNT|STDDEV|VARIA
NCE|FIRST_VALUE|LAST_VALUE}
({value expression1 | *}) OVER
([PARTITION BY value expression2[,...])
ORD
ER BY value expression3 [collate clause>]
[ASC| DESC] [NULLS FIRST | NULLS LAST] [,...]
{ROWS | RANGE} {
BETWEEN
{UNBOUNDED PRECEDING | CURRENT ROW | value_expr {PRECEDING | FOLLOWING}} AND
{ UNBOUNDED FOLLOWING | CURRENT ROW | value_
expr { PRECEDING | FOLLOWING } }
| { UNBOUNDED PRECEDING | CURRENT ROW | value_expr PRECEDING}}
Window functions' NULL semantics match the NULL semantics for SQL aggregate functions. Other seman
tics can be obtained by user-defined functions, or by using the DECODE or a CASE expression within the wind
ow function.
A logical offset can be specified with co
nstants such as RANGE 10 PRECEDING, or an expression that evaluates to a constant, or by an interval specification like
RANGE INTERVAL N DAY/MONTH/YEAR PRECEDING
or an expression that evaluates to an interval. With logical offset, there can only be one expression in the ORDER BY expression list in the function, with type compatible to NUMERIC if offset is numeric, or DATE i
f an interval is specified.
Example 21-7 Cumulative Aggregate Function
The following is an example of cumulative
amount_sold by customer ID by quarter in 1999:
SELECT c.cust_id, t.calendar_quarter_desc,
TO_CHAR (SUM(amount_sold),
'9,999,999,999.99') AS Q_SALES, TO_CHAR(SUM(SUM(amount_sold))
OVER (PARTITION BY c.cust_id ORDER BY c.c
ust_id, t.calendar_quarter_desc
ROWS UNBOUNDED
PRECEDING), '9,999,999,999.99') AS CUM_SALES
FROM sales s, times t, customers c
WH
ERE s.time_id=t.time_id AND s.cust_id=c.cust_id AND t.calendar_year=2000
AND c.cust_id IN (2595, 9646, 11111)
GROUP BY c.cust_id,
t.calendar_quarter_desc
ORDER BY c.cust_id, t.calendar_quarter_desc;
CUST_ID CALENDA Q_SALES CUM_SALES
---------- --
----- ----------------- -----------------
2595 2000-01 659.92 659.92
2595 2000-02 224.79
884.71
2595 2000-03 313.90 1,198.61
2595 2000-04 6,015.08 7,213.69
9646 2000-01 1,337.09 1,337.09
9646 2000-02 185.67 1,522.76
9646 2000-03
203.86 1,726.62
9646 2000-04 458.29 2,184.91
11111 2000-01 43.18 4
3.18
11111 2000-02 33.33 76.51
11111 2000-03 579.73 656.24
11111 2000-04
307.58 963.82
In this example, the analytic function SUM defines, for each row, a win
dow that starts at the beginning of the partition (UNBOUNDED PRECEDING) and ends, by default, at the curren
t row.
Nested SUMs are needed in this example since we are performing a SUM over a value that is its
elf a SUM. Nested aggregations are used very often in analytic aggregate functions.
Example 21-8 Moving Aggregate Function
This example of a time-based window shows, for one customer, the moving average of sales for the current month and preceding two months:
SELECT c.cust_id, t.calendar_month_desc, TO_C
HAR (SUM(amount_sold),
'9,999,999,999') AS SALES, TO_CHAR(AVG(SUM(amount_sold))
OVER (ORDER BY c.cust_id, t.calendar_month_des
c ROWS 2 PRECEDING), '9,999,999,999')
AS MOVING_3_MONTH_AVG
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id AND s.cust_
id=c.cust_id AND
t.calendar_year=1999 AND c.cust_id IN (6510)
GROUP BY c.cust_id, t.calendar_month_desc
ORDER BY c.cust_id, t.c
alendar_month_desc;
CUST_ID CALENDAR SALES MOVING_3_MONTH
---------- -------- -------------- --------------
6510
1999-04 125 125
6510 1999-05 3,395 1,760
6510 1999-06 4,080
2,533
6510 1999-07 6,435 4,637
6510 1999-08 5,105 5,207
6510 1999-09
4,676 5,405
6510 1999-10 5,109 4,963
6510 1999-11 802 3,529
<
p>Note that the first two rows for the three month moving average calculation in the output data are based on a smaller interval size
than specified because the window calculation cannot reach past the data retrieved by the query. You need to consider the different
window sizes found at the borders of result sets. In other words, you may need to modify the query to include exactly what you want.<
/p>Calculating windowing aggregate fu nctions centered around the current row is straightforward. This example computes for all customers a centered moving average of sale s for one week in late December 1999. It finds an average of the sales total for the one day preceding the current row and one day fo llowing the current row including the current row as well.
SELECT t.time_id, TO_CHAR (SUM(amount_sold), '9,999,999,999') AS SALES, TO_CHAR(AVG(SUM(amount_sold)) OVER (ORDER BY t.time_id RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND INTERVAL '1' DAY FOLLOWING), '9,999,999,999') AS CENTERED_3_DAY_AVG FROM sales s, times t WHERE s.time_id=t.time_id AND t.calendar_week_number IN (51) AND calendar_year=1999 GROUP BY t.time_id ORDER BY t .time_id; TIME_ID SALES CENTERED_3_DAY --------- -------------- -------------- 20-DEC-99 134,337 106,676 21 -DEC-99 79,015 102,539 22-DEC-99 94,264 85,342 23-DEC-99 82,746 93,322 24-DEC-99 102,957 82,937 25-DEC-99 63,107 87,062 26-DEC-99 95,123 79,115
The starting and ending rows for each product's centered moving average calculation in the output data are based on just two days, since the window c alculation cannot reach past the data retrieved by the query. Users need to consider the different window sizes found at the borders of result sets: the query may need to be adjusted.
The following example illustrates how window aggregate functions compute values when there are duplicates, that is, when multiple rows are returned for a single ordering value. The query retrieves the quantity sold to several customers during a specified time range. (Although we use an inline view to define our base data set, it has no special sign ificance and can be ignored.) The query defines a moving window that runs from the date of the current row to 10 days earlier.Note th at the RANGE keyword is used to define the windowing clause of this example. This means that the window can potentially hold many row s for each value in the range. In this case, there are three pairs of rows with duplicate date values.
Example 21-10 Windowing Aggregate Func tions with Logical Offsets
SELECT time_id, daily_sum, SUM(daily_sum) OVER (ORDER B Y time_id RANGE BETWEEN INTERVAL '10' DAY PRECEDING AND CURRENT ROW) AS current_group_sum FROM (SELECT time_id, channel_id, SUM(s.qua ntity_sold) AS daily_sum FROM customers c, sales s, countries WHERE c.cust_id=s.cust_id AND s.cust_id IN (638, 634, 753, 440 ) AND s. time_id BETWEEN '01-MAY-00' AND '13-MAY-00' GROUP BY time_id, channel_id); TIME_ID DAILY_SUM CURRENT_GROUP_SUM --------- ------- --- ----------------- 06-MAY-00 161 161 /* 161 */ 10-MAY-00 23 207 /* 161 +(23+23) */ 10-MAY-00 23 207 /* 161 +(23+23) */ 11-MAY-00 46 345 /* 161 +(23+23)+(46+92) * / 11-MAY-00 92 345 /* 161 +(23+23)+(46+92) */ 12-MAY-00 23 368 /* 161 +(23+23)+(4 6+92)+23 */ 13-MAY-00 46 529 /* 161 +(23+23)+(46+92)+23+(46+115) */ 13-MAY-00 115 529 /* 161 +(23+23)+(46+92)+23+(46+115) */
In the output of this example, all dates except May 6 and May 12 return two row s with duplicate dates. Examine the commented numbers to the right of the output to see how the values are calculated. Note that each group in parentheses represents the values returned for a single day.
Note that this example applies only when you use the ROWS keyword. It is also important to remember that with RANGE, yo
u can only use 1 ORDER BY expression in the analytic function's ORDER BY clause. With the ORDER BY clause
.
There are situations where it is useful to vary the size of a window for each row, based on a specified condition. For instance, you may want to make the window larger for certain dates and smaller for others. Assume that you want to calculate the moving average of stock price over three work ing days. If you have an equal number of rows for each day for all working days and no non-working days are stored, then you can use a physical window function. However, if the conditions noted are not met, you can still calculate a moving average by using an expres sion in the window size parameters.
Expressions in a window size specification can be made in several different sources. the e
xpression could be a reference to a column in a table, such as a time table. It could also be a function that returns the appropriate
boundary for the window based on values in the current row. The following statement for a hypothetical stock price database uses a u
ser-defined function in its RANGE clause to set window size:
SELECT t_timekey, AVG(stock_p
rice)
OVER (ORDER BY t_timekey RANGE fn(t_timekey) PRECEDING) av_price
FROM stock, time WHERE st_timekey = t_timekey
ORDER BY
t_timekey;
In this statement, t_timekey is a date field. Here, fn could be a PL/SQL function with
the following specification:
fn(t_timekey) returns
4 if t_timekey is Mo
nday, Tuesday
2 otherwise
If any of the previous days are holidays, it adju sts the count appropriately.
Note that, when window is specified using a number in a window function with ORDE
R BY on a date column, then it is converted to mean the number of days. You could have also used the interval lit
eral conversion function, as NUMTODSINTERVAL(fn(t_timekey), 'DAY') instead of just fn(t_timekey) to mean th
e same thing. You can also write a PL/SQL function that returns an INTERVAL datatype value.
For windows expressed in rows, the ordering expressions should be uniq
ue to produce deterministic results. For example, the following query is not deterministic because time_id is not unique
in this result set.
Example 21-11 Windowing Aggregate Functions With Physical Offsets
SEL
ECT t.time_id, TO_CHAR(amount_sold, '9,999,999,999') AS INDIV_SALE,
TO_CHAR(SUM(amount_sold) OVER (PARTITION BY t.time_id ORDE
R BY t.time_id
ROWS UNBOUNDED PRECEDING), '9,999,999,999') AS CUM_SALES
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id
AND s.cust_id=c.cust_id AND t.time_id IN
(TO_DATE('11-DEC-1999'), TO_DATE('12-DEC-1999')) AND c.cust_id
BETWEEN 6500 AND 6600
ORD
ER BY t.time_id;
TIME_ID INDIV_SALE CUM_SALES
--------- ---------- ---------
12-DEC-99 23 23
12-DEC-99
9 32
12-DEC-99 14 46
12-DEC-99 24 70
12-DEC-99 19 89
<
p>One way to handle this problem would be to add the prod_id column to the result set and order on both time_id and prod_id.The FIRST_VALUE and LAST_VALUE functions allow you to select the first
and last rows from a window. These rows are especially valuable because they are often used as the baselines in calculations. For in
stance, with a partition holding sales data ordered by day, you might ask "How much was each day's sales compared to the first sales
day (FIRST_VALUE) of the period?" Or you might wish to know, for a set of rows in increasing sales order, "What was the
percentage size of each sale in the region compared to the largest sale (LAST_VALUE) in the region?"
If the IGNORE NULLS option is used with FIRST_VALUE, it will return the first non-null value in the set, o
r NULL if all values are NULL. If IGNORE NULLS is used with LAST_VALUE, it will r
eturn the last non-null value in the set, or NULL if all values are NULL. The IGNORE NUL
LS option is particularly useful in populating an inventory table properly.
After a query has been processed, aggregate values like t
he number of resulting rows or an average value in a column can be easily computed within a partition and made available to other rep
orting functions. Reporting aggregate functions return the same aggregate value for every row in a partition. Their behavior with res
pect to NULLs is the same as the SQL aggregate functions. The syntax is:
{SUM | AVG | MAX
| MIN | COUNT | STDDEV | VARIANCE}
([ALL | DISTINCT] {value expression1 | *})
OVER ([PARTITION BY value expression
2[,...]])
In addition, the following conditions apply:
An asterisk (*) is only allowed
in COUNT(*)
DISTINCT is supported only if corresponding aggregate functions al
low it
value expression1 and value expression2 can be any va
lid expression involving column references or aggregates.
The PARTITION BY cla
use defines the groups on which the windowing functions would be computed. If the PARTITION BY clause is ab
sent, then the function is computed over the whole query result set.
Reporting functions can appear only in the ORDER BY clause. The major benefit of reporting functions is their ability to
do multiple passes of data in a single query block and speed up query performance. Queries such as "Count the number of salesmen wit
h sales more than 10% of city sales" do not require joins between separate query blocks.
For example, consider the question "F
or each product category, find the region in which it had maximum sales". The equivalent SQL query using the MAX reporti
ng aggregate function would be:
SELECT prod_category, country_region, sales
FROM (SELECT SUBSTR(p.prod_
category,1,8) AS prod_category, co.country_region, SUM(amount_sold) AS sales,
MAX(SUM(amount_sold)) OVER (PARTITION BY prod_category)
AS MAX_REG_SALES
FROM sales s, customers c, countries co, products p
WHERE s.cust_id=c.cust_id AND c.country_id=co.country_id AND
s.prod_id =p.prod_id AND s.time_id = TO_DATE('11-OCT-2001')
GROUP BY prod_category, country_region)
WHERE sales = MAX_REG_SALES;
The inner query with the reporting aggregate function MAX(SUM(amount_sold)) returns:
PROD_CAT COUNTRY_REGION SALES MAX_REG_SALES -------- -------------------- ---------- ------------- Electron America s 581.92 581.92 Hardware Americas 925.93 925.93 Peripher Americas 3084.4 8 4290.38 Peripher Asia 2616.51 4290.38 Peripher Europe 4290.38 4290.38 Periphe r Oceania 940.43 4290.38 Software Americas 4445.7 4445.7 Software Asia 1408.19 4445.7 Software Europe 3288.83 4445.7 Software Oceania 890.25 4445. 7
The full query results are:
PROD_CAT COUNTRY_REGION SALES -------- ----------- --------- ---------- Electron Americas 581.92 Hardware Americas 925.93 Peripher Europe 4290.38 Software Americas 4445.7
Example 21-12 Reporting Aggregate Example
Reporting aggreg ates combined with nested queries enable you to answer complex queries efficiently. For example, what if you want to know the best se lling products in your most significant product subcategories? The following is a query which finds the 5 top-selling products for ea ch product subcategory that contributes more than 20% of the sales within its product category:
SELECT
SUBSTR(prod_category,1,8) AS CATEG, prod_subcategory, prod_id, SALES
FROM (SELECT p.prod_category, p.prod_subcategory, p.prod_id,
SUM(amount_sold) AS SALES,
SUM(SUM(amount_sold)) OVER (PARTITION BY p.prod_category) AS CAT_SALES,
SUM(SUM(amount_sol
d)) OVER
(PARTITION BY p.prod_subcategory) AS SUBCAT_SALES,
RANK() OVER (PARTITION BY p.prod_subcategory
ORDER BY SUM(amount_sold) ) AS RANK_IN_LINE
FROM sales s, customers c, countries co, products p
WHERE s.cust_id=c.cust_id
AND
c.country_id=co.country_id AND s.prod_id=p.prod_id AND
s.time_id=to_DATE('11-OCT-2000')
GROUP BY p.prod_cat
egory, p.prod_subcategory, p.prod_id
ORDER BY prod_category, prod_subcategory)
WHERE SUBCAT_SALES>0.2*CAT_SALES AND RANK_I
N_LINE<=5;
The RATIO_TO_REPORT function computes the ratio of a value to the sum of a set of value
s. If the expression value expression evaluates to NULL, RATIO_TO_REPORT also eva
luates to NULL, but it is treated as zero for computing the sum of values for the denominator. Its syntax is:
RATIO_TO_REPORT ( expr ) OVER ( [query_partition_clause] )
In this, the following applies:
expr can be any valid expression involving column references or aggregates.
The PARTITION BY clause defines the groups on which the RATIO_TO_REPORT function is to be
computed. If the PARTITION BY clause is absent, then the function is computed over the whole query result s
et.
Example 21-13 RATIO_TO_REPORT
To calculate RATIO_TO_REPORT of sales for each channel,
you might use the following syntax:
SELECT ch.channel_desc, TO_CHAR(SUM(amount_sold),'9,999,999') AS SA
LES,
TO_CHAR(SUM(SUM(amount_sold)) OVER (), '9,999,999') AS TOTAL_SALES,
TO_CHAR(RATIO_TO_REPORT(SUM(amount_sold)) OVER (),
'9.999')
AS RATIO_TO_REPORT
FROM sales s, channels ch
WHERE s.channel_id=ch.channel_id AND s.time_id=to_DATE('11-OCT-2000')
GROUP BY ch.channel_desc;
CHANNEL_DESC SALES TOTAL_SALE RATIO_
-------------------- ---------- ---------- ------
Direc
t Sales 14,447 23,183 .623
Internet 345 23,183 .015
Partners 8,391 23
,183 .362
The LAG and LEAD functions are useful for comparing values when the relative
positions of rows can be known reliably. They work by specifying the count of rows which separate the target row from the current row
. Because the functions provide access to more than one row of a table at the same time without a self-join, they can enhance process
ing speed. The LAG function provides access to a row at a given offset prior to the current position, and the LEAD
function provides access to a row at a given offset after the current position.
These functions hav e the following syntax:
{LAG | LEAD} ( value_expr [, offset] [, default] )
OVER ( [query_partition_
clause] order_by_clause )
offset is an optional parameter and defaults to 1. default is an optional parameter and is the value returned if offset falls outside the bounds of the table or p
artition.
Example 21-14 LAG/LEAD
SELECT time_id, TO_CHAR(SUM(amount_sold),'9,999,999') AS
SALES,
TO_CHAR(LAG(SUM(amount_sold),1) OVER (ORDER BY time_id),'9,999,999') AS LAG1,
TO_CHAR(LEAD(SUM(amount_sold),1) OVER (ORDE
R BY time_id),'9,999,999') AS LEAD1
FROM sales
WHERE time_id>=TO_DATE('10-OCT-2000') AND time_id<=TO_DATE('14-OCT-2000')
GROUP
BY time_id;
TIME_ID SALES LAG1 LEAD1
--------- ---------- ---------- ----------
10-OCT-00 238,479 23,
183
11-OCT-00 23,183 238,479 24,616
12-OCT-00 24,616 23,183 76,516
13-OCT-00 76,516 24,616 29,795
14-OCT-00 29,795 76,516
See "Data Densification for Reporting"
for information showing how to use the LAG/LEAD functions for doing period-to-period comparison querie
s on sparse data.
a>The FIRST/LAST aggregate functions allow you to rank a data set and work with its top-ranked or bottom-ranked rows. Af
ter finding the top or bottom ranked rows, an aggregate function is applied to any desired column. That is, FIRST/
LAST lets you rank on column A but return the result of an aggregate applied on the first-ranked or last-ranked rows of column
B. This is valuable because it avoids the need for a self-join or subquery, thus improving performance. These functions' syntax begi
ns with a regular aggregate function (MIN, MAX, SUM, AVG, COUNT, STDDEV) that produces a single return value per group. To specify the ranking used, the FIRST<
/code>/LAST functions add a new clause starting with the word KEEP.
These funct ions have the following syntax:
aggregate_function KEEP ( DENSE_RANK LAST ORDER BY
expr [ DESC | ASC
] [NULLS { FIRST | LAST }]
[, expr [ DESC | ASC ] [NULLS { FIRST | LAST }]]...)
[OVER query_partitioning_clause]
Note t
hat the ORDER BY clause can take multiple expressions.
You can use the FIRST/LAST family of aggregates as regular aggregate functions.
Example 21-15 FIRST/LAST Example 1
The following query lets us compare minimum pric e and list price of our products. For each product subcategory within the Men's clothing category, it returns the following:
List price of the product with the lowest minimum price
Lowest minimum price
List price of the product with the highest minimum price
Highest minimum price
SELECT prod_subcategory, MIN(prod_list_price) KEEP (DENSE_RAN K FIRST ORDER BY (prod_min_price)) AS LP_OF_LO_MINP, MIN(prod_min_price) AS LO_MINP, MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER BY (prod_min_price)) AS LP_OF_HI_MINP, MAX(prod_min_price) AS HI_MINP FROM products WHERE prod_category='Electronics' GROUP BY prod _subcategory; PROD_SUBCATEGORY LP_OF_LO_MINP LO_MINP LP_OF_HI_MINP HI_MINP ---------------- ------------- ------- ------------- - --------- Game Consoles 299.99 299.99 299.99 299.99 Home Audio 499.99 499.99 599.99 599.99 Y Box Accessories 7.99 7.99 20.99 20.99 Y Box Games 7.99 7.99 29.99 29 .99
You can also use the FIRST/LAST<
/code> family of aggregates as reporting aggregate functions. An example is calculating which months had the greatest and least incre
ase in head count throughout the year. The syntax for these functions is similar to the syntax for any other reporting aggregate.
Consider the example in Example 21-15 for FIRST/LAST. What if we wanted to find the list pri
ces of individual products and compare them to the list prices of the products in their subcategory that had the highest and lowest m
inimum prices?
The following query lets us find that information for the Documentation subcategory by using FIRST/LAST
code> as reporting aggregates.
Example 21-16 FIRST/LAST Example 2
SELECT prod_id, prod_list_price,
MIN(prod_list_price) KEEP (DENSE_RANK FIRST ORDER BY (prod_min_price))
OVER(
PARTITION BY (prod_subcategory)) AS LP_OF_LO_MINP,
MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER BY (prod_min_price))
O
VER(PARTITION BY (prod_subcategory)) AS LP_OF_HI_MINP
FROM products WHERE prod_subcategory = 'Documentation';
PROD_ID PROD_LIST_P
RICE LP_OF_LO_MINP LP_OF_HI_MINP
---------- --------------- ------------- -------------
40 44.99 44.99
44.99
41 44.99 44.99 44.99
42 44.99 44.99 44.99
43
44.99 44.99 44.99
44 44.99 44.99 44.99
45 44.99 4
4.99 44.99
Using the FIRST and LAST functions as reporting aggregates makes it easy to
include the results in calculations such "Salary as a percent of the highest salary."
Using the CUME_DIST function, you can find the cumulative distribution (percentile) of a set
of values. However, the inverse operation (finding what value computes to a certain percentile) is neither easy to do nor efficientl
y computed. To overcome this difficulty, the PERCENTILE_CONT and PERCENTILE_DISC functions were introduced.
These can be used both as window reporting functions as well as normal aggregate functions.
These functions need a sort speci
fication and a parameter that takes a percentile value between 0 and 1. The sort specification is handled by using an ORDER BY clause with one expression. When used as a normal aggregate function, it returns a single value for each ordered
set.
PERCENTILE_CONT, which is a continuous function computed by interpolation, and PERCENTILE_DISC,
which is a step function that assumes discrete values. Like other aggregates, PERCENTILE_CONT and PERCENTILE_DISC
operate on a group of rows in a grouped query, but with the following differences:
They require a parameter between 0 and 1 (inclusive). A parameter specified out of this range will result in error. This parameter should be speci fied as an expression that evaluates to a constant.
They require a sort specification. This sort specifi
cation is an ORDER BY clause with a single expression. Multiple expressions are not allowed.
[PERCENTILE_CONT | PERCENTILE_DISC]( constant expression )
WITHIN
GROUP ( ORDER BY single order by expression
[ASC|DESC] [NULLS FIRST| NULLS LAST])
We use the following query to return the 17 rows of data used in the examples of this section:
SELE
CT cust_id, cust_credit_limit, CUME_DIST()
OVER (ORDER BY cust_credit_limit) AS CUME_DIST
FROM customers WHERE cust_city='Marshal
';
CUST_ID CUST_CREDIT_LIMIT CUME_DIST
---------- ----------------- ----------
28344 1500 .173913043
896
2 1500 .173913043
36651 1500 .173913043
32497 1500 .173913043
15192
3000 .347826087
102077 3000 .347826087
102343 3000 .347826087
8270 3000 .34782
6087
21380 5000 .52173913
13808 5000 .52173913
101784 5000 .52173913
304
20 5000 .52173913
10346 7000 .652173913
31112 7000 .652173913
35266
7000 .652173913
3424 9000 .739130435
100977 9000 .739130435
103066 10000 .7826
08696
35225 11000 .956521739
14459 11000 .956521739
17268 11000 .956521739
100
421 11000 .956521739
41496 15000 1
PERCENTILE_DISC(x) is
computed by scanning up the CUME_DIST values in each group till you find the first one greater than or equal to x<
/code>, where x is the specified percentile value. For the example query where PERCENTILE_DISC(0.5), the re
sult is 5,000, as the following illustrates:
SELECT PERCENTILE_DISC(0.5) WITHIN GROUP
(ORDER BY cust_
credit_limit) AS perc_disc, PERCENTILE_CONT(0.5) WITHIN GROUP
(ORDER BY cust_credit_limit) AS perc_cont
FROM customers WHERE cust
_city='Marshal';
PERC_DISC PERC_CONT
--------- ---------
5000 5000
The result of PERCENTILE_CONT
code> is computed by linear interpolation between rows after ordering them. To compute PERCENTILE_CONT(x), we first comp
ute the row number = RN= (1+x*(n-1)), where n is the number of rows in the group and x is the specifie
d percentile value. The final result of the aggregate function is computed by linear interpolation between the values from rows at ro
w numbers CRN = CEIL(RN) and FRN = FLOOR(RN).
The final result will be: PERCENTILE_CONT(X) = if (CRN = FRN = RN), then (value of expression from row at RN) else (CRN - RN) * (value
of expression for row at FRN) + (RN -FRN) * (value of expression for row at CRN).
Consi
der the previous example query, where we compute PERCENTILE_CONT(0.5). Here n is 17. The row number RN = (1 + 0.5*(n-1))= 9 for both groups. Putting this into the formula, (FRN=CRN=9), we return the value from row 9 as
the result.
Another example is, if you want to compute PERCENTILE_CONT(0.66). The computed row number RN
code>=(1 + 0.66*(n-1))= (1 + 0.66*16)= 11.67. PERCENTILE_CONT(0.66) = (12-11.67)*(value of row 11)+(11.67-1
1)*(value of row 12). These results are:
SELECT PERCENTILE_DISC(0.66) WITHIN GROUP
(ORDER BY cust_cr
edit_limit) AS perc_disc, PERCENTILE_CONT(0.66) WITHIN GROUP
(ORDER BY cust_credit_limit) AS perc_cont
FROM customers WHERE cust_c
ity='Marshal';
PERC_DISC PERC_CONT
---------- ----------
9000 8040
Inverse percentile aggregate functions
can appear in the HAVING clause of a query like other existing aggregate functions.
You can also use the aggregate functions PERCENTILE_CONT, PERCENTILE_DISC as rep
orting aggregate functions. When used as reporting aggregate functions, the syntax is similar to those of other reporting aggregates.
[PERCENTILE_CONT | PERCENTILE_DISC](constant expression) WITHIN GROUP ( ORDER BY single order by expression [ASC|DESC] [NULLS FIRST| NULLS LAST]) OVER ( [PARTITION BY value expression [,...]] )
T his query computes the same thing (median credit limit for customers in this result set, but reports the result for every row in the result set, as shown in the following output:
SELECT cust_id, cust_credit_limit, PERCENTILE_DISC(0.5) W
ITHIN GROUP
(ORDER BY cust_credit_limit) OVER () AS perc_disc,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cust_credit_limit)
OVER () AS perc_cont
FROM customers WHERE cust_city='Marshal';
CUST_ID CUST_CREDIT_LIMIT PERC_DISC PERC_CONT
---------- ----
------------- ---------- ----------
28344 1500 5000 5000
8962 1500 5000
5000
36651 1500 5000 5000
32497 1500 5000 5000
15192 30
00 5000 5000
102077 3000 5000 5000
102343 3000 5000 5000
8270 3000 5000 5000
21380 5000 5000 5000
13808 5000 50
00 5000
101784 5000 5000 5000
30420 5000 5000 5000
10346
7000 5000 5000
31112 7000 5000 5000
35266 7000 5000 50
00
3424 9000 5000 5000
100977 9000 5000 5000
103066 10000
5000 5000
35225 11000 5000 5000
14459 11000 5000 5000
172
68 11000 5000 5000
100421 11000 5000 5000
41496 15000 5000
5000
For PERCEN
TILE_DISC, the expression in the ORDER BY clause can be of any data type that you can sort (numeric,
string, date, and so on). However, the expression in the ORDER BY clause must be a numeric or datetime typ
e (including intervals) because linear interpolation is used to evaluate PERCENTILE_CONT. If the expression is of type <
code>DATE, the interpolated result is rounded to the smallest unit for the type. For a DATE type, the interpolate
d value will be rounded to the nearest second, for interval types to the nearest second (INTERVAL DAY TO SECOND) or to the month(INTERVAL YEAR TO MONTH).
L
ike other aggregates, the inverse percentile functions ignore NULLs in evaluating the result. For example, when you want
to find the median value in a set, Oracle Database ignores the NULLs and finds the median among the non-null values. Yo
u can use the NULLS FIRST/NULLS LAST option in the ORDER BY clause, but they will be ignored as NULLs are ignored.
Example 21-18 Hypothetical Rank and Distributi on Example 2
SELECT prod_subcategory, RANK(10,8) WITHIN GROUP (ORDER BY prod_list_ price DESC,prod_min_price) AS HRANK, TO_CHAR(PERCENT_RANK(10,8) WITHIN GROUP (ORDER BY prod_list_price, prod_min_price),'9.999') AS H PERC_RANK, TO_CHAR(CUME_DIST (10,8) WITHIN GROUP (ORDER BY prod_list_price, prod_min_price),'9.999') AS HCUME_DIST FROM products WHER E prod_subcategory LIKE 'Recordable%' GROUP BY prod_subcategory; PROD_SUBCATEGORY HRANK HPERC_ HCUME_ ----------------- --- ----- ------ ------ Recordable CDs 4 .571 .625 Recordable DVD Discs 5 .200 .333
These functions can appear in the HAVING clause of a query just like other aggregate functions. They cannot
be used as either reporting aggregate functions or windowing aggregate functions.
The regression functions sup port the fitting of an ordinary-least-squares regression line to a set of number pairs. You can use them as both aggregate functions or windowing or reporting functions.
The functions are as follows:
Oracle applies the fun
ction to the set of (e1, e2) pairs after eliminating all pairs for which either of e1 or e2 is null. e1 is interpreted as a value of the dependent variable (a "y value"), and e2 is interpreted as a val
ue of the independent variable (an "x value"). Both expressions must be numbers.
The regression functions are all computed sim
ultaneously during a single pass through the data. They are frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.
REGR_COUNT returns the number of non-null number pairs used to f
it the regression line. If applied to an empty set (or if there are no (e1, e2) pairs where neither of e1 or e2 is null), the function returns 0.
REGR_AVGY and REGR_AVGX compute the averages of the depend
ent variable and the independent variable of the regression line, respectively. REGR_AVGY computes the average of its fi
rst argument (e1) after eliminating (e1, e2) pairs where either of e1 or e2
is null. Similarly, REGR_AVGX computes the average of its second argument (e2) after null eliminati
on. Both functions return NULL if applied to an empty set.
The REGR_SLOPE function computes
the slope of the regression line fitted to non-null (e1, e2) pairs.
The REGR_INTERCEPT
function computes the y-intercept of the regression line. REGR_INTERCEPT returns NULL whenever slope or th
e regression averages are NULL.
The REGR_R2 function computes the coefficient of determination (usually
called "R-squared" or "goodness of fit") for the regression line.
REGR_R2 returns values between 0 and 1 when the
regression line is defined (slope of the line is not null), and it returns NULL otherwise. The closer the value is to 1
, the better the regression line fits the data.
REGR_SXX, REGR_SYY and REGR_SXY functions are used in computing various diagno
stic statistics for regression analysis. After eliminating (e1, e2) pairs where either of e1 o
r e2 is null, these functions make the following computations:
REGR_SXX: REGR_COUNT(e1, e2) * VAR_POP(e2) REGR_SYY: REGR_COUNT(e1,e2) * VAR_POP(e1) REGR_SXY: REGR_COUNT(e1,e2) * COVAR_POP(e1, e2)
Some common diagnostic statistics that accompany linear regression ana lysis are given in Table 21-2, "Common Diagnostic Statistics and Their Expressions ". Note that this rel ease's new functions allow you to calculate all of these.
Table 21-2 Common Diagnostic Statisti cs and Their Expressions
| Type of Statistic | Expression |
|---|---|
| Adjusted R2 | 1-((1 - REGR_R2)*((REGR_COU
NT-1)/(REGR_COUNT-2))) |
| Standard error | SQRT((REGR_SYY-(POWER(REGR_SXY,2)/REGR_SXX))/(REGR_COUNT-2)) |
| Total sum of squares | REGR_SYY |
| Regress ion sum of squares | POWER(REGR_SXY,2) / REGR_SXX |
| Residual sum of squares | REGR_SYY - (POWER(REGR_SXY,2)/REGR_SXX) |
| t statistic for slope | REGR_SLOPE * SQRT(REGR_SXX) / (Standard error) |
| t statistic for y-int ercept | REGR_INTERCEPT / ((Standard error) * SQRT((1/REGR_COUNT)+(P
OWER(REGR_AVGX,2)/REGR_SXX)) |
In this example, we compute an ordinary-least-squares regression line that expresses
the quantity sold of a product as a linear function of the product's list price. The calculations are grouped by sales channel. The
values SLOPE, INTCPT, RSQR are slope, intercept, and coefficient of determination of the regre
ssion line, respectively. The (integer) value COUNT is the number of products in each channel for whom both quantity sol
d and list price data are available.
SELECT s.channel_id, REGR_SLOPE(s.quantity_sold, p.prod_list_price
) SLOPE,
REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT,
REGR_R2(s.quantity_sold, p.prod_list_price) RSQR,
REGR_COUN
T(s.quantity_sold, p.prod_list_price) COUNT,
REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP,
REGR_AVGY(s.quantity_sold, p
.prod_list_price) AVGQSOLD
FROM sales s, products p WHERE s.prod_id=p.prod_id
AND p.prod_category='Electronics' AND s.time_id=to_DA
TE('10-OCT-2000')
GROUP BY s.channel_id;
CHANNEL_ID SLOPE INTCPT RSQR COUNT AVGLISTP AVGQSOLD
---------- ---
------- ---------- ---------- ---------- ---------- ----------
2 0 1 1 39 466.656667
1
3 0 1 1 60 459.99 1
4 0 1 1
19 526.305789 1
Instead of counting how often a given event occurs (for example, how often someone has purchased milk at the grocery), frequent itemsets provides a mechanism for counting how often multiple events occur together (for example, how often someone has purchased both milk and cereal together at the grocery store).
The input to the frequent-itemsets operation is a set of data that represents collections of items (itemsets). Some examples of itemsets could be all of the products that a given customer purchased in a single trip to the grocery store (commonly called a market basket), the web-page s that a user accessed in a single session, or the financial services that a given customer utilizes. The notion of a frequent itemse t is to find those itemsets that occur most often. If you apply the frequent-itemset operator to a grocery store's point-of-sale data , you might, for example, discover that milk and bananas are the most commonly bought pair of items.
Frequent itemsets have th us been used in business intelligence environments for many years, with the most common one being for market basket analysis in the r etail industry. Frequent itemsets are integrated with the database, operating on top of relational tables and accessed through SQL. T his integration provides a couple of key benefits:
Applications that previously relied on frequent items et operations now benefit from significantly improved performance as well as simpler implementation.
SQL -based applications that did not previously use frequent itemsets can now be easily extended to take advantage of this functionality.
Frequent itemsets analysis is performed with the PL/SQL package DBMS_FREQUENT_ITEMSETS. See PL/SQL Packages and Types Reference for more information.
Oracle introduces a set of SQL statistical functions and a statistics package, DBMS_STAT_FUNCS. This section lists
some of the new functions along with basic syntax.
See PL /SQL Packages and Types Reference for detailed information about the DBMS_STAT_FUNCS package and Oracle Database SQL Reference for syntax and semantics. p>
You can calculate the following descriptive statistics:
Median of a Data Set
Median (expr) [OVER (query_partition_clause)]
Mode of a Data Set
STATS_MODE (expr)
You can calculate the following descriptive statistics:
One-Sample T-Test
STATS_T_TEST_ONE (expr1, expr2 (a constant) [, return_value])
Paired-Samples T-Test
STATS_T_TEST_PAIRED (expr1, expr2 [, return_value])
Independent-Samp les T-Test. Pooled Variances
STATS_T_TEST_INDEP (expr1, expr2 [, return_value])
Independent-Samples T-Test, Unpooled Variances
STATS_T_TEST_INDEPU (expr1, expr2 [, return_v alue])
The F-Test
STATS_F_TEST (expr1, expr2 [, return_value])
One-Way ANOVA
STATS_ONE_WAY_ANOVA (expr1, expr2 [, return_value])li>
You can calculate crosstab statistics using the following syntax:
STATS_CROSSTAB (expr1, expr2 [, return_value])
Can return any one of the following:
Observed value of chi-squared
Significance of observed chi-squared
Degree of freedom for chi-squared
Phi coefficient, Cramer's V statistic
Contingency coefficient
Cohen's Kappa
You can calculate hypothesis statistics using the following syntax:
ST ATS_BINOMIAL_TEST (expr1, expr2, p [, return_value])
Binomial Test/Wilcoxon Signed Ranks Test
< pre xml:space="preserve">STATS_WSR_TEST (expr1, expr2 [, return_value])Mann-Whitney Test
STATS_MW_TEST (expr1, expr2 [, return_value])
Kolmogorov-Smirnov Test
STATS_KS_TEST (expr1, expr2 [, return_value])
You can calculate the following parametric statistics:
Spearman's rho Coefficient
Kendall's tau-b Coefficient
CORR_K (expr1, expr2 [, return_value])
In addition to the functions, this release has a ne
w PL/SQL package, DBMS_STAT_FUNCS. It contains the descriptive statistical function SUMMARY along with func
tions to support distribution fitting. The SUMMARY function summarizes a numerical column of a table with a variety of d
escriptive statistics. The five distribution fitting functions support normal, uniform, Weibull, Poisson, and exponential distributio
ns.
For a giv
en expression, the WIDTH_BUCKET function returns the bucket number that the result of this expression will be assigned a
fter it is evaluated. You can generate equiwidth histograms with this function. Equiwidth histograms divide data sets into buckets wh
ose interval size (highest value to lowest value) is equal. The number of rows held by each bucket will vary. A related function,
Equiwidth histograms can be generated only for numeric, date or datetime types. S
o the first three parameters should be all numeric expressions or all date expressions. Other types of expressions are not allowed. I
f the first parameter is NULL, the result is NULL. If the second or the third parameter is NULL, an error message is returned, as a NULL value cannot denote any end point (or any point) for a range in a date or num
eric value dimension. The last parameter (number of buckets) should be a numeric expression that evaluates to a positive integer valu
e; 0, NULL, or a negative value will result in an error.
Buckets are numbered from 0 to (n+1). Bucke
t 0 holds the count of values less than the minimum. Bucket(n+1) holds the count of values greater than or equal to the
maximum specified value.
The WIDTH_BUCKET takes four expressions as parameters. The first
parameter is the expression that the equiwidth histogram is for. The second and third parameters are expressions that denote the end
points of the acceptable range for the first parameter. The fourth parameter denotes the number of buckets.
WIDTH_BUCKET(expression, minval expression, maxval expression, num buckets)
Consider the following data from table < code>customers, that shows the credit limits of 17 customers. This data is gathered in the query shown in Example 21-19.
CUST_ID CUST_CREDIT_LIMIT
--------- -----------------
10346 70
00
35266 7000
41496 15000
35225 11000
3424 9000
28344
1500
31112 7000
8962 1500
15192 3000
21380 5000
36651 1500
30420 5000
8270 3000
17268 11000
14459
11000
13808 5000
32497 1500
100977 9000
102077 3000
1030
66 10000
101784 5000
100421 11000
102343 3000
In the table
customers, the column cust_credit_limit contains values between 1500 and 15000, and we can assign the valu
es to four equiwidth buckets, numbered from 1 to 4, by using WIDTH_BUCKET (cust_credit_limit, 0, 20000, 4). Ideally each
bucket is a closed-open interval of the real number line, for example, bucket number 2 is assigned to scores between 5000.0000 and 9
999.9999..., sometimes denoted [5000, 10000) to indicate that 5,000 is included in the interval and 10,000 is excluded. To accommodat
e values outside the range [0, 20,000), values less than 0 are assigned to a designated underflow bucket which is numbered 0, and val
ues greater than or equal to 20,000 are assigned to a designated overflow bucket which is numbered 5 (num buckets + 1 in general). Se
e Figure 21-3 for a graphical illustration of how the buckets are assigned.
You can specify the bounds in the reverse order, for example, WIDTH_BUCKET (cust_credit_l
imit, 20000, 0, 4). When the bounds are reversed, the buckets will be open-closed inter
vals. In this example, bucket number 1 is (15000,20000], bucket number 2 is (10000,15000], and bucket numbe
r 4, is (0,5000]. The overflow bucket will be numbered 0 (20000, +infinity), and the underflow
bucket will be numbered 5 (-infinity, 0].
It is an error if the bucket count parameter is 0 or nega tive.
Example 21-19 WIDTH_BUCKET
The following query shows the bucket numbers for the credit limits in the customers table for both cases where the boundaries are specified in regular or reverse order. We use a range of 0 to 20,000.
SELECT cust_id, cust_credit_limit,
WIDTH_BUCKET(cust_credit_limit,0,20000,4) AS WIDTH_BUCKET_UP,
WIDTH_BUCKET(cust_credit_limit,
20000, 0, 4) AS WIDTH_BUCKET_DOWN
FROM customers WHERE cust_city = 'Marshal';
CUST_ID CUST_CREDIT_LIMIT WIDTH_BUCKET_UP WIDTH_BUC
KET_DOWN
---------- ----------------- --------------- -----------------
10346 7000 2
3
35266 7000 2 3
41496 15000 4 2
35225 11000 3 2
3424 9000 2 3
28344
1500 1 4
31112 7000 2 3
8962
1500 1 4
15192 3000 1 4
21380 50
00 2 4
36651 1500 1 4
30420 5000
2 4
8270 3000 1 4
17268 11000
3 2
14459 11000 3 2
13808 5000 2
4
32497 1500 1 4
100977 9000 2
3
102077 3000 1 4
103066 10000 3
3
101784 5000 2 4
100421 11000 3 2
1
02343 3000 1 4
Oracle offers a facility for creating your own functions, called user-defined aggregate functions. These functions are writ ten in programming languages such as PL/SQL, Java, and C, and can be used as analytic functions or aggregates in materialized views. See Oracle Data Cartridge Developer's Guide for further information regarding syntax and restrictions.
The advantages of these functions are:
Highly com plex functions can be programmed using a fully procedural language.
Higher scalability than other techni ques when user-defined functions are programmed for parallel processing.
Object datatypes can be process ed.
As a simple example of a user-defined aggregate function, consider the skew statistic. This calculation measures
if a data set has a lopsided distribution about its mean. It will tell you if one tail of the distribution is significantly larger t
han the other. If you created a user-defined aggregate called udskew and applied it to the credit limit data in the prio
r example, the SQL statement and results might look like this:
SELECT USERDEF_SKEW(cust_credit_limit) F ROM customers WHERE cust_city='Marshal'; USERDEF_SKEW ============ 0.583891
Before building user-defined aggregate funct
ions, you should consider if your needs can be met in regular SQL. Many complex calculations are possible directly in SQL, particular
ly by using the CASE expression.
Staying with regular SQL will enable simpler development, and many query operati ons are already well-parallelized in SQL. Even the earlier example, the skew statistic, can be created using standard, albeit lengthy , SQL.
Oracle now supports simple and searched CASE statements. CASE
code> statements are similar in purpose to the DECODE statement, but they offer more flexibility and logical power. They
are also easier to read than traditional DECODE statements, and offer better performance as well. They are commonly use
d when breaking categories into buckets like age (for example, 20-29, 30-39, and so on). The syntax for simple statements is:
The synta x for searched statements is:
WHEN condition THEN return_expr [, WHEN condition THEN return_expr]... < /pre>You can specify only 255 arguments and each
WHEN...THENpair counts as two arguments. For a wor karound to this limit, see Oracle Database S QL Reference.Example 21-20 CASE
Suppose you wanted to find the average salary of all employees in the c ompany. If an employee's salary is less than $2000, you want the query to use $2000 instead. Without a
CASEstatement, y ou would have to write this query as follows,SELECT AVG(foo(e.sal)) FROM emps e;In this,
foois a function that returns its input if the input is greater than 2000, and returns 2000 otherwise. The query has pe rformance implications because it needs to invoke a function for each row. Writing custom functions can also add to the development l oad.Using
CASEexpressions in the database without PL/SQL, this query can be rewritten as:SELECT AVG(CASE when e.sal > 2000 THEN e.sal ELSE 2000 end) FROM emps e;Using a
CASEexpression lets you avoid developing custom functions and can also perform faster.Creating Histograms With User-Defi ned Buckets
You can use the
CASEstatement when you want to obta in histograms with user-defined buckets (both in number of buckets and width of each bucket). The following are two examples of histo grams created withCASEstatements. In the first example, the histogram totals are shown in multiple columns and a singl e row is returned. In the second example, the histogram is shown with a label column and a single column for totals, and multiple row s are returned.Example 21-21 Histogram Example 1
SELECT SUM(CASE WHEN cust_credit_limit B ETWEEN 0 AND 3999 THEN 1 ELSE 0 END) AS "0-3999", SUM(CASE WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN 1 ELSE 0 END) AS " 4000-7999", SUM(CASE WHEN cust_credit_limit BETWEEN 8000 AND 11999 THEN 1 ELSE 0 END) AS "8000-11999", SUM(CASE WHEN cust_credit_l imit BETWEEN 12000 AND 16000 THEN 1 ELSE 0 END) AS "12000-16000" FROM customers WHERE cust_city = 'Marshal'; 0-3999 4000-799 9 8000-11999 12000-16000 ---------- ---------- ---------- ----------- 8 7 7 1Example 21-22 Histogram Example 2
SELECT (CASE WHEN cust_credit_limit BET WEEN 0 AND 3999 THEN ' 0 - 3999' WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN ' 4000 - 7999' WHEN cust_credit_limit BET WEEN 8000 AND 11999 THEN ' 8000 - 11999' WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN '12000 - 16000' END) AS BUCKET, COUNT(*) AS Count_in_Group FROM customers WHERE cust_city = 'Marshal' GROUP BY (CASE WHEN cust_credit_limit BETWEEN 0 AND 3999 THEN ' 0 - 3999' WHEN cust_credit_limit BETWEEN 4000 AND 7999 THEN ' 4000 - 7999' WHEN cust_credit_limit BETWEEN 8000 AND 11999 THEN ' 8000 - 11999' WHEN cust_credit_limit BETWEEN 12000 AND 16000 THEN '12000 - 16000' END); BUCKET COUNT_IN_GROUP ---------- --- -------------- 0 - 3999 8 4000 - 7999 7 8000 - 11999 7 12000 - 16000 1
Data is <
/a>normally stored in sparse form. That is, if no value exists for a given combination of di
mension values, no row exists in the fact table. However, you may want to view the data in dense form, with rows for all combination
of dimension values displayed even when no fact data exist for them. For example, if a product did not sell during a particular time
period, you may still want to see the product for that time period with zero sales value next to it. Moreover, time series calculatio
ns can be performed most easily when data is dense along the time dimension. This is because dense data will fill a consistent number
of rows for each period, which in turn makes it simple to use the analytic windowing functions with physical offsets. Data densifica
tion is the process of converting spare data into dense form. To overcome the problem of sparsity, you can use a partitioned outer jo
in to fill the gaps in a time series or any other dimension. Such a join extends the conventional outer join syntax by applying the o
uter join to each logical partition defined in a query. Oracle logically partitions the rows in your query based on the expression yo
u specify in the PARTITION BY clause. The result of a partitioned outer join is a UNION of the
outer joins of each of the partitions in the logically partitioned table with the table on the other side of the join. Note that you
can use this type of join to fill the gaps in any dimension, not just the time dimension. Most of the examples here focus on the tim
e dimension because it is the dimension most frequently used as a basis for comparisons.
The syntax for partitioned outer join extends the ANSI SQL JOIN clause with the phrase
PARTITION BY followed by an expression list. The expressions in the list specify the group to which the outer joi
n is applied. The following are the two forms of syntax normally used for partitioned outer join:
SELEC
T .....
FROM table_reference
PARTITION BY (expr [, expr ]... )
RIGHT OUTER JOIN table_reference
SELECT .....
FROM table_reference
LE
FT OUTER JOIN table_reference
PARTITION BY {expr [,expr ]...)
Note that FULL OUTER JOIN
code> is not supported with a partitioned outer join.
A typical situation with a spa rse dimension is shown in the following example, which computes the weekly sales and year-to-date sales for the product Bounce for we eks 20-30 in 2000 and 2001:
SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year,
t.Cale
ndar_Week_Number Week, SUM(Amount_Sold) Sales
FROM Sales s, Times t, Products p
WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id
AND
p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND
t.Calendar_Week_Number BETWEEN 20 AND 30
GROUP BY p.Prod_Name
, t.Calendar_Year, t.Calendar_Week_Number;
PRODUCT_NAME YEAR WEEK SALES
--------------- ---------- ---------- --
--------
Bounce 2000 20 801
Bounce 2000 21 4062.24
Bounce 2000
22 2043.16
Bounce 2000 23 2731.14
Bounce 2000 24 4419.36
Bounce
2000 27 2297.29
Bounce 2000 28 1443.13
Bounce 2000 29 1927.3
8
Bounce 2000 30 1927.38
Bounce 2001 20 1483.3
Bounce 2001
21 4184.49
Bounce 2001 22 2609.19
Bounce 2001 23 1416.95
Bounce
2001 24 3149.62
Bounce 2001 25 2645.98
Bounce 2001 27 2125.12
Bounc
e 2001 29 2467.92
Bounce 2001 30 2620.17
In this example, we would ex pect 22 rows of data (11 weeks each from 2 years) if the data were dense. However we get only 18 rows because weeks 25 and 26 are mis sing in 2000, and weeks 26 and 28 in 2001.
We can take the sparse data of the preceding query and do a partitioned outer join wi
th a dense set of time data. In the following query, we alias our original query as v and we select data from the
times table, which we alias as t. Here we retrieve 22 rows because there are no gaps in the series. The four adde
d rows each have 0 as their Sales value set to 0 by using the NVL function.
SELECT Product
_Name, t.Year, t.Week, NVL(Sales,0) dense_sales
FROM
(SELECT SUBSTR(p.Prod_Name,1,15) Product_Name,
t.Calendar_Year Year, t.Calend
ar_Week_Number Week, SUM(Amount_Sold) Sales
FROM Sales s, Times t, Products p
WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_
id AND
p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND
t.Calendar_Week_Number BETWEEN 20 AND 30
GROUP BY p.P
rod_Name, t.Calendar_Year, t.Calendar_Week_Number) v
PARTITION BY (v.Product_Name)
RIGHT OUTER JOIN
(SELECT DISTINCT Calendar_Week_N
umber Week, Calendar_Year Year
FROM Times
WHERE Calendar_Year IN (2000, 2001)
AND Calendar_Week_Number BETWEEN 20 AND 30) t
ON
(v.week = t.week AND v.Year = t.Year)
ORDER BY t.year, t.week;
PRODUCT_NAME YEAR WEE
K DENSE_SALES
--------------- ---------- ---------- -----------
Bounce 2000 20 801
Bounce
2000 21 4062.24
Bounce 2000 22 2043.16
Bounce 2000 23 2731.14
Bo
unce 2000 24 4419.36
Bounce 2000 25 0
Bounce 2000
26 0
Bounce 2000 27 2297.29
Bounce 2000 28 1443.13
Bounce
2000 29 1927.38
Bounce 2000 30 1927.38
Bounce 2001 20 1483.3
B
ounce 2001 21 4184.49
Bounce 2001 22 2609.19
Bounce 2001
23 1416.95
Bounce 2001 24 3149.62
Bounce 2001 25 2645.98
Bounce
2001 26 0
Bounce 2001 27 2125.12
Bounce 2001 28 0
Bounce 2001 29 2467.92
Bounce 2001 30 2620.17
Note that in this que
ry, a WHERE condition was placed for weeks between 20 and 30 in the inline view for the time dimension. This was introdu
ced to keep the result set small.
N-dimensional data is typically displayed as a dense 2-dimensional cross tab of (n - 2) page dimensions. This requires that all dimension values for the two dimensions appearing in the cross tab be filled in. The following is another example where the part itioned outer join capability can be used for filling the gaps on two dimensions:
WITH v1 AS
(SELECT p
.prod_id, country_id, calendar_year,
SUM(quantity_sold) units, SUM(amount_sold) sales
FROM sales s, products p, customers c, ti
mes t
WHERE s.prod_id in (147, 148) AND t.time_id = s.time_id AND
c.cust_id = s.cust_id AND p.prod_id = s.prod_id
GROUP BY p.
prod_id, country_id, calendar_year),
v2 AS -- countries to use for densifications.
(SELECT DISTINCT countr
y_id
FROM customers
WHERE country_id IN (52782, 52785, 52786, 52787, 52788)),
v3 AS -- years to use fo
r densifications
(SELECT DISTINCT calendar_year FROM times)
SELECT v4.prod_id, v4.country_id, v3.calendar_year, units, sales
FROM
(SELECT prod_id, v2.country_id, calendar_year, units, sales
FROM v1 PARTITION BY (prod_id)
RIGHT OUTER JOIN v2
--densifies on country
ON (v1.country_id = v2.country_id)) v4
PARTITION BY (prod_id,country_id)
RIGHT OUTER JOIN v3
--densifies on year
ON (v4.calendar_year = v3.calendar_year)
ORDER BY 1, 2, 3;
In this query, the WITH
sub-query factoring clause v1, summarizes sales data at the product, country, and year level. This result is sparse but
users may want to see all the country, year combinations for each product. To achieve this, we take each partition of v1 based on product values and outer join it on the country dimension first. This will give us all values of country for each product.
We then take that result and partition it on product and country values and then outer join it on time dimension. This will give us
all time values for each product and country combination.
PROD_ID COUNTRY_ID CALENDAR_YEAR UNITS
SALES
---------- ---------- ------------- ---------- ----------
147 52782 1998
147 52782
1999 29 209.82
147 52782 2000 71 594.36
147 52782 2001 345
2754.42
147 52782 2002
147 52785 1998 1 7.99
147 52785
1999
147 52785 2000
147 52785 2001
147 52785 2002
147
52786 1998 1 7.99
147 52786 1999
147 52786 2000 2 15.
98
147 52786 2001
147 52786 2002
147 52787 1998
147 52787
1999
147 52787 2000
147 52787 2001
147 52787 2002
147
52788 1998
147 52788 1999
147 52788 2000 1 7.99
147
52788 2001
147 52788 2002
148 52782 1998 139 4046.67
148 5
2782 1999 228 5362.57
148 52782 2000 251 5629.47
148 52782 200
1 308 7138.98
148 52782 2002
148 52785 1998
148 52785 1999
148 52785 2000
148 52785 2001
148 52785 2002
148 52786
1998
148 52786 1999
148 52786 2000
148 52786 2001
148
52786 2002
148 52787 1998
148 52787 1999
148 52787 2000
148 52787 2001
148 52787 2002
148 52788 1998 4 117.23
148 52788 1999
148 52788 2000
148 52788 2001
148 52788 2
002
An inventory table typically tracks quantity of units available for various products. This table is sparse: it only stores a row for a product wh en there is an event. For a sales table, the event is a sale, and for the inventory table, the event is a change in quantity availabl e for a product. For example, consider the following inventory table:
CREATE TABLE invent_table (
produ
ct VARCHAR2(10),
time_id DATE,
quant NUMBER);
INSERT INTO invent_table VALUES
('bottle', TO_DATE('01/04/01', 'DD/MM/YY'), 10);
INSE
RT INTO invent_table VALUES
('bottle', TO_DATE('06/04/01', 'DD/MM/YY'), 8);
INSERT INTO invent_table VALUES
('can', TO_DATE('01/04/
01', 'DD/MM/YY'), 15);
INSERT INTO invent_table VALUES
('can', TO_DATE('04/04/01', 'DD/MM/YY'), 11);
The inventory table now has the following rows:
PRODUCT TIME_ID QUANT
---------- --------- -----
bottle 01-APR-01
10
bottle 06-APR-01 8
can 01-APR-01 15
can 04-APR-01 11
For reporting purposes, users may
want to see this inventory data differently. For example, they may want to see all values of time for each product. This can be acco
mplished using partitioned outer join. In addition, for the newly inserted rows of missing time periods, users may want to see the va
lues for quantity of units column to be carried over from the most recent existing time period. The latter can be accomplished using
analytic window function LAST_VALUE value. Here is the query and the desired output:
WITH
v1 AS
(SELECT time_id
FROM times
WHERE times.time_id BETWEEN
TO_DATE('01/04/01', 'DD/MM/YY')
AND TO_DATE('07/04/01', 'DD/M
M/YY'))
SELECT product, time_id, quant quantity,
LAST_VALUE(quant IGNORE NULLS)
OVER (PARTITION BY product ORDER BY time_id)
repeated_quantity
FROM
(SELECT product, v1.time_id, quant
FROM invent_table PARTITION BY (product)
RIGHT OUTER JOIN v1
ON (
v1.time_id = invent_table.time_id))
ORDER BY 1, 2;
The inner query computes a partitioned outer join on time within each
product. The inner query densifies the data on the time dimension (meaning the time dimension will now have a row for each day of the
week). However, the measure column quantity will have nulls for the newly added rows (see the output in the column
The outer query uses the analytic function LAST_VALUE. Applying this
function partitions the data by product and orders the data on the time dimension column (time_id). For each row, the fu
nction finds the last non-null value in the window due to the option IGNORE NULLS, which you can use with b
oth LAST_VALUE and FIRST_VALUE. We see the desired output in the column repeated_quantity in t
he following output:
PRODUCT TIME_ID QUANTITY REPEATED_QUANTITY
---------- --------- -------- ----
-------------
bottle 01-APR-01 10 10
bottle 02-APR-01 10
bottle 03-APR-01
10
bottle 04-APR-01 10
bottle 05-APR-01 10
bottle 0
6-APR-01 8 8
bottle 07-APR-01 8
can 01-APR-01 15 15
c
an 02-APR-01 15
can 03-APR-01 15
can 04-APR-01 11
11
can 05-APR-01 11
can 06-APR-01 11
can 07-APR-01
11
Examples in previous section illustrate how to use partitioned outer join to fill gaps in one or more dimensions. However, the r
esult sets produced by partitioned outer join have null values for columns that are not included in the PARTITION
BY list. Typically, these are measure columns. Users can make use of analytic SQL functions to replace those null values with
a non-null value.
For example, the following query computes monthly totals for products 64MB Memory card and DVD-R Discs (prod
uct IDs 122 and 136) for the year 2000. It uses partitioned outer join to densify data for all months. For the missing months, it the
n uses the analytic SQL function AVG to compute the sales and units to be the average of the months when the product was
sold.
If working in SQL*Plus, the following two commands will wrap the column headings for greater readability of results:
col computed_units heading 'Computed|_units'
col computed_sales heading 'Computed|_sales'
WITH V AS
(SELECT substr(p.prod_name,1,12) prod_name, calendar_month_desc,
SUM(quantity_sold) units, SUM(amount_sold) sales
FROM sales
s, products p, times t
WHERE s.prod_id in (122,136) AND calendar_year = 2000
AND t.time_id = s.time_id
AND p.prod_id =
s.prod_id
GROUP BY p.prod_name, calendar_month_desc)
SELECT v.prod_name, calendar_month_desc, units, sales,
NVL(units, AVG(units
) OVER (partition by v.prod_name)) computed_units,
NVL(sales, AVG(sales) OVER (partition by v.prod_name)) computed_sales
FROM
(S
ELECT DISTINCT calendar_month_desc
FROM times
WHERE calendar_year = 2000) t
LEFT OUTER JOIN V
PARTITION BY (prod_name)
USING (calendar_month_desc);
computed computed
PROD_NAME CALENDAR UNITS
SALES _units _sales
------------ -------- ---------- ---------- ---------- ----------
64MB Memory 2000-01 112 41
29.72 112 4129.72
64MB Memory 2000-02 190 7049 190 7049
64MB Memory 2000-03 47 172
4.98 47 1724.98
64MB Memory 2000-04 20 739.4 20 739.4
64MB Memory 2000-05 47 1738
.24 47 1738.24
64MB Memory 2000-06 20 739.4 20 739.4
64MB Memory 2000-07
72.6666667 2686.79
64MB Memory 2000-08 72.6666667 2686.79
64MB Memory 2000-09
72.6666667 2686.79
64MB Memory 2000-10 72.6666667 2686.79
64MB Memory 2000-11
72.6666667 2686.79
64MB Memory 2000-12 72.6666667 2686.79
DVD-R Discs, 2000-01 167 3683.5
167 3683.5
DVD-R Discs, 2000-02 152 3362.24 152 3362.24
DVD-R Discs, 2000-03 188 4148.02
188 4148.02
DVD-R Discs, 2000-04 144 3170.09 144 3170.09
DVD-R Discs, 2000-05 189 4164.87
189 4164.87
DVD-R Discs, 2000-06 145 3192.21 145 3192.21
DVD-R Discs, 2000-07
124.25 2737.71
DVD-R Discs, 2000-08 124.25 2737.71
DVD-R Discs, 2000-09 1 18.91
1 18.91
DVD-R Discs, 2000-10 124.25 2737.71
DVD-R Discs, 2000-11 1
24.25 2737.71
DVD-R Discs, 2000-12 8 161.84 8 161.84
Densification is not just for reporting purpose. It also enables certain types of calculations, especially, time series calculations. Time series calculations are easier whe n data is dense along the time dimension. Dense data has a consistent number of rows for each time periods which in turn make it simp le to use analytic window functions with physical offsets.
To illustrate, let's first take the example on
"Filling Gaps in Data", and let's add an analytic function to that query. In the following enhanced version, we calculate weekly
year-to-date sales alongside the weekly sales. The NULL values that the partitioned outer join inserts in making the tim
e series dense are handled in the usual way: the SUM function treats them as 0's.
SELECT P
roduct_Name, t.Year, t.Week, NVL(Sales,0) Current_sales,
SUM(Sales)
OVER (PARTITION BY Product_Name, t.year ORDER BY t.week) Cum
ulative_sales
FROM
(SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year,
t.Calendar_Week_Number Week, SUM(Amount_
Sold) Sales
FROM Sales s, Times t, Products p
WHERE s.Time_id = t.Time_id AND
s.Prod_id = p.Prod_id AND p.Prod_name IN ('Bounc
e') AND
t.Calendar_Year IN (2000,2001) AND
t.Calendar_Week_Number BETWEEN 20 AND 30
GROUP BY p.Prod_Name, t.Calendar_Year, t.
Calendar_Week_Number) v
PARTITION BY (v.Product_Name)
RIGHT OUTER JOIN
(SELECT DISTINCT
Calendar_Week_Number Week, Calendar_Year Yea
r
FROM Times
WHERE Calendar_Year in (2000, 2001)
AND Calendar_Week_Number BETWEEN 20 AND 30) t
ON (v.week = t.week AND v.Year = t.
Year)
ORDER BY t.year, t.week;
PRODUCT_NAME YEAR WEEK CURRENT_SALES CUMULATIVE_SALES
--------------- ---------- -----
----- ------------- ----------------
Bounce 2000 20 801 801
Bounce 2000
21 4062.24 4863.24
Bounce 2000 22 2043.16 6906.4
Bounce
2000 23 2731.14 9637.54
Bounce 2000 24 4419.36 14056.9
Bounce
2000 25 0 14056.9
Bounce 2000 26 0 14056.9
Bounce
2000 27 2297.29 16354.19
Bounce 2000 28 1443.13 17797.32
Bounce
2000 29 1927.38 19724.7
Bounce 2000 30 1927.38 21652.08
Bo
unce 2001 20 1483.3 1483.3
Bounce 2001 21 4184.49 5667.
79
Bounce 2001 22 2609.19 8276.98
Bounce 2001 23 1416.95
9693.93
Bounce 2001 24 3149.62 12843.55
Bounce 2001 25 2645.98
15489.53
Bounce 2001 26 0 15489.53
Bounce 2001 27 2125.12
17614.65
Bounce 2001 28 0 17614.65
Bounce 2001 29 24
67.92 20082.57
Bounce 2001 30 2620.17 22702.74
How do we use this feature to compare values across time periods? Specifically, how do we calculate a year-over-year sales comparison at the week level? The following query returns on the same row, for each product, the year-to-date sales for each week of 2001 with that of 2000.
Note that in this example we start with a
col Weekly_y
td_sales_prior_year heading 'Weekly_ytd|_sales_|prior_year'
WITH v AS
(SELECT SUBSTR(p.Prod_Name,1,6) Prod, t.Calendar_Year Year
,
t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales
FROM Sales s, Times t, Products p
WHERE s.Time_id = t.Time_id AND
s.Prod_id = p.Prod_id AND p.Prod_name in ('Y Box') AND
t.Calendar_Year in (2000,2001) AND
t.Calendar_Week_Number BETWEE
N 30 AND 40
GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number)
SELECT Prod , Year, Week, Sales,
Weekly_ytd_sales, We
ekly_ytd_sales_prior_year
FROM
(SELECT Prod, Year, Week, Sales, Weekly_ytd_sales,
LAG(Weekly_ytd_sales, 1) OVER
(PART
ITION BY Prod , Week ORDER BY Year) Weekly_ytd_sales_prior_year
FROM
(SELECT v.Prod Prod , t.Year Year, t.Week Week,
NVL(v.Sales,0) Sales, SUM(NVL(v.Sales,0)) OVER
(PARTITION BY v.Prod , t.Year ORDER BY t.week) weekly_ytd_sales
FROM
v
PARTITION BY (v.Prod )
RIGHT OUTER JOIN
(SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year
FROM Times
WHERE Calendar_Year IN (2000, 2001)) t
ON (v.week = t.week AND v.Year = t.Year)
) dense_sales
) ye
ar_over_year_salesWHERE Year = 2001 AND Week BETWEEN 30 AND 40
ORDER BY 1, 2, 3;
Weekly_ytd
_sales_
PROD YEAR WEEK SALES WEEKLY_YT
D_SALES prior_year
------ ---------- ---------- ---------- ---------------- ----------
Y Box 2001 30 7877.45
7877.45 0
Y Box 2001 31 13082.46 20959.91 1537.35
Y Box 2001 32 11569.02
32528.93 9531.57
Y Box 2001 33 38081.97 70610.9 39048.69
Y Box 2001 34 33109.65
103720.55 69100.79
Y Box 2001 35 0 103720.55 71265.35
Y Box 2001 36 4169
.3 107889.85 81156.29
Y Box 2001 37 24616.85 132506.7 95433.09
Y Box 2001 38 377
39.65 170246.35 107726.96
Y Box 2001 39 284.95 170531.3 118817.4
Y Box 2001 40
10868.44 181399.74 120969.69
In the FROM clause of the in-line view dense_sales, we use
a partitioned outer join of aggregate view v and time view t to fill gaps in the sales data along the time
dimension. The output of the partitioned outer join is then processed by the analytic function SUM ... OVER to compute
the weekly year-to-date sales (the weekly_ytd_sales column). Thus, the view dense_sales computes the year-t
o-date sales data for each week, including those missing in the aggregate view s. The in-line view year_over_year_sales
then computes the year ago weekly year-to-date sales using the LAG function. The LAG function labeled weekly_ytd_sales_prior_year specifies a PARTITION BY clause that pairs rows for the same week of ye
ars 2000 and 2001 into a single partition. We then pass an offset of 1 to the LAG function to get the weekly year to dat
e sales for the prior year.The outermost query block selects data from year_over_year_sales with the condition yr = 2001
, and thus the query returns, for each product, its weekly year-to-date sales in the specified weeks of years 2001 and 2000.
W hile the prior example shows us a way to create comparisons for a single time level, it would be even more useful to handle multiple time levels in a single query. For example, we could compare sales versus the prior period at the year, quarter, month and day levels . How can we create a query which performs a year-over-year comparison of year-to-date sales for all levels of our time hierarchy?
We will take several steps to perform this task. The goal is a single query with comparisons at the day, week, month, quarter, a nd year level. The steps are as follows:
We will create a view called cube_prod_time
, which holds a hierarchical cube of sales aggregated across times and products.
Then we w
ill create a view of the time dimension to use as an edge of the cube. The time edge, which holds a complete set of dates, will be pa
rtitioned outer joined to the sparse data in the view cube_prod_time.
Finally, for maximum performance,
we will create a materialized view, mv_prod_time, built using the same definition as cube_prod_time.
For more information regarding hierarchical cubes, see Chapter 20, " SQL for Aggregation in Data Warehouses". The materialized view is defined using the following statement:
Step 1 Create the hierarchical cube view
The materialized view shown in the following may al ready exist in your system; if not, create it now. If you must generate it, please note that we limit the query to just two products to keep processing time short:
CREATE OR REPLACE VIEW cube_prod_time AS
SELECT
(CASE
WHEN ((GRO
UPING(calendar_year)=0 )
AND (GROUPING(calendar_quarter_desc)=1 ))
THEN (TO_CHAR(calendar_year) || '_0')
WHEN ((G
ROUPING(calendar_quarter_desc)=0 )
AND (GROUPING(calendar_month_desc)=1 ))
THEN (TO_CHAR(calendar_quarter_desc) || '_1'
)
WHEN ((GROUPING(calendar_month_desc)=0 )
AND (GROUPING(t.time_id)=1 ))
THEN (TO_CHAR(calendar_month_desc) || '_2
')
ELSE (TO_CHAR(t.time_id) || '_3')
END) Hierarchical_Time,
calendar_year year, calendar_quarter_desc quarter,
calendar_m
onth_desc month, t.time_id day,
prod_category cat, prod_subcategory subcat, p.prod_id prod,
GROUPING_ID(prod_category, prod_subca
tegory, p.prod_id,
calendar_year, calendar_quarter_desc, calendar_month_desc,t.time_id) gid,
GROUPING_ID(prod_category, prod_su
bcategory, p.prod_id) gid_p,
GROUPING_ID(calendar_year, calendar_quarter_desc,
calendar_month_desc, t.time_id) gid_t,
SUM(amo
unt_sold) s_sold, COUNT(amount_sold) c_sold, COUNT(*) cnt
FROM SALES s, TIMES t, PRODUCTS p
WHERE s.time_id = t.time_id AND
p.prod_
name IN ('Bounce', 'Y Box') AND s.prod_id = p.prod_id
GROUP BY
ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc, t.
time_id),
ROLLUP(prod_category, prod_subcategory, p.prod_id);
Because this view is limited to two products, it returns
just over 2200 rows. Note that the column Hierarchical_Time contains string representations of time from all levels of t
he time hierarchy. The CASE expression used for the Hierarchical_Time column appends a marker (_0, _1, ...)
to each date string to denote the time level of the value. A _0 represents the year level, _1 is quarters, _2 is months, and _3 is d
ay. Note that the GROUP BY clause is a concatenated ROLLUP which specifies the rollup hierarch
y for the time and product dimensions. The GROUP BY clause is what determines the hierarchical cube content
s.
Step 2 Create the view edge_time, which is a complete set of date values< /font>
edge_time is the source for filling time gaps in the hierarchical cube using a partitioned outer
join. The column Hierarchical_Time in edge_time will be used in a partitioned join with the Hierarchi
cal_Time column in the view cube_prod_time. The following statement defines edge_time:
CREATE OR REPLACE VIEW edge_time AS
SELECT
(CASE
WHEN ((GROUPING(calendar_year)=0 )
AND (GROUPING(cal
endar_quarter_desc)=1 ))
THEN (TO_CHAR(calendar_year) || '_0')
WHEN ((GROUPING(calendar_quarter_desc)=0 )
AND (GRO
UPING(calendar_month_desc)=1 ))
THEN (TO_CHAR(calendar_quarter_desc) || '_1')
WHEN ((GROUPING(calendar_month_desc)=0 )
AND (GROUPING(time_id)=1 ))
THEN (TO_CHAR(calendar_month_desc) || '_2')
ELSE (TO_CHAR(time_id) || '_3')
END) Hier
archical_Time,
calendar_year yr, calendar_quarter_number qtr_num,
calendar_quarter_desc qtr, calendar_month_number mon_num,
calendar_month_desc mon, time_id - TRUNC(time_id, 'YEAR') + 1 day_num,
time_id day,
GROUPING_ID(calendar_year, calendar_quarter_de
sc,
calendar_month_desc, time_id) gid_t
FROM TIMES
GROUP BY ROLLUP
(calendar_year, (calendar_quarter_desc, calendar_quarter_number
),
(calendar_month_desc, calendar_month_number), time_id);
Step 3 Creat e the materialized view mv_prod_time to support faster performance
The materialized view definition is a dupli
cate of the view cube_prod_time defined earlier. Because it is a duplicate query, references to cube_prod_time will be rewritten to use the mv_prod_time materialized view. The following materialized may already exist in your sy
stem; if not, create it now. If you must generate it, please note that we limit the query to just two products to keep processing tim
e short.
CREATE MATERIALIZED VIEW mv_prod_time
REFRESH COMPLETE ON DEMAND AS
SELECT
(CASE
WHEN
((GROUPING(calendar_year)=0 )
AND (GROUPING(calendar_quarter_desc)=1 ))
THEN (TO_CHAR(calendar_year) || '_0')
WHEN
((GROUPING(calendar_quarter_desc)=0 )
AND (GROUPING(calendar_month_desc)=1 ))
THEN (TO_CHAR(calendar_quarter_desc) ||
'_1')
WHEN ((GROUPING(calendar_month_desc)=0 )
AND (GROUPING(t.time_id)=1 ))
THEN (TO_CHAR(calendar_month_desc) ||
'_2')
ELSE (TO_CHAR(t.time_id) || '_3')
END) Hierarchical_Time,
calendar_year year, calendar_quarter_desc quarter,
calend
ar_month_desc month, t.time_id day,
prod_category cat, prod_subcategory subcat, p.prod_id prod,
GROUPING_ID(prod_category, prod_s
ubcategory, p.prod_id,
calendar_year, calendar_quarter_desc, calendar_month_desc,t.time_id) gid,
GROUPING_ID(prod_category, pro
d_subcategory, p.prod_id) gid_p,
GROUPING_ID(calendar_year, calendar_quarter_desc,
calendar_month_desc, t.time_id) gid_t,
SUM
(amount_sold) s_sold, COUNT(amount_sold) c_sold, COUNT(*) cnt
FROM SALES s, TIMES t, PRODUCTS p
WHERE s.time_id = t.time_id AND
p.p
rod_name IN ('Bounce', 'Y Box') AND s.prod_id = p.prod_id
GROUP BY
ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc
, t.time_id),
ROLLUP(prod_category, prod_subcategory, p.prod_id);
Step 4 Create the comparison query
We have now set the stage for our comparison query. We can obtain period-to-per iod comparison calculations at all time levels. It requires applying analytic functions to a hierarchical cube with dense data along the time dimension.
Some of the calculations we can achieve for each time level are:
Sum of sales for prior period at all levels of time.
Variance in sales over prior period.
Sum of sales in the same period a year ago at all levels of time.
Variance in sales over the same per iod last year.
The following example performs all four of these calculations. It uses a partitioned outer join of th
e views cube_prod_time and edge_time to create an in-line view of dense data called dense_cube_prod_t
ime. The query then uses the LAG function in the same way as the prior single-level example. The outer WHER
E clause specifies time at three levels: the days of August 2001, the entire month, and the entire third quarter of 2001. Note
that the last two rows of the results contain the month level and quarter level aggregations.Note: To make the results easier to rea
d if you are using SQL*Plus, the column headings should be adjusted with the following commands. The commands will fold the column he
adings to reduce line length:
col sales_prior_period heading 'sales_prior|_period' col variance_prior_p eriod heading 'variance|_prior|_period' col sales_same_period_prior_year heading 'sales_same|_period_prior|_year' col variance_same_p eriod_p_year heading 'variance|_same_period|_prior_year'
Here is the query comparing current sales to prior and year ago sales:
SELECT SUBSTR(prod,1,4) prod, SUBSTR(Hierarchical_Time,1,12) ht,
sales, sales_prior_period,
sales - sales_prior_period variance_prior_period,
sales_same_period_prior_year,
sales - sales_same_period_prior_year variance_s
ame_period_p_year
FROM
(SELECT cat, subcat, prod, gid_p, gid_t,
Hierarchical_Time, yr, qtr, mon, day, sales,
LAG(sales, 1)
OVER (PARTITION BY gid_p, cat, subcat, prod,
gid_t ORDER BY yr, qtr, mon, day)
sales_prior_period,
LAG(sales, 1) OVER
(PARTITION BY gid_p, cat, subcat, prod,
gid_t, qtr_num, mon_num, day_num ORDER BY yr)
sales_same_period_prior_year
FRO
M
(SELECT c.gid, c.cat, c.subcat, c.prod, c.gid_p,
t.gid_t, t.yr, t.qtr, t.qtr_num, t.mon, t.mon_num,
t.day, t.day_nu
m, t.Hierarchical_Time, NVL(s_sold,0) sales
FROM cube_prod_time c
PARTITION BY (gid_p, cat, subcat, prod)
RIGHT OUTER JOI
N edge_time t
ON ( c.gid_t = t.gid_t AND
c.Hierarchical_Time = t.Hierarchical_Time)
) dense_cube_prod_time
)
--side by side current and prior year sales
WHERE prod IN (139) AND gid_p=0 AND --1 product and product level data
( (mon IN ('2001-08' ) AND gid_t IN (0, 1)) OR --day and month data
(qtr IN ('2001-03' ) AND gid_t IN (3))) --quarter level
data
ORDER BY day;
variance sales_same variance
sale
s_prior _prior _period_prior _same_period
PROD HT SALES _period _period _year _prior_year
---- ---
--------- ---------- ----------- ---------- ------------- ------------
139 01-AUG-01_3 0 0 0
0 0
139 02-AUG-01_3 1347.53 0 1347.53 0 1347.53
139 03-AUG-01_3 0 1347
.53 -1347.53 42.36 -42.36
139 04-AUG-01_3 57.83 0 57.83 995.75 -937.92
139 05-AUG-
01_3 0 57.83 -57.83 0 0
139 06-AUG-01_3 0 0 0 0
0
139 07-AUG-01_3 134.81 0 134.81 880.27 -745.46
139 08-AUG-01_3 1289.89 134.81
1155.08 0 1289.89
139 09-AUG-01_3 0 1289.89 -1289.89 0 0
139 10-AUG-01_3
0 0 0 0 0
139 11-AUG-01_3 0 0 0 0
0
139 12-AUG-01_3 0 0 0 0 0
139 13-AUG-01_3 0 0
0 0 0
139 14-AUG-01_3 0 0 0 0 0
139 15-AUG-01_3
38.49 0 38.49 1104.55 -1066.06
139 16-AUG-01_3 0 38.49 -38.49 0
0
139 17-AUG-01_3 77.17 0 77.17 1052.03 -974.86
139 18-AUG-01_3 2467.54 77.17 2390.
37 0 2467.54
139 19-AUG-01_3 0 2467.54 -2467.54 127.08 -127.08
139 20-AUG-01_3
0 0 0 0 0
139 21-AUG-01_3 0 0 0 0 0
139 22-AUG-01_3 0 0 0 0 0
139 23-AUG-01_3 1371.43 0 1371.43
0 1371.43
139 24-AUG-01_3 153.96 1371.43 -1217.47 2091.3 -1937.34
139 25-AUG-01_3 0
153.96 -153.96 0 0
139 26-AUG-01_3 0 0 0 0 0
139
27-AUG-01_3 1235.48 0 1235.48 0 1235.48
139 28-AUG-01_3 173.3 1235.48 -1062.18
2075.64 -1902.34
139 29-AUG-01_3 0 173.3 -173.3 0 0
139 30-AUG-01_3 0
0 0 0 0
139 31-AUG-01_3 0 0 0 0 0
139 20
01-08_2 8347.43 7213.21 1134.22 8368.98 -21.55
139 2001-03_1 24356.8 28862.14 -4505.34 2416
8.99 187.81
The first LAG function (sales_prior_period) partitions the data on gid_
p, cat, subcat, prod, gid_t and orders the rows on all the time dimension
columns. It gets the sales value of the prior period by passing an offset of 1. The second LAG function (sales_sam
e_period_prior_year) partitions the data on additional columns qtr_num, mon_num, and day_num
code> and orders it on yr so that, with an offset of 1, it can compute the year ago sales for the same period. The outermost SE
LECT clause computes the variances.