Skip Headers

O racle® Database Data Warehousing Guide
10g Release 1 (10.1)

Part Number B10736-01
Go to Documentation Home
H ome

Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Feedback

Go to previous
 page
Previous
Go to next page
Next
View PDF

21 SQL for Analysi s and Reporting

The following topics provide information about how to improve analytical SQL queries in a data warehou se:

Overview of SQL for Analysis and Reporting

Oracle has enhanced SQL's analytical processing capa bilities by introducing a new family of analytic SQL functions. These analytic functions enable you to calculate:

Rank ing functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cu mulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-t o-period changes. First/last analysis enables you to find the first or last value in an ordered group.

Other enhancements to S QL include the CASE expression. CASE expressions provide if-then logic useful in many situations.

In Oracle Database 10g, the SQL reporting capability was further enhanced by the introduction of partitioned outer join. Parti tioned outer join is an extension to ANSI outer join syntax that allows users to selectively densify certain dimensions while keeping others sparse. This allows reporting tools to selectively densify dimensions, for example, the ones that appear in their cross-tabul ar reports while keeping others sparse.

To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously execute all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing da tabase performance, scalability, and simplicity.

Analytic functions are classified as described in Table 2 1-1.

Table 21-1 Analytic Functions and Their Uses 

Type Used For
Ranking< /td> Calculating ranks, percentiles, and n-tiles of the values in a result set.
Windowing Calculating cumulative and moving aggregates. Works with these functions: SUM, AVG, MIN, < code>MAX, COUNT, VARIANCE, STDDEV, FIRST_VALUE, LAST_VALUE, a nd new statistical functions
Reporting Calculating shares, for example, market share. Works with these functions: SUM, AVG, MIN, MAX, COUNT (with/without DISTINCT), VARIANCE, STDDEV, RATIO_TO_REPORT, and new statistical functions
LAG/LEAD Finding a value in a row a specifie d number of rows from a current row.
FIRST/L AST First or last value in an ordered group.
Linear Regression Calculating linear regression and other statistics (slope, intercept, and so on).
Inverse Percentile The value in a data set that corresponds to a specified percentile.
Hypothetical Rank and Distribution The rank or percentile that a row would have if inserted into a specifie d data set.

To perform these operations, the analytic fun ctions add several new elements to SQL processing. These elements build on existing SQL to allow flexible and powerful calculation ex pressions. With just a few exceptions, the analytic functions have these new elements. The processing flow is represented in Figure 21-1.

Figure 21-1 Processing Order

Description of dwhsg021.gif follows
Descriptio n of the illustration dwhsg021.gif

The essential concepts used in analytic functions are:

< font face="arial, helvetica, sans-serif">Figure 21-2 Sliding Window Example

D
escription of dwhsg022.gif follows
Description of the i llustration dwhsg022.gif

< a name="DWHSG0201" id="DWHSG0201">Ranking Functions

A ranking function computes the rank of a record compared to other records in the data set based on the values of a set of measures. The types of ranking function are:

RANK and DENSE_RANK Functions

The RANK and DENSE_RANK functions allow you to rank items in a group, for example, finding th e top three products sold in California last year. There are two functions that perform ranking, as shown by the following syntax:

RANK ( ) OVER ( [query_partition_clause] order_by_clause )
DENSE_RANK ( ) OVER ( [query_partition_clause]
 order_by_clause )

The difference between RANK and DENSE_RANK is that DENSE_RANK l eaves no gaps in ranking sequence when there are ties. That is, if you were ranking a competition using DENSE_RANK and h ad three people tie for second place, you would say that all three were in second place and that the next person came in third. The < code>RANK function would also give three people in second place, but the next person would be in fifth place.

The follo wing are some relevant points about RANK:

  • Ascending is the default sort order, which you ma y want to change to descending.

  • The expressions in the optional PARTITION BY c lause divide the query result set into groups within which the RANK function operates. That is, RANK gets r eset whenever the group changes. In effect, the value expressions of the PARTITION BY clause define the res et boundaries.

  • If the PARTITION BY clause is missing, then ranks are computed over the entire query result set.

  • The ORDER BY clause specifies the measures ( <value expression>) on which ranking is done and defines the order in which rows are sorted in each g roup (or partition). Once the data is sorted within each partition, ranks are given to each row starting from 1.

  • The NULLS FIRST | NULLS LAST clause indicates the position of NULL s in the ordered sequence, either first or last in the sequence. The order of the sequence would make NULLs compa re either high or low with respect to non-NULL values. If the sequence were in ascending order, then NULLS FIRST implies that NULLs are smaller than all other non-NULL values and NULLS LAST implies they are larger than non-NULL values. It is the opposite for descending order. See the example in "Treatment of NULLs".

  • If the NULLS FIRST | NULL S LAST clause is omitted, then the ordering of the null values depends on the ASC or DESC arguments. Null values are considered larger than any other values. If the ordering sequence is ASC, then nulls will appear last; nulls will appear first otherwise. Nulls are considered equal to other nulls and, therefore, the order in which nulls ar e presented is non-deterministic.

Ranking Order

The following example shows how the [ASC | DESC] opt ion changes the ranking order.

Example 21-1 Ranking Order

SELECT channel_desc, TO_CHAR(SUM
(amount_sold), '9,999,999,999') SALES$,
   RANK() OVER (ORDER BY SUM(amount_sold)) AS default_rank,
   RANK() OVER (ORDER BY SUM(amou
nt_sold) DESC NULLS LAST) AS custom_rank
FROM sales, products, customers, times, channels, countries
WHERE sales.prod_id=products.pro
d_id AND sales.cust_id=customers.cust_id AND
  sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
  times.calen
dar_month_desc IN ('2000-09', '2000-10') AND country_iso_code='US'
GROUP BY channel_desc; 

CHANNEL_DESC         SALES$         DEFAU
LT_RANK CUSTOM_RANK
-------------------- -------------- ------------ -----------
Direct Sales              2,443,392            3
       1
Partners                  1,365,963            2           2
Internet                    467,478            1           3

<
/pre>

While the data in this result is ordered on the measure SALES$, in general, it is not guaranteed by the RANK function that the data will be sorted on the measures. If you want the data to be sorted on SALES$ in your r esult, you must specify it explicitly with an ORDER BY clause, at the end of the SELECT statem ent.

< h4>Ranking on Multiple Expressions

Ranking functions need to resolve ties between values in the set. If the first expression cannot resolve ties, the second expression is used to resolve ties a nd so on. For example, here is a query ranking three of the sales channels over two months based on their dollar sales, breaking ties with the unit sales. (Note that the TRUNC function is used here only to create tie values for this query.)

Example 21-2 Ranking On Multiple Expressions

SELECT channel_desc, calendar_month_desc, TO_CHAR(TRUNC(S
UM(amount_sold),-5),
  '9,999,999,999') SALES$, TO_CHAR(SUM(quantity_sold), '9,999,999,999') 
  SALES_Count, RANK() OVER (ORDER BY TR
UNC(SUM(amount_sold), -5) DESC, SUM(quantity_sold) DESC) AS col_rank
FROM sales, products, customers, times, channels
WHERE sales.pro
d_id=products.prod_id AND sales.cust_id=customers.cust_id AND
   sales.time_id=times.time_id AND sales.channel_id=channels.channel_id
 AND 
   times.calendar_month_desc IN ('2000-09', '2000-10') AND
   channels.channel_desc<>'Tele Sales'
GROUP BY channel_desc,
calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$         SALES_COUNT     COL_RANK
-------------------- -------- -----------
--- -------------- ---------
Direct Sales         2000-10       1,200,000         12,584          1
Direct Sales         2000-09
  1,200,000         11,995          2
Partners             2000-10         600,000          7,508          3
Partners             200
0-09         600,000          6,165          4
Internet             2000-09         200,000          1,887          5
Internet
      2000-10         200,000          1,450          6

The sales_count column breaks the ties for three pai rs of values.

RANK and DENSE_RANK Difference

The difference be tween RANK and DENSE_RANK functions is illustrated as follows:

Example 21-3 RANK and DENSE_RANK

SELECT channel_desc, calendar_month_desc,
   TO_CHAR(TRUNC(SUM(amount_sold),-4), '9,999,999,999') SAL
ES$,
      RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-4) DESC) AS RANK,
DENSE_RANK() OVER (ORDER BY TRUNC(SUM(amount_sold),-4) DES
C) AS DENSE_RANK
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cu
st_id AND
  sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND 
  times.calendar_month_desc IN ('2000-09', '200
0-10') AND
      channels.channel_desc<>'Tele Sales'
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR
 SALES$              RANK DENSE_RANK
-------------------- -------- -------------- --------- ----------
Direct Sales         2000-09
     1,200,000          1          1
Direct Sales         2000-10       1,200,000          1          1
Partners             2000-09
        600,000          3          2
Partners             2000-10         600,000          3          2
Internet             2000-09
         200,000          5          3
Internet             2000-10         200,000          5          3

Note that, in t he case of DENSE_RANK, the largest rank value gives the number of distinct values in the data set.

Per Group Ranking

The RANK function can be made to operate within grou ps, that is, the rank gets reset whenever the group changes. This is accomplished with the PARTITION BY cla use. The group expressions in the PARTITION BY subclause divide the data set into groups within which RANK operates. For example, to rank products within each channel by their dollar sales, you could issue the following stateme nt.

Exam ple 21-4 Per Group Ranking Example 1

SELECT channel_desc, calendar_month_desc, TO_
CHAR(SUM(amount_sold),
 '9,999,999,999') SALES$, RANK() OVER (PARTITION BY channel_desc
  ORDER BY SUM(amount_sold) DESC) AS RANK_BY_
CHANNEL
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND

      sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
      times.calendar_month_desc IN ('2000-08', '2000-
09', '2000-10', '2000-11')
      AND channels.channel_desc IN ('Direct Sales', 'Internet')
GROUP BY channel_desc, calendar_month_desc
;

A single query block can contain more than one ranking function, each partitioning the data into different groups (that is, reset on different boundaries). The groups can be mutually exclusive. The following query ranks products based on their dollar s ales within each month (rank_of_product_per_region) and within each channel (rank_of_product_total).

Example 21-5 Per Group Ranking Example 2

SELECT channel_desc, cal
endar_month_desc, TO_CHAR(SUM(amount_sold),
  '9,999,999,999') SALES$, RANK() OVER (PARTITION BY calendar_month_desc
  ORDER BY SUM(a
mount_sold) DESC) AS RANK_WITHIN_MONTH, RANK() OVER (PARTITION
 BY channel_desc ORDER BY SUM(amount_sold) DESC) AS RANK_WITHIN_CHANNE
L 
FROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
  sales.time_id=times.time_id AND sales.channel_id=channels.channel_id AND 
      times.calendar_month_desc IN ('2000-08', '2000-09',
 '2000-10', '2000-11')
    AND channels.channel_desc IN ('Direct Sales', 'Internet')
GROUP BY channel_desc, calendar_month_desc;

CHANNEL_DESC           CALENDAR   SALES$         RANK_WITHIN_MONTH RANK_WITHIN_CHANNEL
---------------
-----   --------   -------------- ----------------- -------------------
Direct Sales         2000-08       1,236,104
    1                   1
Internet             2000-08         215,107                     2                   4
Direct Sales
 2000-09       1,217,808                     1                   3
Internet             2000-09         228,241                     2
                   3
Direct Sales         2000-10       1,225,584                     1                   2
Internet             2000
-10         239,236                     2                   2
Direct Sales         2000-11       1,115,239                     1
              4
Internet             2000-11         284,742                     2                   1

Per Cube and Rollup Group Ranking

Analytic functions, RANK for example, ca n be reset based on the groupings provided by a CUBE, ROLLUP, or GROUPING SETS op erator. It is useful to assign ranks to the groups created by CUBE, ROLLUP, and GROUPING SETS queries. See Chapter 20, " SQL for Aggregation in Data Warehouses" for further informa tion about the GROUPING function.

A sample CUBE and ROLLUP query is the following:

SELECT channel_desc, country_iso_code, 
  TO_CHAR(SUM(amount_sold), '9,999,999,999')
 SALES$, RANK() OVER (
PARTITION BY GROUPING_ID(channel_desc, country_iso_code)
    ORDER BY SUM(amount_sold) DESC) AS RANK_PER_GROUP
FROM sales, customers,
 times, channels, countries
WHERE sales.time_id=times.time_id AND sales.cust_id=customers.cust_id AND
   sales.channel_id = channels.
channel_id AND channels.channel_desc
     IN ('Direct Sales', 'Internet') AND times.calendar_month_desc='2000-09'
   AND country_iso_
code IN ('GB', 'US', 'JP')
GROUP BY CUBE(channel_desc, country_iso_code);

CHANNEL_DESC         CO SALES$         RANK_PER_GROUP
----
---------------- -- -------------- --------------
Direct Sales         GB      1,217,808              1
Direct Sales         JP
1,217,808              1
Direct Sales         US      1,217,808              1
Internet             GB        228,241              4
Internet             JP        228,241              4
Internet             US        228,241              4
Direct Sales
    3,653,423              1
Internet                       684,724              2
                     GB      1,446,049
  1
                     JP      1,446,049              1
                     US      1,446,049              1
        4,338,147              1

Treatment of NULLs

NULLs are treated like normal values. Also, for rank computation, a NULL value is assumed to be equal to a nother NULL value. Depending on the ASC | DESC options provided for measures and the NUL LS FIRST | NULLS LAST clause, NULLs will either sort low or high and hence , are given ranks appropriately. The following example shows how NULLs are ranked in different cases:

SELECT  times.time_id time,  sold,
  RANK() OVER (ORDER BY (sold) DESC NULLS LAST) AS NLAST_DESC,
  RANK() OVER (ORDER BY
 (sold) DESC NULLS FIRST) AS NFIRST_DESC,
  RANK() OVER (ORDER BY (sold) ASC NULLS FIRST) AS NFIRST,
  RANK() OVER (ORDER BY (sold) A
SC NULLS LAST) AS NLAST
FROM
  (
   SELECT   time_id , sum(sales.amount_sold)  sold
   FROM sales, products, customers, countries
WHERE sales.prod_id=products.prod_id AND
     sales.cust_id=customers.cust_id AND prod_name IN ('Envoy Ambassador',
       'Mouse Pad
') AND country_iso_code ='GB'
   GROUP BY  time_id)
 v, times
WHERE v.time_id (+) =times.time_id AND calendar_year=1999
      AND cal
endar_month_number=1
ORDER BY sold  DESC NULLS LAST;

TIME            SOLD NLAST_DESC NFIRST_DESC     NFIRST      NLAST
--------- ---
------- ---------- ----------- ---------- ----------
14-JAN-99   25241.48          1          13         31         19
21-JAN-99   24
365.05          2          14         30         18
10-JAN-99   22901.24          3          15         29         17
20-JAN-99   165
78.19          4          16         28         16
16-JAN-99   15881.12          5          17         27         15
30-JAN-99   1563
7.49          6          18         26         14
17-JAN-99   13262.87          7          19         25         13
25-JAN-99   13227
.08          8          20         24         12
03-JAN-99    9885.74          9          21         23         11
28-JAN-99    4471.
08         10          22         22         10
27-JAN-99    3453.66         11          23         21          9
23-JAN-99     925.4
5         12          24         20          8
07-JAN-99     756.87         13          25         19          7
08-JAN-99      571.8
         14          26         18          6
13-JAN-99     569.21         15          27         17          5
02-JAN-99     316.87
        16          28         16          4
12-JAN-99     195.54         17          29         15          3
26-JAN-99      92.96
       18          30         14          2
19-JAN-99      86.04         19          31         13          1
05-JAN-99
      20           1          1         20
01-JAN-99                    20           1          1         20
31-JAN-99
     20           1          1         20
11-JAN-99                    20           1          1         20
06-JAN-99
    20           1          1         20
18-JAN-99                    20           1          1         20
09-JAN-99
   20           1          1         20
29-JAN-99                    20           1          1         20
22-JAN-99
  20           1          1         20
04-JAN-99                    20           1          1         20
24-JAN-99
 20           1          1         20
15-JAN-99                    20           1          1         20

Bottom N Ranking

Bottom N is similar to top N except for the ordering sequence within the rank expression. Using the previous example, you can order SUM(s_amount) ascending instead of descending.

CUME_DIST Function

The CUME_DIST function (defined as the inverse of percentile in some statistical books) computes the position of a specified value relative to a set of values. The order can be ascending or descending. Ascending is the d efault. The range of values for CUME_DIST is from greater than 0 to 1. To compute the CUME_DIST of a value x in a set S of size N, you use the formula:

CUME_DIST(x) =  number of values in S coming before 
   an
d including x in the specified order/ N

Its syntax is:

CUME_DIST ( ) OVER ( [query_partition
_clause] order_by_clause )

The semantics of various options in the CUME_DIST function are similar to those i n the RANK function. The default order is ascending, implying that the lowest value gets the lowest CUME_DIST (as all other values come later than this value in the order). NULLs are treated the same as they are in the RA NK function. They are counted toward both the numerator and the denominator as they are treated like non-NULL val ues. The following example finds cumulative distribution of sales by channel within each month:

SELECT
calendar_month_desc AS MONTH, channel_desc,
     TO_CHAR(SUM(amount_sold) , '9,999,999,999') SALES$,
     CUME_DIST() OVER (PARTITION
 BY calendar_month_desc ORDER BY
         SUM(amount_sold) ) AS CUME_DIST_BY_CHANNEL
FROM sales, products, customers, times, channels

WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
      sales.time_id=times.time_id AND sales.channel_id=
channels.channel_id AND 
      times.calendar_month_desc IN ('2000-09', '2000-07','2000-08')
GROUP BY calendar_month_desc, channel_de
sc;

MONTH    CHANNEL_DESC         SALES$         CUME_DIST_BY_CHANNEL
-------- -------------------- -------------- -----------------
---
2000-07  Internet                    140,423           .333333333
2000-07  Partners                    611,064           .6666666
67
2000-07  Direct Sales              1,145,275                    1
2000-08  Internet                    215,107           .33333333
3
2000-08  Partners                    661,045           .666666667
2000-08  Direct Sales              1,236,104                    1

2000-09  Internet                    228,241           .333333333
2000-09  Partners                    666,172           .666666667
2000-09  Direct Sales              1,217,808                    1

PERCENT_RANK Function

PERCENT_ RANK is similar to CUME_DIST, but it uses rank values rather than row counts in its numerator. Therefore, it retu rns the percent rank of a value relative to a group of values. The function is available in many popular spreadsheets. PERCENT_ RANK of a row is calculated as:

(rank of row in its partition - 1) / (number of rows in the part
ition - 1)

PERCENT_RANK returns values in the range zero to one. The row(s) with a rank of 1 will have a PERCENT_RANK of zero. Its syntax is:

PERCENT_RANK () OVER ([query_partition_clause] order_by_
clause)

NTILE Function

NTILE allows easy calculation of tertiles, quartiles, deciles and other common summary statistics. This function divides an ordered partition into a specified number of groups called buckets< /strong> and assigns a bucket number to each row in the partition. NTILE is a very useful calculation because it lets us ers divide a data set into fourths, thirds, and other groupings.

The buckets are calculated so that each bucket has exactly th e same number of rows assigned to it or at most 1 row more than the others. For instance, if you have 100 rows in a partition and ask for an NTILE function with four buckets, 25 rows will be assigned a value of 1, 25 rows will have value 2, and so on. T hese buckets are referred to as equiheight buckets.

If the number of rows in the partition does not divide evenly (without a r emainder) into the number of buckets, then the number of rows assigned for each bucket will differ by one at most. The extra rows wil l be distributed one for each bucket starting from the lowest bucket number. For instance, if there are 103 rows in a partition which has an NTILE(5) function, the first 21 rows will be in the first bucket, the next 21 in the second bucket, the next 21 in the third bucket, the next 20 in the fourth bucket and the final 20 in the fifth bucket.

The NTILE function ha s the following syntax:

NTILE (expr) OVER ([query_partition_clause] order_by_clause)

In thi s, the N in NTILE(N) can be a constant (for example, 5) or an expression.

This function, like RANK a nd CUME_DIST, has a PARTITION BY clause for per group computation, an ORDER BY clause for specifying the measures and their sort order, and NULLS FIRST | NULLS LAST clause for the specific treatment of NULLs. For example, the following is an example assigning each mont h's sales total into one of 4 buckets:

SELECT calendar_month_desc AS MONTH , TO_CHAR(SUM(amount_sold),
 '9,999,999,999')
 SALES$, NTILE(4) OVER (ORDER BY SUM(amount_sold)) AS TILE4
FROM sales, products, customers, times, channels
WHERE
sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
      sales.time_id=times.time_id AND sales.channel_id=channel
s.channel_id AND
      times.calendar_year=2000 AND prod_category= 'Electronics'
GROUP BY calendar_month_desc;

MONTH    SALES$
        TILE4
-------- -------------- ----------
2000-02         242,416          1
2000-01         257,286          1
2000-03
  280,011          1
2000-06         315,951          2
2000-05         316,824          2
2000-04         318,106          2
2000-07
         433,824          3
2000-08         477,833          3
2000-12         553,534          3
2000-10         652,225          4
2000-11         661,147          4
2000-09         691,449          4

NTILE ORDER BY statements must be fully specified to yield reproducible results. Equal values can get distributed across adjacent buckets. To ens ure deterministic results, you must order on a unique key.

ROW_NUMBER Function

The ROW_NUMBER function assigns a unique number (sequentially, starting from 1, as defined by ORDER BY) to each row withi n the partition. It has the following syntax:

ROW_NUMBER ( ) OVER ( [query_partition_clause] order_by_c
lause )

Example 21-6 ROW_NUMBER

SELECT channel_desc, calendar_month_desc,
   TO_CHAR(TR
UNC(SUM(amount_sold), -5), '9,999,999,999') SALES$,
   ROW_NUMBER() OVER (ORDER BY TRUNC(SUM(amount_sold), -6) DESC) AS ROW_NUMBER 
F
ROM sales, products, customers, times, channels
WHERE sales.prod_id=products.prod_id AND sales.cust_id=customers.cust_id AND
      sa
les.time_id=times.time_id AND sales.channel_id=channels.channel_id AND
      times.calendar_month_desc IN ('2001-09', '2001-10')
GROU
P BY channel_desc, calendar_month_desc;

CHANNEL_DESC         CALENDAR SALES$         ROW_NUMBER
-------------------- -------- ------
-------- ----------
Direct Sales         2001-09       1,100,000          1
Direct Sales         2001-10       1,000,000          2
I
nternet             2001-09         500,000          3
Internet             2001-10         700,000          4
Partners             2
001-09         600,000          5
Partners             2001-10         600,000          6

Note that there are three pairs of tie values in these results. Like NTILE, ROW_NUMBER is a non-deterministic function, so each tied value could have its row number switched. To ensure deterministic results, you must order on a unique key. Inmost cases, that will require adding a new tie breaker column to the query and using it in the ORDER BY specification.

Windowing Aggregate Functions

Windowing functions can be used to compute cumulative, moving, and centered aggregates. They return a value for each row in t he table, which depends on other rows in the corresponding window. These functions include moving sum, moving average, moving min/max , cumulative sum, as well as statistical functions. They can be used only in the SELECT and ORDER BY< /code> clauses of the query. Two other functions are available: FIRST_VALUE, which returns the first value in the window ; and LAST_VALUE, which returns the last value in the window. These functions provide access to more than one row of a t able without a self-join. The syntax of the windowing functions is:

{SUM|AVG|MAX|MIN|COUNT|STDDEV|VARIA
NCE|FIRST_VALUE|LAST_VALUE} 
  ({value expression1 | *}) OVER
    ([PARTITION BY value expression2[,...])
      ORD
ER BY value expression3 [collate clause>]
                [ASC| DESC] [NULLS FIRST | NULLS LAST] [,...] 
{ROWS | RANGE} {
BETWEEN
  {UNBOUNDED PRECEDING | CURRENT ROW | value_expr {PRECEDING | FOLLOWING}} AND
  { UNBOUNDED FOLLOWING | CURRENT ROW | value_
expr { PRECEDING | FOLLOWING } }
  | { UNBOUNDED PRECEDING | CURRENT ROW | value_expr PRECEDING}}


See Also:

Oracle Database SQL Reference for further information regarding syntax and re strictions

Treatment of NULLs as Input to Window Functions

Window functions' NULL semantics match the NULL semantics for SQL aggregate functions. Other seman tics can be obtained by user-defined functions, or by using the DECODE or a CASE expression within the wind ow function.

Windowing Functions with Logical Offset

A logical offset can be specified with co nstants such as RANGE 10 PRECEDING, or an expression that evaluates to a constant, or by an interval specification like RANGE INTERVAL N DAY/MONTH/YEAR PRECEDING or an expression that evaluates to an interval. With logical offset, there can only be one expression in the ORDER BY expression list in the function, with type compatible to NUMERIC if offset is numeric, or DATE i f an interval is specified.

Example 21-7 Cumulative Aggregate Function

The following is an example of cumulative amount_sold by customer ID by quarter in 1999:

SELECT c.cust_id, t.calendar_quarter_desc,
TO_CHAR (SUM(amount_sold),
  '9,999,999,999.99') AS Q_SALES, TO_CHAR(SUM(SUM(amount_sold))

OVER (PARTITION BY c.cust_id ORDER BY c.c
ust_id, t.calendar_quarter_desc
ROWS UNBOUNDED
PRECEDING), '9,999,999,999.99') AS CUM_SALES
  FROM sales s, times t, customers c
  WH
ERE s.time_id=t.time_id AND s.cust_id=c.cust_id AND t.calendar_year=2000
  AND c.cust_id IN (2595, 9646, 11111)
  GROUP BY c.cust_id,
 t.calendar_quarter_desc
  ORDER BY c.cust_id, t.calendar_quarter_desc;

   CUST_ID CALENDA Q_SALES           CUM_SALES
---------- --
----- ----------------- -----------------
      2595 2000-01            659.92            659.92
      2595 2000-02            224.79
            884.71
      2595 2000-03            313.90          1,198.61
      2595 2000-04          6,015.08          7,213.69
  9646 2000-01          1,337.09          1,337.09
      9646 2000-02            185.67          1,522.76
      9646 2000-03
   203.86          1,726.62
      9646 2000-04            458.29          2,184.91
     11111 2000-01             43.18             4
3.18
     11111 2000-02             33.33             76.51
     11111 2000-03            579.73            656.24
     11111 2000-04
            307.58            963.82

In this example, the analytic function SUM defines, for each row, a win dow that starts at the beginning of the partition (UNBOUNDED PRECEDING) and ends, by default, at the curren t row.

Nested SUMs are needed in this example since we are performing a SUM over a value that is its elf a SUM. Nested aggregations are used very often in analytic aggregate functions.

< div class="example">

Example 21-8 Moving Aggregate Function

This example of a time-based window shows, for one customer, the moving average of sales for the current month and preceding two months:

SELECT c.cust_id, t.calendar_month_desc, TO_C
HAR (SUM(amount_sold), 
      '9,999,999,999') AS SALES, TO_CHAR(AVG(SUM(amount_sold))
OVER (ORDER BY c.cust_id, t.calendar_month_des
c  ROWS 2 PRECEDING), '9,999,999,999')
AS MOVING_3_MONTH_AVG
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id AND s.cust_
id=c.cust_id AND
      t.calendar_year=1999 AND c.cust_id IN (6510)
GROUP BY c.cust_id, t.calendar_month_desc
ORDER BY c.cust_id, t.c
alendar_month_desc; 

   CUST_ID CALENDAR SALES          MOVING_3_MONTH
---------- -------- -------------- --------------
      6510
1999-04             125            125
      6510 1999-05           3,395          1,760
      6510 1999-06           4,080
2,533
      6510 1999-07           6,435          4,637
      6510 1999-08           5,105          5,207
      6510 1999-09
  4,676          5,405
      6510 1999-10           5,109          4,963
      6510 1999-11             802          3,529

< p>Note that the first two rows for the three month moving average calculation in the output data are based on a smaller interval size than specified because the window calculation cannot reach past the data retrieved by the query. You need to consider the different window sizes found at the borders of result sets. In other words, you may need to modify the query to include exactly what you want.< /p>

Centered Aggregate Function

Calculating windowing aggregate fu nctions centered around the current row is straightforward. This example computes for all customers a centered moving average of sale s for one week in late December 1999. It finds an average of the sales total for the one day preceding the current row and one day fo llowing the current row including the current row as well.

Example 21-9 Centered Aggregate

SELECT t.time_id, TO_CHAR (SUM(amount_sold), '9,999,999,999')
AS SALES, TO_CHAR(AVG(SUM(amount_sold)) OVER
  (ORDER BY t.time_id
RANGE BETWEEN INTERVAL '1' DAY PRECEDING AND
   INTERVAL '1' DAY FOLLOWING), '9,999,999,999') AS CENTERED_3_DAY_AVG
   FROM sales s,
times t
   WHERE s.time_id=t.time_id AND t.calendar_week_number IN (51)
   AND calendar_year=1999
   GROUP BY t.time_id
   ORDER BY t
.time_id;

TIME_ID   SALES          CENTERED_3_DAY
--------- -------------- --------------
20-DEC-99        134,337        106,676
21
-DEC-99         79,015        102,539
22-DEC-99         94,264         85,342
23-DEC-99         82,746         93,322
24-DEC-99
  102,957         82,937
25-DEC-99         63,107         87,062
26-DEC-99         95,123         79,115

The starting and ending rows for each product's centered moving average calculation in the output data are based on just two days, since the window c alculation cannot reach past the data retrieved by the query. Users need to consider the different window sizes found at the borders of result sets: the query may need to be adjusted.

Windowing Aggregate Functi ons in the Presence of Duplicates

The following example illustrates how window aggregate functions compute values when there are duplicates, that is, when multiple rows are returned for a single ordering value. The query retrieves the quantity sold to several customers during a specified time range. (Although we use an inline view to define our base data set, it has no special sign ificance and can be ignored.) The query defines a moving window that runs from the date of the current row to 10 days earlier.Note th at the RANGE keyword is used to define the windowing clause of this example. This means that the window can potentially hold many row s for each value in the range. In this case, there are three pairs of rows with duplicate date values.

Example 21-10 Windowing Aggregate Func tions with Logical Offsets

SELECT time_id, daily_sum, SUM(daily_sum) OVER (ORDER B
Y time_id
RANGE BETWEEN INTERVAL '10' DAY PRECEDING AND CURRENT ROW)
AS current_group_sum
FROM (SELECT time_id, channel_id, SUM(s.qua
ntity_sold)
AS daily_sum
FROM customers c, sales s, countries
WHERE c.cust_id=s.cust_id
AND s.cust_id IN (638, 634, 753, 440 ) AND s.
time_id BETWEEN '01-MAY-00' AND
 '13-MAY-00' GROUP BY time_id, channel_id);

TIME_ID    DAILY_SUM CURRENT_GROUP_SUM
--------- -------
--- -----------------
06-MAY-00        161               161   /*  161   */
10-MAY-00         23               207   /*  161 +(23+23)
*/
10-MAY-00         23               207   /*  161 +(23+23)  */
11-MAY-00         46               345   /*  161 +(23+23)+(46+92)  *
/
11-MAY-00         92               345   /*  161 +(23+23)+(46+92)   */
12-MAY-00         23               368   /*  161 +(23+23)+(4
6+92)+23  */
13-MAY-00         46               529   /*  161 +(23+23)+(46+92)+23+(46+115) */
13-MAY-00        115               529
  /*  161 +(23+23)+(46+92)+23+(46+115) */

In the output of this example, all dates except May 6 and May 12 return two row s with duplicate dates. Examine the commented numbers to the right of the output to see how the values are calculated. Note that each group in parentheses represents the values returned for a single day.

Note that this example applies only when you use the RANGE keyword rather than the ROWS keyword. It is also important to remember that with RANGE, yo u can only use 1 ORDER BY expression in the analytic function's ORDER BY clause. With the ROWS keyword, you can use multiple order by expressions in the analytic function's ORDER BY clause .

Varying Window Size for Each Row

There are situations where it is useful to vary the size of a window for each row, based on a specified condition. For instance, you may want to make the window larger for certain dates and smaller for others. Assume that you want to calculate the moving average of stock price over three work ing days. If you have an equal number of rows for each day for all working days and no non-working days are stored, then you can use a physical window function. However, if the conditions noted are not met, you can still calculate a moving average by using an expres sion in the window size parameters.

Expressions in a window size specification can be made in several different sources. the e xpression could be a reference to a column in a table, such as a time table. It could also be a function that returns the appropriate boundary for the window based on values in the current row. The following statement for a hypothetical stock price database uses a u ser-defined function in its RANGE clause to set window size:

SELECT t_timekey, AVG(stock_p
rice)
        OVER (ORDER BY t_timekey RANGE fn(t_timekey) PRECEDING) av_price
FROM stock, time WHERE st_timekey = t_timekey
ORDER BY
 t_timekey;

In this statement, t_timekey is a date field. Here, fn could be a PL/SQL function with the following specification:

fn(t_timekey) returns

Note that, when window is specified using a number in a window function with ORDE R BY on a date column, then it is converted to mean the number of days. You could have also used the interval lit eral conversion function, as NUMTODSINTERVAL(fn(t_timekey), 'DAY') instead of just fn(t_timekey) to mean th e same thing. You can also write a PL/SQL function that returns an INTERVAL datatype value.

Wind owing Aggregate Functions with Physical Offsets

For windows expressed in rows, the ordering expressions should be uniq ue to produce deterministic results. For example, the following query is not deterministic because time_id is not unique in this result set.

Example 21-11 Windowing Aggregate Functions With Physical Offsets

SEL
ECT t.time_id, TO_CHAR(amount_sold, '9,999,999,999') AS INDIV_SALE,
       TO_CHAR(SUM(amount_sold) OVER (PARTITION BY t.time_id ORDE
R BY t.time_id
ROWS UNBOUNDED PRECEDING), '9,999,999,999') AS CUM_SALES
FROM sales s, times t, customers c
WHERE s.time_id=t.time_id
AND s.cust_id=c.cust_id AND t.time_id IN 
 (TO_DATE('11-DEC-1999'), TO_DATE('12-DEC-1999')) AND c.cust_id 
 BETWEEN 6500 AND 6600
ORD
ER BY t.time_id;

TIME_ID   INDIV_SALE    CUM_SALES
--------- ----------    ---------
12-DEC-99         23           23
12-DEC-99
      9           32
12-DEC-99         14           46
12-DEC-99         24           70
12-DEC-99         19           89

< p>One way to handle this problem would be to add the prod_id column to the result set and order on both time_id and prod_id.

FIRST_VALUE and LAST_VALUE Functions

The FIRST_VALUE and LAST_VALUE functions allow you to select the first and last rows from a window. These rows are especially valuable because they are often used as the baselines in calculations. For in stance, with a partition holding sales data ordered by day, you might ask "How much was each day's sales compared to the first sales day (FIRST_VALUE) of the period?" Or you might wish to know, for a set of rows in increasing sales order, "What was the percentage size of each sale in the region compared to the largest sale (LAST_VALUE) in the region?"

If the IGNORE NULLS option is used with FIRST_VALUE, it will return the first non-null value in the set, o r NULL if all values are NULL. If IGNORE NULLS is used with LAST_VALUE, it will r eturn the last non-null value in the set, or NULL if all values are NULL. The IGNORE NUL LS option is particularly useful in populating an inventory table properly.

Reporting Aggregate Functions

After a query has been processed, aggregate values like t he number of resulting rows or an average value in a column can be easily computed within a partition and made available to other rep orting functions. Reporting aggregate functions return the same aggregate value for every row in a partition. Their behavior with res pect to NULLs is the same as the SQL aggregate functions. The syntax is:

{SUM | AVG | MAX
| MIN | COUNT | STDDEV | VARIANCE} 
  ([ALL | DISTINCT] {value expression1 | *})
   OVER ([PARTITION BY value expression
2[,...]])

In addition, the following conditions apply:

Reporting functions can appear only in the SELECT clause or the ORDER BY clause. The major benefit of reporting functions is their ability to do multiple passes of data in a single query block and speed up query performance. Queries such as "Count the number of salesmen wit h sales more than 10% of city sales" do not require joins between separate query blocks.

For example, consider the question "F or each product category, find the region in which it had maximum sales". The equivalent SQL query using the MAX reporti ng aggregate function would be:

SELECT prod_category, country_region, sales
FROM (SELECT SUBSTR(p.prod_
category,1,8) AS prod_category, co.country_region, SUM(amount_sold) AS sales,
MAX(SUM(amount_sold)) OVER (PARTITION BY prod_category)
 AS MAX_REG_SALES
FROM sales s, customers c, countries co, products p
WHERE s.cust_id=c.cust_id AND c.country_id=co.country_id AND
    s.prod_id =p.prod_id AND s.time_id = TO_DATE('11-OCT-2001')
GROUP BY prod_category, country_region)
WHERE sales = MAX_REG_SALES;

The inner query with the reporting aggregate function MAX(SUM(amount_sold)) returns:

PROD_CAT COUNTRY_REGION            SALES MAX_REG_SALES
-------- -------------------- ---------- -------------
Electron America
s                 581.92        581.92
Hardware Americas                 925.93        925.93
Peripher Americas                3084.4
8       4290.38
Peripher Asia                    2616.51       4290.38
Peripher Europe                  4290.38       4290.38
Periphe
r Oceania                  940.43       4290.38
Software Americas                 4445.7        4445.7
Software Asia
   1408.19        4445.7
Software Europe                  3288.83        4445.7
Software Oceania                  890.25        4445.
7

The full query results are:

PROD_CAT COUNTRY_REGION            SALES
-------- -----------
--------- ----------
Electron Americas                 581.92
Hardware Americas                 925.93
Peripher Europe
   4290.38
Software Americas                 4445.7

Example 21-12 Reporting Aggregate Example

Reporting aggreg ates combined with nested queries enable you to answer complex queries efficiently. For example, what if you want to know the best se lling products in your most significant product subcategories? The following is a query which finds the 5 top-selling products for ea ch product subcategory that contributes more than 20% of the sales within its product category:

SELECT
SUBSTR(prod_category,1,8) AS CATEG, prod_subcategory, prod_id, SALES 
FROM (SELECT p.prod_category, p.prod_subcategory, p.prod_id,
    SUM(amount_sold) AS SALES,
      SUM(SUM(amount_sold)) OVER (PARTITION BY p.prod_category) AS CAT_SALES,
      SUM(SUM(amount_sol
d)) OVER 
         (PARTITION BY p.prod_subcategory) AS SUBCAT_SALES,
      RANK() OVER  (PARTITION BY p.prod_subcategory  
ORDER BY SUM(amount_sold) ) AS RANK_IN_LINE
     FROM sales s, customers c, countries co, products p 
     WHERE s.cust_id=c.cust_id
AND
        c.country_id=co.country_id AND  s.prod_id=p.prod_id AND
        s.time_id=to_DATE('11-OCT-2000')
     GROUP BY p.prod_cat
egory, p.prod_subcategory, p.prod_id
     ORDER BY prod_category, prod_subcategory) 
  WHERE SUBCAT_SALES>0.2*CAT_SALES AND RANK_I
N_LINE<=5;

RATIO_TO_REPORT Function

The RATIO_TO_REPORT function computes the ratio of a value to the sum of a set of value s. If the expression value expression evaluates to NULL, RATIO_TO_REPORT also eva luates to NULL, but it is treated as zero for computing the sum of values for the denominator. Its syntax is:

RATIO_TO_REPORT ( expr ) OVER ( [query_partition_clause] )

In this, the following applies:

Example 21-13 RATIO_TO_REPORT

To calculate RATIO_TO_REPORT of sales for each channel, you might use the following syntax:

SELECT ch.channel_desc, TO_CHAR(SUM(amount_sold),'9,999,999') AS SA
LES,
     TO_CHAR(SUM(SUM(amount_sold)) OVER (), '9,999,999') AS TOTAL_SALES,
     TO_CHAR(RATIO_TO_REPORT(SUM(amount_sold)) OVER (),
 '9.999')
     AS RATIO_TO_REPORT
FROM sales s, channels ch 
WHERE s.channel_id=ch.channel_id  AND s.time_id=to_DATE('11-OCT-2000')

GROUP BY ch.channel_desc;

CHANNEL_DESC         SALES      TOTAL_SALE RATIO_
-------------------- ---------- ---------- ------
Direc
t Sales             14,447     23,183   .623
Internet                    345     23,183   .015
Partners                  8,391     23
,183   .362

LAG/LEAD Functions

The LAG and LEAD functions are useful for comparing values when the relative positions of rows can be known reliably. They work by specifying the count of rows which separate the target row from the current row . Because the functions provide access to more than one row of a table at the same time without a self-join, they can enhance process ing speed. The LAG function provides access to a row at a given offset prior to the current position, and the LEAD function provides access to a row at a given offset after the current position.

LAG/LEAD Syntax

These functions hav e the following syntax:

{LAG | LEAD} ( value_expr [, offset] [, default] ) 
   OVER ( [query_partition_
clause] order_by_clause )

offset is an optional parameter and defaults to 1. default is an optional parameter and is the value returned if offset falls outside the bounds of the table or p artition.

Example 21-14 LAG/LEAD

SELECT time_id, TO_CHAR(SUM(amount_sold),'9,999,999') AS
SALES, 
  TO_CHAR(LAG(SUM(amount_sold),1) OVER (ORDER BY time_id),'9,999,999') AS LAG1,
  TO_CHAR(LEAD(SUM(amount_sold),1) OVER (ORDE
R BY time_id),'9,999,999') AS LEAD1
FROM sales
WHERE time_id>=TO_DATE('10-OCT-2000') AND time_id<=TO_DATE('14-OCT-2000')
GROUP
BY time_id;

TIME_ID   SALES      LAG1       LEAD1
--------- ---------- ---------- ----------
10-OCT-00    238,479                23,
183
11-OCT-00     23,183    238,479     24,616
12-OCT-00     24,616     23,183     76,516
13-OCT-00     76,516     24,616     29,795
14-OCT-00     29,795     76,516

See "Data Densification for Reporting" for information showing how to use the LAG/LEAD functions for doing period-to-period comparison querie s on sparse data.

FIRST/LAST Functions

The FIRST/LAST aggregate functions allow you to rank a data set and work with its top-ranked or bottom-ranked rows. Af ter finding the top or bottom ranked rows, an aggregate function is applied to any desired column. That is, FIRST/ LAST lets you rank on column A but return the result of an aggregate applied on the first-ranked or last-ranked rows of column B. This is valuable because it avoids the need for a self-join or subquery, thus improving performance. These functions' syntax begi ns with a regular aggregate function (MIN, MAX, SUM, AVG, COUNT, VARIANCE, STDDEV) that produces a single return value per group. To specify the ranking used, the FIRST< /code>/LAST functions add a new clause starting with the word KEEP.

FIRST/LAST Syntax

These funct ions have the following syntax:

aggregate_function KEEP ( DENSE_RANK LAST ORDER BY
  expr [ DESC | ASC
] [NULLS { FIRST | LAST }]
  [, expr [ DESC | ASC ] [NULLS { FIRST | LAST }]]...)
[OVER query_partitioning_clause]

Note t hat the ORDER BY clause can take multiple expressions.

FIRST/LAST As Regular Aggreg ates

You can use the FIRST/LAST family of aggregates as regular aggregate functions.

Example 21-15 FIRST/LAST Example 1

The following query lets us compare minimum pric e and list price of our products. For each product subcategory within the Men's clothing category, it returns the following:

SELECT prod_subcategory, MIN(prod_list_price) 
  KEEP (DENSE_RAN
K FIRST ORDER BY (prod_min_price)) AS LP_OF_LO_MINP,
MIN(prod_min_price) AS LO_MINP,
MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER
 BY (prod_min_price))
 AS LP_OF_HI_MINP,
MAX(prod_min_price) AS HI_MINP
FROM products WHERE prod_category='Electronics'
GROUP BY prod
_subcategory;

PROD_SUBCATEGORY LP_OF_LO_MINP LO_MINP  LP_OF_HI_MINP  HI_MINP
---------------- ------------- -------  ------------- -
---------
Game Consoles          299.99  299.99          299.99     299.99
Home Audio             499.99  499.99          599.99
599.99
Y Box Accessories        7.99    7.99           20.99      20.99
Y Box Games              7.99   7.99            29.99      29
.99

FIRST/LAST As Reporting Aggregates

You can also use the FIRST/LAST< /code> family of aggregates as reporting aggregate functions. An example is calculating which months had the greatest and least incre ase in head count throughout the year. The syntax for these functions is similar to the syntax for any other reporting aggregate.

Consider the example in Example 21-15 for FIRST/LAST. What if we wanted to find the list pri ces of individual products and compare them to the list prices of the products in their subcategory that had the highest and lowest m inimum prices?

The following query lets us find that information for the Documentation subcategory by using FIRST/LAST as reporting aggregates.

Example 21-16 FIRST/LAST Example 2

SELECT prod_id, prod_list_price,
    MIN(prod_list_price) KEEP (DENSE_RANK FIRST ORDER BY (prod_min_price))
       OVER(
PARTITION BY (prod_subcategory)) AS LP_OF_LO_MINP,
    MAX(prod_list_price) KEEP (DENSE_RANK LAST ORDER BY (prod_min_price))
       O
VER(PARTITION BY (prod_subcategory)) AS LP_OF_HI_MINP
FROM products WHERE prod_subcategory = 'Documentation';

   PROD_ID PROD_LIST_P
RICE LP_OF_LO_MINP LP_OF_HI_MINP
---------- --------------- ------------- -------------
        40           44.99         44.99
    44.99
        41           44.99         44.99         44.99
        42           44.99         44.99         44.99
        43
        44.99         44.99         44.99
        44           44.99         44.99         44.99
        45           44.99         4
4.99         44.99

Using the FIRST and LAST functions as reporting aggregates makes it easy to include the results in calculations such "Salary as a percent of the highest salary."

Inverse Percenti le Functions

Using the CUME_DIST function, you can find the cumulative distribution (percentile) of a set of values. However, the inverse operation (finding what value computes to a certain percentile) is neither easy to do nor efficientl y computed. To overcome this difficulty, the PERCENTILE_CONT and PERCENTILE_DISC functions were introduced. These can be used both as window reporting functions as well as normal aggregate functions.

These functions need a sort speci fication and a parameter that takes a percentile value between 0 and 1. The sort specification is handled by using an ORDER BY clause with one expression. When used as a normal aggregate function, it returns a single value for each ordered set.

PERCENTILE_CONT, which is a continuous function computed by interpolation, and PERCENTILE_DISC, which is a step function that assumes discrete values. Like other aggregates, PERCENTILE_CONT and PERCENTILE_DISC operate on a group of rows in a grouped query, but with the following differences:

  • They require a parameter between 0 and 1 (inclusive). A parameter specified out of this range will result in error. This parameter should be speci fied as an expression that evaluates to a constant.

  • They require a sort specification. This sort specifi cation is an ORDER BY clause with a single expression. Multiple expressions are not allowed.

Normal Ag gregate Syntax

[PERCENTILE_CONT | PERCENTILE_DISC]( constant expression ) 
    WITHIN
GROUP ( ORDER BY single order by expression
[ASC|DESC] [NULLS FIRST| NULLS LAST])

Inverse Percentile Example Basis

We use the following query to return the 17 rows of data used in the examples of this section:

SELE
CT cust_id, cust_credit_limit, CUME_DIST() 
   OVER (ORDER BY cust_credit_limit) AS CUME_DIST
FROM customers WHERE cust_city='Marshal
';

   CUST_ID CUST_CREDIT_LIMIT  CUME_DIST
---------- ----------------- ----------
     28344              1500 .173913043
      896
2              1500 .173913043
     36651              1500 .173913043
     32497              1500 .173913043
     15192
  3000 .347826087
    102077              3000 .347826087
    102343              3000 .347826087
      8270              3000 .34782
6087
     21380              5000  .52173913
     13808              5000  .52173913
    101784              5000  .52173913
     304
20              5000  .52173913
     10346              7000 .652173913
     31112              7000 .652173913
     35266
   7000 .652173913
      3424              9000 .739130435
    100977              9000 .739130435
    103066             10000 .7826
08696
     35225             11000 .956521739
     14459             11000 .956521739
     17268             11000 .956521739
    100
421             11000 .956521739
     41496             15000          1

PERCENTILE_DISC(x) is computed by scanning up the CUME_DIST values in each group till you find the first one greater than or equal to x< /code>, where x is the specified percentile value. For the example query where PERCENTILE_DISC(0.5), the re sult is 5,000, as the following illustrates:

SELECT PERCENTILE_DISC(0.5) WITHIN GROUP
  (ORDER BY cust_
credit_limit) AS perc_disc, PERCENTILE_CONT(0.5) WITHIN GROUP 
  (ORDER BY cust_credit_limit) AS perc_cont
 FROM customers WHERE cust
_city='Marshal';

PERC_DISC   PERC_CONT
---------   ---------
     5000        5000

The result of PERCENTILE_CONT is computed by linear interpolation between rows after ordering them. To compute PERCENTILE_CONT(x), we first comp ute the row number = RN= (1+x*(n-1)), where n is the number of rows in the group and x is the specifie d percentile value. The final result of the aggregate function is computed by linear interpolation between the values from rows at ro w numbers CRN = CEIL(RN) and FRN = FLOOR(RN).

The final result will be: PERCENTILE_CONT(X) = if (CRN = FRN = RN), then (value of expression from row at RN) else (CRN - RN) * (value of expression for row at FRN) + (RN -FRN) * (value of expression for row at CRN).

Consi der the previous example query, where we compute PERCENTILE_CONT(0.5). Here n is 17. The row number RN = (1 + 0.5*(n-1))= 9 for both groups. Putting this into the formula, (FRN=CRN=9), we return the value from row 9 as the result.

Another example is, if you want to compute PERCENTILE_CONT(0.66). The computed row number RN=(1 + 0.66*(n-1))= (1 + 0.66*16)= 11.67. PERCENTILE_CONT(0.66) = (12-11.67)*(value of row 11)+(11.67-1 1)*(value of row 12). These results are:

SELECT PERCENTILE_DISC(0.66) WITHIN GROUP 
  (ORDER BY cust_cr
edit_limit) AS perc_disc, PERCENTILE_CONT(0.66) WITHIN GROUP 
  (ORDER BY cust_credit_limit) AS perc_cont
FROM customers WHERE cust_c
ity='Marshal';

 PERC_DISC  PERC_CONT
---------- ----------
      9000       8040

Inverse percentile aggregate functions can appear in the HAVING clause of a query like other existing aggregate functions.

As Reporting Aggregates

You can also use the aggregate functions PERCENTILE_CONT, PERCENTILE_DISC as rep orting aggregate functions. When used as reporting aggregate functions, the syntax is similar to those of other reporting aggregates.

[PERCENTILE_CONT | PERCENTILE_DISC](constant expression) 
WITHIN GROUP ( ORDER BY single
order by expression
[ASC|DESC] [NULLS FIRST| NULLS LAST])
OVER ( [PARTITION BY value expression [,...]] )

T his query computes the same thing (median credit limit for customers in this result set, but reports the result for every row in the result set, as shown in the following output:

SELECT cust_id, cust_credit_limit, PERCENTILE_DISC(0.5) W
ITHIN GROUP
     (ORDER BY cust_credit_limit) OVER () AS perc_disc,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY cust_credit_limit)

  OVER () AS perc_cont
FROM customers WHERE cust_city='Marshal';

   CUST_ID CUST_CREDIT_LIMIT  PERC_DISC  PERC_CONT
---------- ----
------------- ---------- ----------
     28344              1500       5000       5000
      8962              1500       5000
5000
     36651              1500       5000       5000
     32497              1500       5000       5000
     15192              30
00       5000       5000
    102077              3000       5000       5000
    102343              3000       5000       5000
8270              3000       5000       5000
     21380              5000       5000       5000
     13808              5000       50
00       5000
    101784              5000       5000       5000
     30420              5000       5000       5000
     10346
       7000       5000       5000
     31112              7000       5000       5000
     35266              7000       5000       50
00
      3424              9000       5000       5000
    100977              9000       5000       5000
    103066             10000
       5000       5000
     35225             11000       5000       5000
     14459             11000       5000       5000
     172
68             11000       5000       5000
    100421             11000       5000       5000
     41496             15000       5000
       5000

Inverse Percentile Restrictions

For PERCEN TILE_DISC, the expression in the ORDER BY clause can be of any data type that you can sort (numeric, string, date, and so on). However, the expression in the ORDER BY clause must be a numeric or datetime typ e (including intervals) because linear interpolation is used to evaluate PERCENTILE_CONT. If the expression is of type < code>DATE, the interpolated result is rounded to the smallest unit for the type. For a DATE type, the interpolate d value will be rounded to the nearest second, for interval types to the nearest second (INTERVAL DAY TO SECOND) or to the month(INTERVAL YEAR TO MONTH).

L ike other aggregates, the inverse percentile functions ignore NULLs in evaluating the result. For example, when you want to find the median value in a set, Oracle Database ignores the NULLs and finds the median among the non-null values. Yo u can use the NULLS FIRST/NULLS LAST option in the ORDER BY clause, but they will be ignored as NULLs are ignored.

Example 21-18 Hypothetical Rank and Distributi on Example 2

SELECT prod_subcategory,
RANK(10,8) WITHIN GROUP (ORDER BY prod_list_
price DESC,prod_min_price) AS
HRANK, TO_CHAR(PERCENT_RANK(10,8) WITHIN GROUP
(ORDER BY prod_list_price, prod_min_price),'9.999') AS H
PERC_RANK,
TO_CHAR(CUME_DIST (10,8) WITHIN GROUP
(ORDER BY prod_list_price, prod_min_price),'9.999') AS HCUME_DIST
FROM products WHER
E prod_subcategory LIKE 'Recordable%'
GROUP BY prod_subcategory;

PROD_SUBCATEGORY          HRANK   HPERC_   HCUME_
-----------------
---      -----   ------   ------
Recordable CDs                4     .571     .625
Recordable DVD Discs          5     .200     .333

These functions can appear in the HAVING clause of a query just like other aggregate functions. They cannot be used as either reporting aggregate functions or windowing aggregate functions.

Linear Regression Fu nctions

The regression functions sup port the fitting of an ordinary-least-squares regression line to a set of number pairs. You can use them as both aggregate functions or windowing or reporting functions.

The functions are as follows:

Oracle applies the fun ction to the set of (e1, e2) pairs after eliminating all pairs for which either of e1 or e2 is null. e1 is interpreted as a value of the dependent variable (a "y value"), and e2 is interpreted as a val ue of the independent variable (an "x value"). Both expressions must be numbers.

The regression functions are all computed sim ultaneously during a single pass through the data. They are frequently combined with the COVAR_POP, COVAR_SAMP, and CORR functions.

REGR_COUNT Function

REGR_COUNT returns the number of non-null number pairs used to f it the regression line. If applied to an empty set (or if there are no (e1, e2) pairs where neither of e1 or e2 is null), the function returns 0.

REGR_AVGY and REGR_AVGX Functio ns

REGR_AVGY and REGR_AVGX compute the averages of the depend ent variable and the independent variable of the regression line, respectively. REGR_AVGY computes the average of its fi rst argument (e1) after eliminating (e1, e2) pairs where either of e1 or e2 is null. Similarly, REGR_AVGX computes the average of its second argument (e2) after null eliminati on. Both functions return NULL if applied to an empty set.

REGR_SLOPE and REGR_INTERCEPT Functions

The REGR_SLOPE function computes the slope of the regression line fitted to non-null (e1, e2) pairs.

The REGR_INTERCEPT function computes the y-intercept of the regression line. REGR_INTERCEPT returns NULL whenever slope or th e regression averages are NULL.

REGR_R2 Function

The REGR_R2 function computes the coefficient of determination (usually called "R-squared" or "goodness of fit") for the regression line.

REGR_R2 returns values between 0 and 1 when the regression line is defined (slope of the line is not null), and it returns NULL otherwise. The closer the value is to 1 , the better the regression line fits the data.

REGR_SXX, REGR_SYY, an d REGR_SXY Functions

REGR_SXX, REGR_SYY and REGR_SXY functions are used in computing various diagno stic statistics for regression analysis. After eliminating (e1, e2) pairs where either of e1 o r e2 is null, these functions make the following computations:

REGR_SXX:    REGR_COUNT(e1,
e2) * VAR_POP(e2)
REGR_SYY:    REGR_COUNT(e1,e2) * VAR_POP(e1) 
REGR_SXY:    REGR_COUNT(e1,e2) * COVAR_POP(e1, e2)

Linear Regression Statistics Examples

Some common diagnostic statistics that accompany linear regression ana lysis are given in Table 21-2, "Common Diagnostic Statistics and Their Expressions ". Note that this rel ease's new functions allow you to calculate all of these.

Table 21-2 Common Diagnostic Statisti cs and Their Expressions 

Type of Statistic Expression
Adjusted R2 1-((1 - REGR_R2)*((REGR_COU NT-1)/(REGR_COUNT-2)))
Standard error SQRT((REGR_SYY-(POWER(REGR_SXY,2)/REGR_SXX))/(REGR_COUNT-2))
Total sum of squares REGR_SYY
Regress ion sum of squares POWER(REGR_SXY,2) / REGR_SXX
Residual sum of squares REGR_SYY - (POWER(REGR_SXY,2)/REGR_SXX)
t statistic for slope REGR_SLOPE * SQRT(REGR_SXX) / (Standard error)
t statistic for y-int ercept REGR_INTERCEPT / ((Standard error) * SQRT((1/REGR_COUNT)+(P OWER(REGR_AVGX,2)/REGR_SXX))

Sam ple Linear Regression Calculation

In this example, we compute an ordinary-least-squares regression line that expresses the quantity sold of a product as a linear function of the product's list price. The calculations are grouped by sales channel. The values SLOPE, INTCPT, RSQR are slope, intercept, and coefficient of determination of the regre ssion line, respectively. The (integer) value COUNT is the number of products in each channel for whom both quantity sol d and list price data are available.

SELECT s.channel_id, REGR_SLOPE(s.quantity_sold, p.prod_list_price
) SLOPE,
  REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) INTCPT,
  REGR_R2(s.quantity_sold, p.prod_list_price) RSQR,
  REGR_COUN
T(s.quantity_sold, p.prod_list_price) COUNT,
  REGR_AVGX(s.quantity_sold, p.prod_list_price) AVGLISTP,
  REGR_AVGY(s.quantity_sold, p
.prod_list_price) AVGQSOLD
FROM sales s, products p WHERE s.prod_id=p.prod_id
  AND p.prod_category='Electronics' AND s.time_id=to_DA
TE('10-OCT-2000')
GROUP BY s.channel_id;

CHANNEL_ID      SLOPE     INTCPT       RSQR      COUNT   AVGLISTP   AVGQSOLD
---------- ---
------- ---------- ---------- ---------- ---------- ----------
         2          0          1          1         39 466.656667
     1
         3          0          1          1         60     459.99          1
         4          0          1          1
   19 526.305789          1

Frequent Itemsets

Instead of counting how often a given event occurs (for example, how often someone has purchased milk at the grocery), frequent itemsets provides a mechanism for counting how often multiple events occur together (for example, how often someone has purchased both milk and cereal together at the grocery store).

The input to the frequent-itemsets operation is a set of data that represents collections of items (itemsets). Some examples of itemsets could be all of the products that a given customer purchased in a single trip to the grocery store (commonly called a market basket), the web-page s that a user accessed in a single session, or the financial services that a given customer utilizes. The notion of a frequent itemse t is to find those itemsets that occur most often. If you apply the frequent-itemset operator to a grocery store's point-of-sale data , you might, for example, discover that milk and bananas are the most commonly bought pair of items.

Frequent itemsets have th us been used in business intelligence environments for many years, with the most common one being for market basket analysis in the r etail industry. Frequent itemsets are integrated with the database, operating on top of relational tables and accessed through SQL. T his integration provides a couple of key benefits:

  • Applications that previously relied on frequent items et operations now benefit from significantly improved performance as well as simpler implementation.

  • SQL -based applications that did not previously use frequent itemsets can now be easily extended to take advantage of this functionality.

Frequent itemsets analysis is performed with the PL/SQL package DBMS_FREQUENT_ITEMSETS. See PL/SQL Packages and Types Reference for more information.

Other Statistical Functions

Oracle introduces a set of SQL statistical functions and a statistics package, DBMS_STAT_FUNCS. This section lists some of the new functions along with basic syntax.

See PL /SQL Packages and Types Reference for detailed information about the DBMS_STAT_FUNCS package and Oracle Database SQL Reference for syntax and semantics.

Descri ptive Statistics

You can calculate the following descriptive statistics:

  • Median of a Data Set

    Median (expr) [OVER (query_partition_clause)]
    
  • Mode of a Data Set

    STATS_MODE (expr)
    

Hypothesis Testing - Parametric Tests

You can calculate the following descriptive statistics:

  • One-Sample T-Test

    STATS_T_TEST_ONE (expr1, expr2 (a constant) [, return_value])
    
    
  • Paired-Samples T-Test

    STATS_T_TEST_PAIRED (expr1, expr2 [, return_value])
    
    
  • Independent-Samp les T-Test. Pooled Variances

    STATS_T_TEST_INDEP (expr1, expr2 [, return_value])
    
    
  • Independent-Samples T-Test, Unpooled Variances

    STATS_T_TEST_INDEPU (expr1, expr2 [, return_v
    alue])
    
    
  • The F-Test

    STATS_F_TEST (expr1, expr2 [, return_value])
    
    
  • One-Way ANOVA

    STATS_ONE_WAY_ANOVA (expr1, expr2 [, return_value])
    

Crosstab Statistics

You can calculate crosstab statistics using the following syntax:

STATS_CROSSTAB (expr1, expr2 [, return_value])

Can return any one of the following:

    < li type="disc">

    Observed value of chi-squared

  • Significance of observed chi-squared

  • Degree of freedom for chi-squared

  • Phi coefficient, Cramer's V statistic

  • Contingency coefficient

  • Cohen's Kappa

Hypothesis Testing - Non -Parametric Tests

You can calculate hypothesis statistics using the following syntax:

ST
ATS_BINOMIAL_TEST (expr1, expr2, p [, return_value])

  • Binomial Test/Wilcoxon Signed Ranks Test

    < pre xml:space="preserve">STATS_WSR_TEST (expr1, expr2 [, return_value])
  • Mann-Whitney Test

    STATS_MW_TEST (expr1, expr2 [, return_value])
    
    
  • Kolmogorov-Smirnov Test

    STATS_KS_TEST (expr1, expr2 [, return_value])
    

Non-Parametric Correlation< /font>

You can calculate the following parametric statistics:

  • Spearman's rho Coefficient

    CORR_S (expr1, expr2 [, return_value])
  • Kendall's tau-b Coefficient

    CORR_K (expr1, expr2 [, return_value])
    
    

In addition to the functions, this release has a ne w PL/SQL package, DBMS_STAT_FUNCS. It contains the descriptive statistical function SUMMARY along with func tions to support distribution fitting. The SUMMARY function summarizes a numerical column of a table with a variety of d escriptive statistics. The five distribution fitting functions support normal, uniform, Weibull, Poisson, and exponential distributio ns.

WIDTH_BUCKET Function

For a giv en expression, the WIDTH_BUCKET function returns the bucket number that the result of this expression will be assigned a fter it is evaluated. You can generate equiwidth histograms with this function. Equiwidth histograms divide data sets into buckets wh ose interval size (highest value to lowest value) is equal. The number of rows held by each bucket will vary. A related function, NTILE, creates equiheight buckets.

Equiwidth histograms can be generated only for numeric, date or datetime types. S o the first three parameters should be all numeric expressions or all date expressions. Other types of expressions are not allowed. I f the first parameter is NULL, the result is NULL. If the second or the third parameter is NULL, an error message is returned, as a NULL value cannot denote any end point (or any point) for a range in a date or num eric value dimension. The last parameter (number of buckets) should be a numeric expression that evaluates to a positive integer valu e; 0, NULL, or a negative value will result in an error.

Buckets are numbered from 0 to (n+1). Bucke t 0 holds the count of values less than the minimum. Bucket(n+1) holds the count of values greater than or equal to the maximum specified value.

WIDTH_BUCKET Syntax

The WIDTH_BUCKET takes four expressions as parameters. The first parameter is the expression that the equiwidth histogram is for. The second and third parameters are expressions that denote the end points of the acceptable range for the first parameter. The fourth parameter denotes the number of buckets.

WIDTH_BUCKET(expression, minval expression, maxval expression, num buckets)

Consider the following data from table < code>customers, that shows the credit limits of 17 customers. This data is gathered in the query shown in Example 21-19.

CUST_ID    CUST_CREDIT_LIMIT
---------  -----------------
    10346               70
00
    35266               7000
    41496              15000
    35225              11000
     3424               9000
    28344
          1500
    31112               7000
     8962               1500
    15192               3000
    21380               5000
  36651               1500
    30420               5000
     8270               3000
    17268              11000
    14459
    11000
    13808               5000
    32497               1500
   100977               9000
   102077               3000
   1030
66              10000
   101784               5000
   100421              11000
   102343               3000

In the table customers, the column cust_credit_limit contains values between 1500 and 15000, and we can assign the valu es to four equiwidth buckets, numbered from 1 to 4, by using WIDTH_BUCKET (cust_credit_limit, 0, 20000, 4). Ideally each bucket is a closed-open interval of the real number line, for example, bucket number 2 is assigned to scores between 5000.0000 and 9 999.9999..., sometimes denoted [5000, 10000) to indicate that 5,000 is included in the interval and 10,000 is excluded. To accommodat e values outside the range [0, 20,000), values less than 0 are assigned to a designated underflow bucket which is numbered 0, and val ues greater than or equal to 20,000 are assigned to a designated overflow bucket which is numbered 5 (num buckets + 1 in general). Se e Figure 21-3 for a graphical illustration of how the buckets are assigned.

Figur e 21-3 Bucket Assignments

Description of dwhsg078.gif follows
Description of the illustration dwhsg078.gif

You can specify the bounds in the reverse order, for example, WIDTH_BUCKET (cust_credit_l imit, 20000, 0, 4). When the bounds are reversed, the buckets will be open-closed inter vals. In this example, bucket number 1 is (15000,20000], bucket number 2 is (10000,15000], and bucket numbe r 4, is (0,5000]. The overflow bucket will be numbered 0 (20000, +infinity), and the underflow bucket will be numbered 5 (-infinity, 0].

It is an error if the bucket count parameter is 0 or nega tive.

Example 21-19 WIDTH_BUCKET

The following query shows the bucket numbers for the credit limits in the customers table for both cases where the boundaries are specified in regular or reverse order. We use a range of 0 to 20,000.

SELECT cust_id, cust_credit_limit,
 WIDTH_BUCKET(cust_credit_limit,0,20000,4) AS WIDTH_BUCKET_UP,
 WIDTH_BUCKET(cust_credit_limit,
20000, 0, 4) AS WIDTH_BUCKET_DOWN
FROM customers WHERE cust_city = 'Marshal';

   CUST_ID CUST_CREDIT_LIMIT WIDTH_BUCKET_UP WIDTH_BUC
KET_DOWN
---------- ----------------- --------------- -----------------
     10346              7000               2
3
     35266              7000               2                 3
     41496             15000               4                 2
35225             11000               3                 2
      3424              9000               2                 3
     28344
            1500               1                 4
     31112              7000               2                 3
      8962
     1500               1                 4
     15192              3000               1                 4
     21380              50
00               2                 4
     36651              1500               1                 4
     30420              5000
          2                 4
      8270              3000               1                 4
     17268             11000
   3                 2
     14459             11000               3                 2
     13808              5000               2
              4
     32497              1500               1                 4
    100977              9000               2
       3
    102077              3000               1                 4
    103066             10000               3
3
    101784              5000               2                 4
    100421             11000               3                 2
    1
02343              3000               1                 4

User-Defined Aggregate Functions

Oracle offers a facility for creating your own functions, called user-defined aggregate functions. These functions are writ ten in programming languages such as PL/SQL, Java, and C, and can be used as analytic functions or aggregates in materialized views. See Oracle Data Cartridge Developer's Guide for further information regarding syntax and restrictions.

The advantages of these functions are:

  • Highly com plex functions can be programmed using a fully procedural language.

  • Higher scalability than other techni ques when user-defined functions are programmed for parallel processing.

  • Object datatypes can be process ed.

As a simple example of a user-defined aggregate function, consider the skew statistic. This calculation measures if a data set has a lopsided distribution about its mean. It will tell you if one tail of the distribution is significantly larger t han the other. If you created a user-defined aggregate called udskew and applied it to the credit limit data in the prio r example, the SQL statement and results might look like this:

SELECT USERDEF_SKEW(cust_credit_limit) F
ROM customers
WHERE cust_city='Marshal';

USERDEF_SKEW
============
0.583891

Before building user-defined aggregate funct ions, you should consider if your needs can be met in regular SQL. Many complex calculations are possible directly in SQL, particular ly by using the CASE expression.

Staying with regular SQL will enable simpler development, and many query operati ons are already well-parallelized in SQL. Even the earlier example, the skew statistic, can be created using standard, albeit lengthy , SQL.

CASE Expressions

Oracle now supports simple and searched CASE statements. CASE statements are similar in purpose to the DECODE statement, but they offer more flexibility and logical power. They are also easier to read than traditional DECODE statements, and offer better performance as well. They are commonly use d when breaking categories into buckets like age (for example, 20-29, 30-39, and so on). The syntax for simple statements is:

expr WHEN comparison_expr THEN return_expr [, WHEN comparison_expr THEN return_expr]...

The synta x for searched statements is:

WHEN condition THEN return_expr [, WHEN condition THEN return_expr]...

<
/pre>

You can specify only 255 arguments and each WHEN ... THEN pair counts as two arguments. For a wor karound to this limit, see Oracle Database S QL Reference.

Example 21-20 CASE

Suppose you wanted to find the average salary of all employees in the c ompany. If an employee's salary is less than $2000, you want the query to use $2000 instead. Without a CASE statement, y ou would have to write this query as follows,

SELECT AVG(foo(e.sal)) FROM emps e;

In this, foo is a function that returns its input if the input is greater than 2000, and returns 2000 otherwise. The query has pe rformance implications because it needs to invoke a function for each row. Writing custom functions can also add to the development l oad.

Using CASE expressions in the database without PL/SQL, this query can be rewritten as:

SELECT AVG(CASE when e.sal > 2000 THEN e.sal ELSE 2000 end) FROM emps e;

Using a CASE expression lets you avoid developing custom functions and can also perform faster.

Creating Histograms With User-Defi ned Buckets

You can use the CASE statement when you want to obta in histograms with user-defined buckets (both in number of buckets and width of each bucket). The following are two examples of histo grams created with CASE statements. In the first example, the histogram totals are shown in multiple columns and a singl e row is returned. In the second example, the histogram is shown with a label column and a single column for totals, and multiple row s are returned.

Example 21-21 Histogram Example 1

SELECT SUM(CASE WHEN cust_credit_limit B
ETWEEN  0 AND 3999 THEN 1 ELSE 0 END)
  AS "0-3999",
SUM(CASE WHEN cust_credit_limit BETWEEN  4000 AND 7999 THEN 1 ELSE 0 END)
  AS "
4000-7999",
SUM(CASE WHEN cust_credit_limit BETWEEN  8000 AND 11999 THEN 1 ELSE 0 END)
  AS "8000-11999",
SUM(CASE WHEN cust_credit_l
imit BETWEEN  12000 AND 16000 THEN 1 ELSE 0 END)
  AS "12000-16000"
FROM customers WHERE cust_city = 'Marshal';

    0-3999  4000-799
9 8000-11999 12000-16000
---------- ---------- ---------- -----------
         8          7          7           1

Example 21-22 Histogram Example 2

SELECT (CASE WHEN cust_credit_limit BET
WEEN  0 AND 3999 THEN  ' 0 - 3999'
   WHEN cust_credit_limit BETWEEN  4000 AND 7999 THEN ' 4000 - 7999'
   WHEN cust_credit_limit BET
WEEN  8000 AND 11999 THEN  ' 8000 - 11999'
   WHEN cust_credit_limit BETWEEN  12000 AND 16000 THEN '12000 - 16000' END)
  AS BUCKET,
COUNT(*) AS Count_in_Group
FROM customers WHERE cust_city = 'Marshal' GROUP BY
 (CASE WHEN cust_credit_limit BETWEEN  0 AND 3999 THEN
 ' 0 - 3999'
 WHEN cust_credit_limit BETWEEN  4000 AND 7999 THEN ' 4000 - 7999'
 WHEN cust_credit_limit BETWEEN  8000 AND 11999 THEN
 ' 8000 - 11999'
 WHEN cust_credit_limit BETWEEN  12000 AND 16000 THEN '12000 - 16000' END);

BUCKET        COUNT_IN_GROUP
----------
--- --------------
 0 - 3999                  8
 4000 - 7999               7
 8000 - 11999              7
12000 - 16000
1

Data Densification for Reporting

Data is < /a>normally stored in sparse form. That is, if no value exists for a given combination of di mension values, no row exists in the fact table. However, you may want to view the data in dense form, with rows for all combination of dimension values displayed even when no fact data exist for them. For example, if a product did not sell during a particular time period, you may still want to see the product for that time period with zero sales value next to it. Moreover, time series calculatio ns can be performed most easily when data is dense along the time dimension. This is because dense data will fill a consistent number of rows for each period, which in turn makes it simple to use the analytic windowing functions with physical offsets. Data densifica tion is the process of converting spare data into dense form. To overcome the problem of sparsity, you can use a partitioned outer jo in to fill the gaps in a time series or any other dimension. Such a join extends the conventional outer join syntax by applying the o uter join to each logical partition defined in a query. Oracle logically partitions the rows in your query based on the expression yo u specify in the PARTITION BY clause. The result of a partitioned outer join is a UNION of the outer joins of each of the partitions in the logically partitioned table with the table on the other side of the join. Note that you can use this type of join to fill the gaps in any dimension, not just the time dimension. Most of the examples here focus on the tim e dimension because it is the dimension most frequently used as a basis for comparisons.

Partition Join Syntax

The syntax for partitioned outer join extends the ANSI SQL JOIN clause with the phrase PARTITION BY followed by an expression list. The expressions in the list specify the group to which the outer joi n is applied. The following are the two forms of syntax normally used for partitioned outer join:

SELEC
T .....
FROM table_reference
PARTITION BY (expr [, expr ]... )
RIGHT OUTER JOIN table_reference

SELECT .....
FROM table_reference
LE
FT OUTER JOIN table_reference
PARTITION BY {expr [,expr ]...)

Note that FULL OUTER JOIN is not supported with a partitioned outer join.

Sample of Sparse Data

A typical situation with a spa rse dimension is shown in the following example, which computes the weekly sales and year-to-date sales for the product Bounce for we eks 20-30 in 2000 and 2001:

SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year,
 t.Cale
ndar_Week_Number Week, SUM(Amount_Sold) Sales
FROM Sales s, Times t, Products p
WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_id
 AND
 p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND
 t.Calendar_Week_Number BETWEEN 20 AND 30
GROUP BY p.Prod_Name
, t.Calendar_Year, t.Calendar_Week_Number;

PRODUCT_NAME          YEAR       WEEK      SALES
--------------- ---------- ---------- --
--------
Bounce                2000         20        801
Bounce                2000         21    4062.24
Bounce                2000
         22    2043.16
Bounce                2000         23    2731.14
Bounce                2000         24    4419.36
Bounce
          2000         27    2297.29
Bounce                2000         28    1443.13
Bounce                2000         29    1927.3
8
Bounce                2000         30    1927.38
Bounce                2001         20     1483.3
Bounce                2001
  21    4184.49
Bounce                2001         22    2609.19
Bounce                2001         23    1416.95
Bounce
   2001         24    3149.62
Bounce                2001         25    2645.98
Bounce                2001         27    2125.12
Bounc
e                2001         29    2467.92
Bounce                2001         30    2620.17

In this example, we would ex pect 22 rows of data (11 weeks each from 2 years) if the data were dense. However we get only 18 rows because weeks 25 and 26 are mis sing in 2000, and weeks 26 and 28 in 2001.

Filling Gaps in Data

We can take the sparse data of the preceding query and do a partitioned outer join wi th a dense set of time data. In the following query, we alias our original query as v and we select data from the times table, which we alias as t. Here we retrieve 22 rows because there are no gaps in the series. The four adde d rows each have 0 as their Sales value set to 0 by using the NVL function.

SELECT Product
_Name, t.Year, t.Week, NVL(Sales,0) dense_sales
FROM
 (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name,
  t.Calendar_Year Year, t.Calend
ar_Week_Number Week, SUM(Amount_Sold) Sales
  FROM Sales s, Times t, Products p
  WHERE s.Time_id = t.Time_id AND s.Prod_id = p.Prod_
id AND
   p.Prod_name IN ('Bounce') AND t.Calendar_Year IN (2000,2001) AND
   t.Calendar_Week_Number BETWEEN 20 AND 30
  GROUP BY p.P
rod_Name, t.Calendar_Year, t.Calendar_Week_Number) v
PARTITION BY (v.Product_Name)
RIGHT OUTER JOIN
 (SELECT DISTINCT Calendar_Week_N
umber Week, Calendar_Year Year
  FROM Times
  WHERE Calendar_Year IN (2000, 2001)
  AND Calendar_Week_Number BETWEEN 20 AND 30) t
ON
(v.week = t.week AND v.Year = t.Year)
ORDER BY t.year, t.week;
PRODUCT_NAME          YEAR       WEE
K DENSE_SALES
--------------- ---------- ---------- -----------
Bounce                2000         20         801
Bounce
   2000         21     4062.24
Bounce                2000         22     2043.16
Bounce                2000         23     2731.14
Bo
unce                2000         24     4419.36
Bounce                2000         25           0
Bounce                2000
26           0
Bounce                2000         27     2297.29
Bounce                2000         28     1443.13
Bounce
    2000         29     1927.38
Bounce                2000         30     1927.38
Bounce                2001         20      1483.3
B
ounce                2001         21     4184.49
Bounce                2001         22     2609.19
Bounce                2001
 23     1416.95
Bounce                2001         24     3149.62
Bounce                2001         25     2645.98
Bounce
     2001         26           0
Bounce                2001         27     2125.12
Bounce                2001         28           0
Bounce                2001         29     2467.92
Bounce                2001         30     2620.17

Note that in this que ry, a WHERE condition was placed for weeks between 20 and 30 in the inline view for the time dimension. This was introdu ced to keep the result set small.

Filling Gaps in Two Dimensions

N-dimensional data is typically displayed as a dense 2-dimensional cross tab of (n - 2) page dimensions. This requires that all dimension values for the two dimensions appearing in the cross tab be filled in. The following is another example where the part itioned outer join capability can be used for filling the gaps on two dimensions:

WITH v1 AS
 (SELECT p
.prod_id, country_id, calendar_year,
    SUM(quantity_sold) units, SUM(amount_sold) sales
  FROM sales s, products p, customers c, ti
mes t
  WHERE s.prod_id in (147, 148) AND t.time_id = s.time_id AND
    c.cust_id = s.cust_id AND p.prod_id = s.prod_id
  GROUP BY p.
prod_id, country_id, calendar_year),
v2 AS                          -- countries to use for densifications.
  (SELECT DISTINCT countr
y_id 
  FROM customers
  WHERE country_id IN (52782, 52785, 52786, 52787, 52788)),
v3 AS                          --  years to use fo
r densifications
  (SELECT DISTINCT calendar_year FROM times)
SELECT v4.prod_id, v4.country_id, v3.calendar_year, units, sales
FROM 
  (SELECT prod_id, v2.country_id, calendar_year, units, sales
   FROM v1 PARTITION BY (prod_id) 
   RIGHT OUTER JOIN v2
 --densifies on country
   ON (v1.country_id = v2.country_id)) v4 
PARTITION BY (prod_id,country_id)
RIGHT OUTER JOIN v3
      --densifies on year
ON (v4.calendar_year = v3.calendar_year)
ORDER BY 1, 2, 3;

In this query, the WITH sub-query factoring clause v1, summarizes sales data at the product, country, and year level. This result is sparse but users may want to see all the country, year combinations for each product. To achieve this, we take each partition of v1 based on product values and outer join it on the country dimension first. This will give us all values of country for each product. We then take that result and partition it on product and country values and then outer join it on time dimension. This will give us all time values for each product and country combination.

PROD_ID COUNTRY_ID CALENDAR_YEAR      UNITS
    SALES
---------- ---------- ------------- ---------- ----------
       147      52782          1998
       147      52782
  1999         29     209.82
       147      52782          2000         71     594.36
       147      52782          2001        345
    2754.42
       147      52782          2002
       147      52785          1998          1       7.99
       147      52785
    1999
       147      52785          2000
       147      52785          2001
       147      52785          2002
       147
52786          1998          1       7.99
       147      52786          1999
       147      52786          2000          2      15.
98
       147      52786          2001
       147      52786          2002
       147      52787          1998
       147      52787
         1999
       147      52787          2000
       147      52787          2001
       147      52787          2002
       147
     52788          1998
       147      52788          1999
       147      52788          2000          1       7.99
       147
  52788          2001
       147      52788          2002
       148      52782          1998        139    4046.67
       148      5
2782          1999        228    5362.57
       148      52782          2000        251    5629.47
       148      52782          200
1        308    7138.98
       148      52782          2002
       148      52785          1998
       148      52785          1999
      148      52785          2000
       148      52785          2001
       148      52785          2002
       148      52786
     1998
       148      52786          1999
       148      52786          2000
       148      52786          2001
       148
 52786          2002
       148      52787          1998
       148      52787          1999
       148      52787          2000
   148      52787          2001
       148      52787          2002
       148      52788          1998          4     117.23
148      52788          1999
       148      52788          2000
       148      52788          2001
       148      52788          2
002

Filling Gaps in an Inventory Table

An inventory table typically tracks quantity of units available for various products. This table is sparse: it only stores a row for a product wh en there is an event. For a sales table, the event is a sale, and for the inventory table, the event is a change in quantity availabl e for a product. For example, consider the following inventory table:

CREATE TABLE invent_table (
produ
ct VARCHAR2(10),
time_id DATE,
quant NUMBER);

INSERT INTO invent_table VALUES
 ('bottle', TO_DATE('01/04/01', 'DD/MM/YY'), 10);
INSE
RT INTO invent_table VALUES
 ('bottle', TO_DATE('06/04/01', 'DD/MM/YY'), 8);
INSERT INTO invent_table VALUES
 ('can', TO_DATE('01/04/
01', 'DD/MM/YY'), 15);
INSERT INTO invent_table VALUES
 ('can', TO_DATE('04/04/01', 'DD/MM/YY'), 11);

The inventory table now has the following rows:

PRODUCT    TIME_ID   QUANT
---------- --------- -----
bottle     01-APR-01
    10
bottle     06-APR-01     8
can        01-APR-01    15
can        04-APR-01    11

For reporting purposes, users may want to see this inventory data differently. For example, they may want to see all values of time for each product. This can be acco mplished using partitioned outer join. In addition, for the newly inserted rows of missing time periods, users may want to see the va lues for quantity of units column to be carried over from the most recent existing time period. The latter can be accomplished using analytic window function LAST_VALUE value. Here is the query and the desired output:

WITH
v1 AS
 (SELECT time_id
  FROM times
  WHERE times.time_id BETWEEN
   TO_DATE('01/04/01', 'DD/MM/YY')
   AND TO_DATE('07/04/01', 'DD/M
M/YY'))
SELECT product, time_id, quant quantity,
  LAST_VALUE(quant IGNORE NULLS)
    OVER (PARTITION BY product ORDER BY time_id)
  repeated_quantity
FROM 
 (SELECT product, v1.time_id, quant
  FROM invent_table PARTITION BY (product)
  RIGHT OUTER JOIN v1
  ON (
v1.time_id = invent_table.time_id))
ORDER BY 1, 2;

The inner query computes a partitioned outer join on time within each product. The inner query densifies the data on the time dimension (meaning the time dimension will now have a row for each day of the week). However, the measure column quantity will have nulls for the newly added rows (see the output in the column quantity in the following results.

The outer query uses the analytic function LAST_VALUE. Applying this function partitions the data by product and orders the data on the time dimension column (time_id). For each row, the fu nction finds the last non-null value in the window due to the option IGNORE NULLS, which you can use with b oth LAST_VALUE and FIRST_VALUE. We see the desired output in the column repeated_quantity in t he following output:

PRODUCT    TIME_ID   QUANTITY REPEATED_QUANTITY
---------- --------- -------- ----
-------------
bottle     01-APR-01       10                10
bottle     02-APR-01                         10
bottle     03-APR-01
                      10
bottle     04-APR-01                         10
bottle     05-APR-01                         10
bottle     0
6-APR-01        8                 8
bottle     07-APR-01                          8
can        01-APR-01       15                15
c
an        02-APR-01                         15
can        03-APR-01                         15
can        04-APR-01       11
       11
can        05-APR-01                         11
can        06-APR-01                         11
can        07-APR-01
                  11

Computing Data Values to Fill Gaps

Examples in previous section illustrate how to use partitioned outer join to fill gaps in one or more dimensions. However, the r esult sets produced by partitioned outer join have null values for columns that are not included in the PARTITION BY list. Typically, these are measure columns. Users can make use of analytic SQL functions to replace those null values with a non-null value.

For example, the following query computes monthly totals for products 64MB Memory card and DVD-R Discs (prod uct IDs 122 and 136) for the year 2000. It uses partitioned outer join to densify data for all months. For the missing months, it the n uses the analytic SQL function AVG to compute the sales and units to be the average of the months when the product was sold.

If working in SQL*Plus, the following two commands will wrap the column headings for greater readability of results:

col computed_units  heading  'Computed|_units'
col computed_sales  heading  'Computed|_sales'

WITH V AS
 (SELECT substr(p.prod_name,1,12) prod_name, calendar_month_desc,
     SUM(quantity_sold) units, SUM(amount_sold) sales
   FROM sales
 s, products p, times t
   WHERE s.prod_id in (122,136) AND calendar_year = 2000
     AND t.time_id = s.time_id
     AND p.prod_id =
s.prod_id
   GROUP BY p.prod_name, calendar_month_desc)
SELECT v.prod_name, calendar_month_desc, units, sales,
  NVL(units, AVG(units
) OVER (partition by v.prod_name)) computed_units,
  NVL(sales, AVG(sales) OVER (partition by v.prod_name)) computed_sales
FROM 
  (S
ELECT DISTINCT calendar_month_desc
   FROM times
   WHERE calendar_year = 2000) t
   LEFT OUTER JOIN V
   PARTITION BY (prod_name)
 USING (calendar_month_desc);
                                              computed   computed
PROD_NAME    CALENDAR      UNITS
 SALES     _units     _sales
------------ -------- ---------- ---------- ---------- ----------
64MB Memory  2000-01         112    41
29.72        112    4129.72
64MB Memory  2000-02         190       7049        190       7049
64MB Memory  2000-03          47    172
4.98         47    1724.98
64MB Memory  2000-04          20      739.4         20      739.4
64MB Memory  2000-05          47    1738
.24         47    1738.24
64MB Memory  2000-06          20      739.4         20      739.4
64MB Memory  2000-07
   72.6666667    2686.79
64MB Memory  2000-08                        72.6666667    2686.79
64MB Memory  2000-09
  72.6666667    2686.79
64MB Memory  2000-10                        72.6666667    2686.79
64MB Memory  2000-11
 72.6666667    2686.79
64MB Memory  2000-12                        72.6666667    2686.79
DVD-R Discs, 2000-01         167     3683.5
       167     3683.5
DVD-R Discs, 2000-02         152    3362.24        152    3362.24
DVD-R Discs, 2000-03         188    4148.02
      188    4148.02
DVD-R Discs, 2000-04         144    3170.09        144    3170.09
DVD-R Discs, 2000-05         189    4164.87
     189    4164.87
DVD-R Discs, 2000-06         145    3192.21        145    3192.21
DVD-R Discs, 2000-07
 124.25    2737.71
DVD-R Discs, 2000-08                            124.25    2737.71
DVD-R Discs, 2000-09           1      18.91
     1      18.91
DVD-R Discs, 2000-10                            124.25    2737.71
DVD-R Discs, 2000-11                            1
24.25    2737.71
DVD-R Discs, 2000-12           8     161.84          8     161.84

Time Series Calculations on Densified Data

Densification is not just for reporting purpose. It also enables certain types of calculations, especially, time series calculations. Time series calculations are easier whe n data is dense along the time dimension. Dense data has a consistent number of rows for each time periods which in turn make it simp le to use analytic window functions with physical offsets.

To illustrate, let's first take the example on "Filling Gaps in Data", and let's add an analytic function to that query. In the following enhanced version, we calculate weekly year-to-date sales alongside the weekly sales. The NULL values that the partitioned outer join inserts in making the tim e series dense are handled in the usual way: the SUM function treats them as 0's.

SELECT P
roduct_Name, t.Year, t.Week, NVL(Sales,0) Current_sales,
 SUM(Sales) 
   OVER (PARTITION BY Product_Name, t.year ORDER BY t.week) Cum
ulative_sales
FROM 
 (SELECT SUBSTR(p.Prod_Name,1,15) Product_Name, t.Calendar_Year Year,
   t.Calendar_Week_Number Week, SUM(Amount_
Sold) Sales
  FROM Sales s, Times t, Products p
  WHERE s.Time_id = t.Time_id AND
   s.Prod_id = p.Prod_id AND p.Prod_name IN ('Bounc
e') AND
   t.Calendar_Year IN (2000,2001) AND
   t.Calendar_Week_Number BETWEEN 20 AND 30
  GROUP BY p.Prod_Name, t.Calendar_Year, t.
Calendar_Week_Number) v
PARTITION BY (v.Product_Name)
RIGHT OUTER JOIN
(SELECT DISTINCT
 Calendar_Week_Number Week, Calendar_Year Yea
r
 FROM Times
 WHERE Calendar_Year in (2000, 2001)
 AND Calendar_Week_Number BETWEEN 20 AND 30) t
ON (v.week = t.week AND v.Year = t.
Year)
ORDER BY t.year, t.week;

PRODUCT_NAME          YEAR       WEEK CURRENT_SALES CUMULATIVE_SALES
--------------- ---------- -----
----- ------------- ----------------
Bounce                2000         20           801              801
Bounce                2000
        21       4062.24          4863.24
Bounce                2000         22       2043.16           6906.4
Bounce
2000         23       2731.14          9637.54
Bounce                2000         24       4419.36          14056.9
Bounce
     2000         25             0          14056.9
Bounce                2000         26             0          14056.9
Bounce
          2000         27       2297.29         16354.19
Bounce                2000         28       1443.13         17797.32
Bounce
               2000         29       1927.38          19724.7
Bounce                2000         30       1927.38         21652.08
Bo
unce                2001         20        1483.3           1483.3
Bounce                2001         21       4184.49          5667.
79
Bounce                2001         22       2609.19          8276.98
Bounce                2001         23       1416.95
9693.93
Bounce                2001         24       3149.62         12843.55
Bounce                2001         25       2645.98
    15489.53
Bounce                2001         26             0         15489.53
Bounce                2001         27       2125.12
         17614.65
Bounce                2001         28             0         17614.65
Bounce                2001         29       24
67.92         20082.57
Bounce                2001         30       2620.17         22702.74

Per iod-to-Period Comparison for One Time Level: Example

How do we use this feature to compare values across time periods? Specifically, how do we calculate a year-over-year sales comparison at the week level? The following query returns on the same row, for each product, the year-to-date sales for each week of 2001 with that of 2000.

Note that in this example we start with a WITH clause. This improves readability of the query and lets us focus on the partitioned outer join. If working in SQL*Plu s, the following command will wrap the column headings for greater readability of results:

col Weekly_y
td_sales_prior_year  heading  'Weekly_ytd|_sales_|prior_year'

WITH v AS
  (SELECT SUBSTR(p.Prod_Name,1,6) Prod, t.Calendar_Year Year
,
     t.Calendar_Week_Number Week, SUM(Amount_Sold) Sales
   FROM Sales s, Times t, Products p
   WHERE s.Time_id = t.Time_id AND
   s.Prod_id = p.Prod_id AND p.Prod_name in ('Y Box') AND
     t.Calendar_Year in (2000,2001) AND 
     t.Calendar_Week_Number BETWEE
N 30 AND 40
   GROUP BY p.Prod_Name, t.Calendar_Year, t.Calendar_Week_Number)
SELECT Prod , Year, Week, Sales,
  Weekly_ytd_sales, We
ekly_ytd_sales_prior_year
FROM 
  (SELECT Prod, Year, Week, Sales, Weekly_ytd_sales,
     LAG(Weekly_ytd_sales, 1) OVER 
       (PART
ITION BY Prod , Week ORDER BY Year) Weekly_ytd_sales_prior_year
   FROM 
     (SELECT v.Prod  Prod , t.Year Year, t.Week Week,
  NVL(v.Sales,0) Sales, SUM(NVL(v.Sales,0)) OVER
          (PARTITION BY v.Prod , t.Year ORDER BY t.week) weekly_ytd_sales
      FROM
 v
      PARTITION BY (v.Prod )
      RIGHT OUTER JOIN
        (SELECT DISTINCT Calendar_Week_Number Week, Calendar_Year Year
  FROM Times
         WHERE Calendar_Year IN (2000, 2001)) t
      ON (v.week = t.week AND v.Year = t.Year)
     ) dense_sales
  ) ye
ar_over_year_salesWHERE Year = 2001 AND Week BETWEEN 30 AND 40
ORDER BY 1, 2, 3;
     Weekly_ytd
                                                            _sales_
PROD         YEAR       WEEK      SALES WEEKLY_YT
D_SALES prior_year
------ ---------- ---------- ---------- ---------------- ----------
Y Box        2001         30    7877.45
   7877.45          0
Y Box        2001         31   13082.46         20959.91    1537.35
Y Box        2001         32   11569.02
     32528.93    9531.57
Y Box        2001         33   38081.97          70610.9   39048.69
Y Box        2001         34   33109.65
       103720.55   69100.79
Y Box        2001         35          0        103720.55   71265.35
Y Box        2001         36     4169
.3        107889.85   81156.29
Y Box        2001         37   24616.85         132506.7   95433.09
Y Box        2001         38   377
39.65        170246.35  107726.96
Y Box        2001         39     284.95         170531.3   118817.4
Y Box        2001         40
10868.44        181399.74  120969.69

In the FROM clause of the in-line view dense_sales, we use a partitioned outer join of aggregate view v and time view t to fill gaps in the sales data along the time dimension. The output of the partitioned outer join is then processed by the analytic function SUM ... OVER to compute the weekly year-to-date sales (the weekly_ytd_sales column). Thus, the view dense_sales computes the year-t o-date sales data for each week, including those missing in the aggregate view s. The in-line view year_over_year_sales then computes the year ago weekly year-to-date sales using the LAG function. The LAG function labeled weekly_ytd_sales_prior_year specifies a PARTITION BY clause that pairs rows for the same week of ye ars 2000 and 2001 into a single partition. We then pass an offset of 1 to the LAG function to get the weekly year to dat e sales for the prior year.The outermost query block selects data from year_over_year_sales with the condition yr = 2001 , and thus the query returns, for each product, its weekly year-to-date sales in the specified weeks of years 2001 and 2000.

Period-to-Period Comparison for Multiple Time Levels: Example

W hile the prior example shows us a way to create comparisons for a single time level, it would be even more useful to handle multiple time levels in a single query. For example, we could compare sales versus the prior period at the year, quarter, month and day levels . How can we create a query which performs a year-over-year comparison of year-to-date sales for all levels of our time hierarchy?

We will take several steps to perform this task. The goal is a single query with comparisons at the day, week, month, quarter, a nd year level. The steps are as follows:

  1. We will create a view called cube_prod_time , which holds a hierarchical cube of sales aggregated across times and products.

  2. Then we w ill create a view of the time dimension to use as an edge of the cube. The time edge, which holds a complete set of dates, will be pa rtitioned outer joined to the sparse data in the view cube_prod_time.

  3. Finally, for maximum performance, we will create a materialized view, mv_prod_time, built using the same definition as cube_prod_time.

For more information regarding hierarchical cubes, see Chapter 20, " SQL for Aggregation in Data Warehouses". The materialized view is defined using the following statement:


Step 1 Create the hierarchical cube view

The materialized view shown in the following may al ready exist in your system; if not, create it now. If you must generate it, please note that we limit the query to just two products to keep processing time short:

CREATE OR REPLACE VIEW cube_prod_time AS
SELECT 
  (CASE
     WHEN ((GRO
UPING(calendar_year)=0 ) 
       AND (GROUPING(calendar_quarter_desc)=1 ))
       THEN (TO_CHAR(calendar_year) || '_0')
     WHEN ((G
ROUPING(calendar_quarter_desc)=0 )
       AND (GROUPING(calendar_month_desc)=1 ))
       THEN (TO_CHAR(calendar_quarter_desc) || '_1'
)
     WHEN ((GROUPING(calendar_month_desc)=0 )
       AND (GROUPING(t.time_id)=1 ))
       THEN (TO_CHAR(calendar_month_desc) || '_2
')
     ELSE (TO_CHAR(t.time_id) || '_3')
  END) Hierarchical_Time,
  calendar_year year, calendar_quarter_desc quarter,
  calendar_m
onth_desc month, t.time_id day,
  prod_category cat, prod_subcategory subcat, p.prod_id prod,
  GROUPING_ID(prod_category, prod_subca
tegory, p.prod_id,
    calendar_year, calendar_quarter_desc, calendar_month_desc,t.time_id) gid,
  GROUPING_ID(prod_category, prod_su
bcategory, p.prod_id) gid_p,
  GROUPING_ID(calendar_year, calendar_quarter_desc,
    calendar_month_desc, t.time_id) gid_t,
  SUM(amo
unt_sold) s_sold, COUNT(amount_sold) c_sold, COUNT(*) cnt
FROM SALES s, TIMES t, PRODUCTS p
WHERE s.time_id = t.time_id AND
  p.prod_
name IN ('Bounce', 'Y Box') AND s.prod_id = p.prod_id
GROUP BY
  ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc, t.
time_id),
  ROLLUP(prod_category, prod_subcategory, p.prod_id);

Because this view is limited to two products, it returns just over 2200 rows. Note that the column Hierarchical_Time contains string representations of time from all levels of t he time hierarchy. The CASE expression used for the Hierarchical_Time column appends a marker (_0, _1, ...) to each date string to denote the time level of the value. A _0 represents the year level, _1 is quarters, _2 is months, and _3 is d ay. Note that the GROUP BY clause is a concatenated ROLLUP which specifies the rollup hierarch y for the time and product dimensions. The GROUP BY clause is what determines the hierarchical cube content s.

Step 2 Create the view edge_time, which is a complete set of date values< /font>

edge_time is the source for filling time gaps in the hierarchical cube using a partitioned outer join. The column Hierarchical_Time in edge_time will be used in a partitioned join with the Hierarchi cal_Time column in the view cube_prod_time. The following statement defines edge_time:

CREATE OR REPLACE VIEW edge_time AS
SELECT 
  (CASE
     WHEN ((GROUPING(calendar_year)=0 )
       AND (GROUPING(cal
endar_quarter_desc)=1 ))
       THEN (TO_CHAR(calendar_year) || '_0')
     WHEN ((GROUPING(calendar_quarter_desc)=0 )
       AND (GRO
UPING(calendar_month_desc)=1 ))
       THEN (TO_CHAR(calendar_quarter_desc) || '_1')
     WHEN ((GROUPING(calendar_month_desc)=0 )
     AND (GROUPING(time_id)=1 ))
       THEN (TO_CHAR(calendar_month_desc) || '_2')
     ELSE (TO_CHAR(time_id) || '_3')
   END) Hier
archical_Time,
   calendar_year yr, calendar_quarter_number qtr_num,
   calendar_quarter_desc qtr, calendar_month_number mon_num,
calendar_month_desc mon, time_id - TRUNC(time_id, 'YEAR') + 1 day_num,
   time_id day,
GROUPING_ID(calendar_year, calendar_quarter_de
sc,
  calendar_month_desc, time_id) gid_t
FROM TIMES
GROUP BY ROLLUP
 (calendar_year, (calendar_quarter_desc, calendar_quarter_number
),
 (calendar_month_desc, calendar_month_number), time_id);

Step 3 Creat e the materialized view mv_prod_time to support faster performance

The materialized view definition is a dupli cate of the view cube_prod_time defined earlier. Because it is a duplicate query, references to cube_prod_time will be rewritten to use the mv_prod_time materialized view. The following materialized may already exist in your sy stem; if not, create it now. If you must generate it, please note that we limit the query to just two products to keep processing tim e short.

CREATE MATERIALIZED VIEW mv_prod_time
REFRESH COMPLETE ON DEMAND AS
SELECT 
  (CASE
     WHEN
((GROUPING(calendar_year)=0 )
       AND (GROUPING(calendar_quarter_desc)=1 ))
       THEN (TO_CHAR(calendar_year) || '_0')
     WHEN
 ((GROUPING(calendar_quarter_desc)=0 )
       AND (GROUPING(calendar_month_desc)=1 ))
       THEN (TO_CHAR(calendar_quarter_desc) ||
'_1')
     WHEN ((GROUPING(calendar_month_desc)=0 )
       AND (GROUPING(t.time_id)=1 ))
       THEN (TO_CHAR(calendar_month_desc) ||
 '_2')
     ELSE (TO_CHAR(t.time_id) || '_3')
  END) Hierarchical_Time,
  calendar_year year, calendar_quarter_desc quarter,
  calend
ar_month_desc month, t.time_id day,
  prod_category cat, prod_subcategory subcat, p.prod_id prod,
  GROUPING_ID(prod_category, prod_s
ubcategory, p.prod_id,
    calendar_year, calendar_quarter_desc, calendar_month_desc,t.time_id) gid,
  GROUPING_ID(prod_category, pro
d_subcategory, p.prod_id) gid_p,
  GROUPING_ID(calendar_year, calendar_quarter_desc,
    calendar_month_desc, t.time_id) gid_t,
  SUM
(amount_sold) s_sold, COUNT(amount_sold) c_sold, COUNT(*) cnt
FROM SALES s, TIMES t, PRODUCTS p
WHERE s.time_id = t.time_id AND
  p.p
rod_name IN ('Bounce', 'Y Box') AND s.prod_id = p.prod_id
GROUP BY
  ROLLUP(calendar_year, calendar_quarter_desc, calendar_month_desc
, t.time_id),
  ROLLUP(prod_category, prod_subcategory, p.prod_id);

Step 4 Create the comparison query

We have now set the stage for our comparison query. We can obtain period-to-per iod comparison calculations at all time levels. It requires applying analytic functions to a hierarchical cube with dense data along the time dimension.

Some of the calculations we can achieve for each time level are:

  • Sum of sales for prior period at all levels of time.

  • Variance in sales over prior period.

  • Sum of sales in the same period a year ago at all levels of time.

  • Variance in sales over the same per iod last year.

The following example performs all four of these calculations. It uses a partitioned outer join of th e views cube_prod_time and edge_time to create an in-line view of dense data called dense_cube_prod_t ime. The query then uses the LAG function in the same way as the prior single-level example. The outer WHER E clause specifies time at three levels: the days of August 2001, the entire month, and the entire third quarter of 2001. Note that the last two rows of the results contain the month level and quarter level aggregations.Note: To make the results easier to rea d if you are using SQL*Plus, the column headings should be adjusted with the following commands. The commands will fold the column he adings to reduce line length:

col sales_prior_period heading 'sales_prior|_period'
col variance_prior_p
eriod heading 'variance|_prior|_period'
col sales_same_period_prior_year heading 'sales_same|_period_prior|_year'
col variance_same_p
eriod_p_year heading 'variance|_same_period|_prior_year'

Here is the query comparing current sales to prior and year ago sales:

SELECT SUBSTR(prod,1,4) prod, SUBSTR(Hierarchical_Time,1,12) ht, 
  sales, sales_prior_period, 
  sales - sales_prior_period variance_prior_period,
  sales_same_period_prior_year,
  sales - sales_same_period_prior_year variance_s
ame_period_p_year
FROM
 (SELECT cat, subcat, prod, gid_p, gid_t, 
    Hierarchical_Time, yr, qtr, mon, day, sales,
    LAG(sales, 1)
OVER (PARTITION BY gid_p, cat, subcat, prod,
      gid_t ORDER BY yr, qtr, mon, day)
      sales_prior_period,
    LAG(sales, 1) OVER
 (PARTITION BY gid_p, cat, subcat, prod,
      gid_t, qtr_num, mon_num, day_num ORDER BY yr)
      sales_same_period_prior_year
  FRO
M 
   (SELECT c.gid, c.cat, c.subcat, c.prod, c.gid_p,
      t.gid_t, t.yr, t.qtr, t.qtr_num, t.mon, t.mon_num,
      t.day, t.day_nu
m, t.Hierarchical_Time, NVL(s_sold,0) sales
    FROM cube_prod_time c
    PARTITION BY (gid_p, cat, subcat, prod)
    RIGHT OUTER JOI
N edge_time t
    ON ( c.gid_t = t.gid_t AND 
      c.Hierarchical_Time = t.Hierarchical_Time)
   ) dense_cube_prod_time
 )
              --side by side current and prior year sales
WHERE prod IN (139) AND gid_p=0 AND    --1 product and product level data
 ( (mon IN ('2001-08' ) AND gid_t IN (0, 1)) OR --day and month data
  (qtr IN ('2001-03' ) AND gid_t IN (3)))        --quarter level
 data
ORDER BY day;

                                           variance    sales_same     variance
                             sale
s_prior     _prior _period_prior _same_period
PROD HT                SALES     _period    _period         _year  _prior_year
---- ---
--------- ---------- ----------- ---------- ------------- ------------
139  01-AUG-01_3           0           0          0
  0            0
139  02-AUG-01_3     1347.53           0    1347.53             0      1347.53
139  03-AUG-01_3           0     1347
.53   -1347.53         42.36       -42.36
139  04-AUG-01_3       57.83           0      57.83        995.75      -937.92
139  05-AUG-
01_3           0       57.83     -57.83             0            0
139  06-AUG-01_3           0           0          0             0
           0
139  07-AUG-01_3      134.81           0     134.81        880.27      -745.46
139  08-AUG-01_3     1289.89      134.81
   1155.08             0      1289.89
139  09-AUG-01_3           0     1289.89   -1289.89             0            0
139  10-AUG-01_3
           0           0          0             0            0
139  11-AUG-01_3           0           0          0             0
       0
139  12-AUG-01_3           0           0          0             0            0
139  13-AUG-01_3           0           0
     0             0            0
139  14-AUG-01_3           0           0          0             0            0
139  15-AUG-01_3
   38.49           0      38.49       1104.55     -1066.06
139  16-AUG-01_3           0       38.49     -38.49             0
   0
139  17-AUG-01_3       77.17           0      77.17       1052.03      -974.86
139  18-AUG-01_3     2467.54       77.17    2390.
37             0      2467.54
139  19-AUG-01_3           0     2467.54   -2467.54        127.08      -127.08
139  20-AUG-01_3
   0           0          0             0            0
139  21-AUG-01_3           0           0          0             0            0

139  22-AUG-01_3           0           0          0             0            0
139  23-AUG-01_3     1371.43           0    1371.43
           0      1371.43
139  24-AUG-01_3      153.96     1371.43   -1217.47        2091.3     -1937.34
139  25-AUG-01_3           0
      153.96    -153.96             0            0
139  26-AUG-01_3           0           0          0             0            0
139
  27-AUG-01_3     1235.48           0    1235.48             0      1235.48
139  28-AUG-01_3       173.3     1235.48   -1062.18
 2075.64     -1902.34
139  29-AUG-01_3           0       173.3     -173.3             0            0
139  30-AUG-01_3           0
       0          0             0            0
139  31-AUG-01_3           0           0          0             0            0
139  20
01-08_2       8347.43     7213.21    1134.22       8368.98       -21.55
139  2001-03_1       24356.8    28862.14   -4505.34      2416
8.99       187.81

The first LAG function (sales_prior_period) partitions the data on gid_ p, cat, subcat, prod, gid_t and orders the rows on all the time dimension columns. It gets the sales value of the prior period by passing an offset of 1. The second LAG function (sales_sam e_period_prior_year) partitions the data on additional columns qtr_num, mon_num, and day_num and orders it on yr so that, with an offset of 1, it can compute the year ago sales for the same period. The outermost SE LECT clause computes the variances.