Date Dimension vs. Function-Based Indexes: Is the Date Dimension Still Relevant?

Posted in: Technical Track

The purpose of this post is to verify if a date dimension is better in regards to performance and functionality than a series of function-based indexes on a date column in the fact table.

The following is the date dimension and fact table used for the testing:

Date Dimension

create table date_dimension(
date_id number, --1..x days
sql_date date, -- trunc(date)
day_of_week varchar2(30), -- (long version)
day_of_week_short varchar2(3), -- (short version)
day_number_in_week number, -- (1..7)
day_number_in_month number, --(1..31)
day_number_in_year number, --(1..366)
workday varchar2(10), --(either workday or holiday)
holiday varchar2(50), 
  --(specify text e.g. christmas, easter...)
weekday varchar2(10), -- (specify weekday or weekend)
last_day_in_month number, -- (28,29,30 or 31)
week_number_in_year number , --1..52
week_number_in_month number, --1,5
month varchar2(10) ,--(January...)
month_short varchar2(3), -- (Jan...)
month_number_in_year number, -- (1..12)
quarter_in_year number, -- (1,2,3,4)
year number, -- (4 digits)
chinesezodiac varchar2(50),
zodiac varchar2(50) ,
event varchar2(100) -- (election day,Hurricane Hugo) text field
) ;

The are rows for every day from the year 1850 to 2050 (way after I retire). There is a primary key on date_id, a unique key on sql_date, and bitmap indexes on all other columns. The table is compressed and its stats computed.

Segment sizes:

WEEK_NUMBER_IN_YEAR_IDX	.125
MONTH_IDX	.0625
MONTH_SHORT_IDX	.0625
MONTH_NUMBER_IN_YEAR_IDX	.0625
QUARTER_IN_YEAR_IDX	.0625
YEAR_IDX	.0625
EVENT_IDX	.0625
WEEK_NUMBER_IN_MONTH_IDX	.125
CHINESEZODIAC_IDX	.0625
ZODIAC_IDX	.0625
WORKDAY_IDX	.0625
HOLIDAY_IDX	.0625
WEEKDAY_IDX	.0625
LAST_DAY_IN_MONTH_IDX	.0625
DATE_DIMENSION_SQL_DATE	2
DATE_DIMENSION_PK	2
DAY_OF_WEEK_IDX	.1875
DAY_OF_WEEK_SHORT_IDX	.1875
DAY_NUMBER_IN_WEEK_IDX	.1875
DAY_NUMBER_IN_MONTH_IDX	.1875
DAY_NUMBER_IN_YEAR_IDX	.3125

Fact Table

create table fact (id number, tx_date_dm_id number, tx_date date, amount number);

There are 1 million rows in this table, with a compressed index on the date_id and tx_date (since it’s truncated to the day), as well as the following function-based indexes on the tx_date column:

create index fact_tx_date_mm on fact( to_char(tx_date,'mm')) ;
create index fact_tx_date_yyyy on fact( trunc(tx_date,'yyyy')) ;
create index fact_tx_date_wnim on fact( to_char(tx_date,'w')) ;
create index fact_tx_date_dow on fact( to_char(tx_date,'dy')) ;

Segment Sizes:

FACT	32
FACT_TX_DATE_YYYY	.25
FACT_TX_DATE_WNIM	.25
FACT_TX_DATE_DOW	.25
FACT_TX_DATE_MM	.25
FACT_TX_DATE_IDX	13
FACT_DATE_ID_IDX	13

So let’s start looking at typical queries and see how they perform.

Query 1: sum of amount for a specific date range

The fact date column-based query:

select sum(amount), count(*) from fact f where f.tx_date between to_date('20070501','yyyymmdd') and to_date('20070601','yyyymmdd');

SUM(AMOUNT)   COUNT(*)
----------- ----------
 9.8095E+10      33251

Elapsed: 00:00:00.38

Execution Plan
----------------------------------------------------------
Plan hash value: 2208459562

-------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                  |     1 |    21 |   324   (1)| 00:00:04 |
|   1 |  SORT AGGREGATE              |                  |     1 |    21 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID| FACT             | 34530 |   708K|   324   (1)| 00:00:04 |
|*  3 |    INDEX RANGE SCAN          | FACT_TX_DATE_IDX | 34530 |       |    57   (2)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("F"."TX_DATE">=TO_DATE('2007-05-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
              "F"."TX_DATE"<=TO_DATE('2007-06-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        309  consistent gets
        107  physical reads
          0  redo size
        594  bytes sent via SQL*Net to client
        469  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

The date dimension-based query:

select  sum(amount), count(*) from fact f, date_dimension d where f.date_id  = d.date_id and sql_date between to_date('20070501','yyyymmdd') and to_date('20070601','yyyymmdd');

SUM(AMOUNT)   COUNT(*)
----------- ----------
 1.1881E+12     400584

Elapsed: 00:00:29.95

Execution Plan
----------------------------------------------------------
Plan hash value: 182493474

----------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |                         |     1 |    31 |   116   (1)| 00:00:02 |
|   1 |  SORT AGGREGATE                |                         |     1 |    31 |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID  | FACT                    |  1437 | 25866 |    15   (0)| 00:00:01 |
|   3 |    NESTED LOOPS                |                         | 47414 |  1435K|   116   (1)| 00:00:02 |
|   4 |     TABLE ACCESS BY INDEX ROWID| DATE_DIMENSION          |    33 |   429 |     4   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN          | DATE_DIMENSION_SQL_DATE |    33 |      |     2   (0)| 00:00:01 |
|*  6 |     INDEX RANGE SCAN           | FACT_DATE_ID_IDX        |  1437 |      |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   5 - access("SQL_DATE">=TO_DATE('2007-05-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND
              "SQL_DATE"<=TO_DATE('2007-06-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss'))
   6 - access("F"."DATE_ID"="D"."DATE_ID")

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        415  consistent gets
          0  physical reads
          0  redo size
        594  bytes sent via SQL*Net to client
        469  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed

Query 2: sum of amount for all Thanksgiving days

The fact-only-based query:

select sum(amount), count(*) from fact where to_char(tx_date,'dy') = 'thu' and to_char(tx_date,'mm')='11' and to_char(tx_date,'w')='4';

Elapsed: 00:00:00.07

Execution Plan
----------------------------------------------------------
Plan hash value: 1038184904

---------------------------------------------------------------------------------------------------
| Id | Operation 			| Name 		    | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT 			| 		    |    1 |    21 | 374 (1)| 00:00:05 |
| 1 | SORT AGGREGATE 			| 		    |    1 |    21 |        |          |
| 2 | TABLE ACCESS BY INDEX ROWID 	| FACT 		    | 2381 | 50001 | 374 (1)| 00:00:05 |
| 3 | BITMAP CONVERSION TO ROWIDS	| 		    |      |       |        | 	       |
| 4 | BITMAP AND 			| 		    |      |       | 	    | 	       |
|* 5 | BITMAP INDEX SINGLE VALUE	| FACT_TX_DATE_MM   |      |       | 	    | 	       |
|* 6 | BITMAP INDEX SINGLE VALUE	| FACT_TX_DATE_DOW  |      |       | 	    | 	       |
|* 7 | BITMAP INDEX SINGLE VALUE	| FACT_TX_DATE_WNIM |      |       | 	    | 	       |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - access(TO_CHAR(INTERNAL_FUNCTION("TX_DATE"),'mm')='11')
6 - access(TO_CHAR(INTERNAL_FUNCTION("TX_DATE"),'dy')='thu')
7 - access(TO_CHAR(INTERNAL_FUNCTION("TX_DATE"),'w')='4')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
39 consistent gets
0 physical reads
0 redo size
591 bytes sent via SQL*Net to client
469 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

The date-dimension-based query:

select sum(amount), count(*) from fact f, date_dimension d where f.date_id = d.date_id and d.holiday = 'Thanksgiving';

Elapsed: 00:00:00.04

Execution Plan
----------------------------------------------------------
Plan hash value: 688750045

----------------------------------------------------------------------------------------------------
| Id | Operation                  | Name             | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT            |                  | 1    | 25    | 758 (2)    | 00:00:10 |
| 1 | SORT AGGREGATE              |                  | 1    | 25    |            | |
| 2 | TABLE ACCESS BY INDEX ROWID | FACT             | 1437 | 25866 | 15 (0)     | 00:00:01 |
| 3 | NESTED LOOPS                |                  | 302K | 7383K | 758 (2)    | 00:00:10 |
| 4 | TABLE ACCESS BY INDEX ROWID | DATE_DIMENSION   | 211  | 1477  | 94 (2)     | 00:00:02 |
| 5 | BITMAP CONVERSION TO ROWIDS |                  |      |       |            | |
|* 6 | BITMAP INDEX SINGLE VALUE  | HOLIDAY_IDX      |      |       |            | |
|* 7 | INDEX RANGE SCAN           | FACT_DATE_ID_IDX | 1437 |       | 3 (0)      | 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

6 - access("D"."HOLIDAY"='Thanksgiving')
7 - access("F"."DATE_ID"="D"."DATE_ID")

Statistics
----------------------------------------------------------
410 recursive calls
0 db block gets
742 consistent gets
0 physical reads
116 redo size
591 bytes sent via SQL*Net to client
469 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
1 rows processed

So far, from a query performance point of view, the function-based indexes significantly outperform the date_dimension. The only performance-related drawback so far is the extra indexes to update during the fact table’s loading, as well as the additional 10-or-so indexes required for every date column in the fact table.

But it is query complexity where the date dimension really shines. Although I agree that most of the time, everybody is interested only in what happened in the last 30 days or year-to-date, which is easy SQL to write. What about something never specified in the specs? What about when some marketing guy gets on the datamart and wants to compare sales on holidays that were on weekend days versus sales on holidays that were on weekdays? Without a pre-populated date dimension, this person would have to figure out which holidays they were interested in, and then historically tie them to dates. That’s relatively easy for Christmas and New Years, but not so obvious for Thanksgiving, and even worse for Good Friday and Easter Sunday.

So even though the date_dimension would be slower, it would take more time to actually write the query otherwise than to run it. This is what it looks like:

select holiday, weekday, sum(amount) from fact f, date_dimension d where f.date_id = d.date_id
and holiday is not null
group by holiday, weekday;

So what are we to do? Like most things, you need a little bit of both. I would have only the regular date column and setup the reporting tool to join the straight date or an index on trunc(date), if it constrains a time component, and a date_id joining to the date_dimension for more complex queries.

email
Want to talk with an expert? Schedule a call with our team to get the conversation started.

2 Comments. Leave new

Vyacheslav Rasskazov
June 3, 2008 10:12 pm

Hi, Robert
Am I correct in thinking that there is little error at date dimension-based query for a specific date range. Condition f.date_id >= d.date_id must be changed for f.date_id = d.date_id. Otherwise, comparision is not correct, because two queries simply produced different results.

Reply
Polarski Bernard
June 4, 2008 1:19 am

The article seems interresting butI gave up. The enclosed SQL output font are so small I can barely read. Setting the character to biggest does not affect this part of the text. I bet I am not the only one to be affected.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *