The purpose of this post is to verify if a date dimension is better in regards to performance and functionality than a series of function-based indexes on a date column in the fact table.
The following is the date dimension and fact table used for the testing:
Date Dimension create table date_dimension( date_id number, --1..x days sql_date date, -- trunc(date) day_of_week varchar2(30), -- (long version) day_of_week_short varchar2(3), -- (short version) day_number_in_week number, -- (1..7) day_number_in_month number, --(1..31) day_number_in_year number, --(1..366) workday varchar2(10), --(either workday or holiday) holiday varchar2(50), --(specify text e.g. christmas, easter...) weekday varchar2(10), -- (specify weekday or weekend) last_day_in_month number, -- (28,29,30 or 31) week_number_in_year number , --1..52 week_number_in_month number, --1,5 month varchar2(10) ,--(January...) month_short varchar2(3), -- (Jan...) month_number_in_year number, -- (1..12) quarter_in_year number, -- (1,2,3,4) year number, -- (4 digits) chinesezodiac varchar2(50), zodiac varchar2(50) , event varchar2(100) -- (election day,Hurricane Hugo) text field ) ;
The are rows for every day from the year 1850 to 2050 (way after I retire). There is a primary key on date_id
, a unique key on sql_date
, and bitmap indexes on all other columns. The table is compressed and its stats computed.
Segment sizes: WEEK_NUMBER_IN_YEAR_IDX .125 MONTH_IDX .0625 MONTH_SHORT_IDX .0625 MONTH_NUMBER_IN_YEAR_IDX .0625 QUARTER_IN_YEAR_IDX .0625 YEAR_IDX .0625 EVENT_IDX .0625 WEEK_NUMBER_IN_MONTH_IDX .125 CHINESEZODIAC_IDX .0625 ZODIAC_IDX .0625 WORKDAY_IDX .0625 HOLIDAY_IDX .0625 WEEKDAY_IDX .0625 LAST_DAY_IN_MONTH_IDX .0625 DATE_DIMENSION_SQL_DATE 2 DATE_DIMENSION_PK 2 DAY_OF_WEEK_IDX .1875 DAY_OF_WEEK_SHORT_IDX .1875 DAY_NUMBER_IN_WEEK_IDX .1875 DAY_NUMBER_IN_MONTH_IDX .1875 DAY_NUMBER_IN_YEAR_IDX .3125 Fact Table create table fact (id number, tx_date_dm_id number, tx_date date, amount number);
There are 1 million rows in this table, with a compressed index on the date_id
and tx_date
(since it’s truncated to the day), as well as the following function-based indexes on the tx_date
column:
create index fact_tx_date_mm on fact( to_char(tx_date,'mm')) ; create index fact_tx_date_yyyy on fact( trunc(tx_date,'yyyy')) ; create index fact_tx_date_wnim on fact( to_char(tx_date,'w')) ; create index fact_tx_date_dow on fact( to_char(tx_date,'dy')) ;
Segment Sizes:
FACT 32 FACT_TX_DATE_YYYY .25 FACT_TX_DATE_WNIM .25 FACT_TX_DATE_DOW .25 FACT_TX_DATE_MM .25 FACT_TX_DATE_IDX 13 FACT_DATE_ID_IDX 13
So let’s start looking at typical queries and see how they perform.
Query 1: sum of amount for a specific date range
The fact date column-based query:
select sum(amount), count(*) from fact f where f.tx_date between to_date('20070501','yyyymmdd') and to_date('20070601','yyyymmdd'); SUM(AMOUNT) COUNT(*) ----------- ---------- 9.8095E+10 33251 Elapsed: 00:00:00.38 Execution Plan ---------------------------------------------------------- Plan hash value: 2208459562 ------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 21 | 324 (1)| 00:00:04 | | 1 | SORT AGGREGATE | | 1 | 21 | | | | 2 | TABLE ACCESS BY INDEX ROWID| FACT | 34530 | 708K| 324 (1)| 00:00:04 | |* 3 | INDEX RANGE SCAN | FACT_TX_DATE_IDX | 34530 | | 57 (2)| 00:00:01 | ------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("F"."TX_DATE">=TO_DATE('2007-05-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND "F"."TX_DATE"<=TO_DATE('2007-06-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')) Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 309 consistent gets 107 physical reads 0 redo size 594 bytes sent via SQL*Net to client 469 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed
The date dimension-based query:
select sum(amount), count(*) from fact f, date_dimension d where f.date_id = d.date_id and sql_date between to_date('20070501','yyyymmdd') and to_date('20070601','yyyymmdd'); SUM(AMOUNT) COUNT(*) ----------- ---------- 1.1881E+12 400584 Elapsed: 00:00:29.95 Execution Plan ---------------------------------------------------------- Plan hash value: 182493474 ---------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 31 | 116 (1)| 00:00:02 | | 1 | SORT AGGREGATE | | 1 | 31 | | | | 2 | TABLE ACCESS BY INDEX ROWID | FACT | 1437 | 25866 | 15 (0)| 00:00:01 | | 3 | NESTED LOOPS | | 47414 | 1435K| 116 (1)| 00:00:02 | | 4 | TABLE ACCESS BY INDEX ROWID| DATE_DIMENSION | 33 | 429 | 4 (0)| 00:00:01 | |* 5 | INDEX RANGE SCAN | DATE_DIMENSION_SQL_DATE | 33 | | 2 (0)| 00:00:01 | |* 6 | INDEX RANGE SCAN | FACT_DATE_ID_IDX | 1437 | | 3 (0)| 00:00:01 | ---------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 5 - access("SQL_DATE">=TO_DATE('2007-05-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND "SQL_DATE"<=TO_DATE('2007-06-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss')) 6 - access("F"."DATE_ID"="D"."DATE_ID") Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 415 consistent gets 0 physical reads 0 redo size 594 bytes sent via SQL*Net to client 469 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed
Query 2: sum of amount for all Thanksgiving days
The fact-only-based query:
select sum(amount), count(*) from fact where to_char(tx_date,'dy') = 'thu' and to_char(tx_date,'mm')='11' and to_char(tx_date,'w')='4'; Elapsed: 00:00:00.07 Execution Plan ---------------------------------------------------------- Plan hash value: 1038184904 --------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 21 | 374 (1)| 00:00:05 | | 1 | SORT AGGREGATE | | 1 | 21 | | | | 2 | TABLE ACCESS BY INDEX ROWID | FACT | 2381 | 50001 | 374 (1)| 00:00:05 | | 3 | BITMAP CONVERSION TO ROWIDS | | | | | | | 4 | BITMAP AND | | | | | | |* 5 | BITMAP INDEX SINGLE VALUE | FACT_TX_DATE_MM | | | | | |* 6 | BITMAP INDEX SINGLE VALUE | FACT_TX_DATE_DOW | | | | | |* 7 | BITMAP INDEX SINGLE VALUE | FACT_TX_DATE_WNIM | | | | | --------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 5 - access(TO_CHAR(INTERNAL_FUNCTION("TX_DATE"),'mm')='11') 6 - access(TO_CHAR(INTERNAL_FUNCTION("TX_DATE"),'dy')='thu') 7 - access(TO_CHAR(INTERNAL_FUNCTION("TX_DATE"),'w')='4') Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 39 consistent gets 0 physical reads 0 redo size 591 bytes sent via SQL*Net to client 469 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed
The date-dimension-based query:
select sum(amount), count(*) from fact f, date_dimension d where f.date_id = d.date_id and d.holiday = 'Thanksgiving'; Elapsed: 00:00:00.04 Execution Plan ---------------------------------------------------------- Plan hash value: 688750045 ---------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 25 | 758 (2) | 00:00:10 | | 1 | SORT AGGREGATE | | 1 | 25 | | | | 2 | TABLE ACCESS BY INDEX ROWID | FACT | 1437 | 25866 | 15 (0) | 00:00:01 | | 3 | NESTED LOOPS | | 302K | 7383K | 758 (2) | 00:00:10 | | 4 | TABLE ACCESS BY INDEX ROWID | DATE_DIMENSION | 211 | 1477 | 94 (2) | 00:00:02 | | 5 | BITMAP CONVERSION TO ROWIDS | | | | | | |* 6 | BITMAP INDEX SINGLE VALUE | HOLIDAY_IDX | | | | | |* 7 | INDEX RANGE SCAN | FACT_DATE_ID_IDX | 1437 | | 3 (0) | 00:00:01 | ---------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 6 - access("D"."HOLIDAY"='Thanksgiving') 7 - access("F"."DATE_ID"="D"."DATE_ID") Statistics ---------------------------------------------------------- 410 recursive calls 0 db block gets 742 consistent gets 0 physical reads 116 redo size 591 bytes sent via SQL*Net to client 469 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 5 sorts (memory) 0 sorts (disk) 1 rows processed
So far, from a query performance point of view, the function-based indexes significantly outperform the date_dimension
. The only performance-related drawback so far is the extra indexes to update during the fact table’s loading, as well as the additional 10-or-so indexes required for every date column in the fact table.
But it is query complexity where the date dimension really shines. Although I agree that most of the time, everybody is interested only in what happened in the last 30 days or year-to-date, which is easy SQL to write. What about something never specified in the specs? What about when some marketing guy gets on the datamart and wants to compare sales on holidays that were on weekend days versus sales on holidays that were on weekdays? Without a pre-populated date dimension, this person would have to figure out which holidays they were interested in, and then historically tie them to dates. That’s relatively easy for Christmas and New Years, but not so obvious for Thanksgiving, and even worse for Good Friday and Easter Sunday.
So even though the date_dimension
would be slower, it would take more time to actually write the query otherwise than to run it. This is what it looks like:
select holiday, weekday, sum(amount) from fact f, date_dimension d where f.date_id = d.date_id and holiday is not null group by holiday, weekday;
So what are we to do? Like most things, you need a little bit of both. I would have only the regular date
column and setup the reporting tool to join the straight date or an index on trunc(date)
, if it constrains a time component, and a date_id
joining to the date_dimension
for more complex queries.
2 Comments. Leave new
Hi, Robert
Am I correct in thinking that there is little error at date dimension-based query for a specific date range. Condition f.date_id >= d.date_id must be changed for f.date_id = d.date_id. Otherwise, comparision is not correct, because two queries simply produced different results.
The article seems interresting butI gave up. The enclosed SQL output font are so small I can barely read. Setting the character to biggest does not affect this part of the text. I bet I am not the only one to be affected.