Multi-Column Correlation and Extended Stats in Oracle 11g

Riyaj Shamsudeen

April 9, 2008

We all have encountered this situation many times before: the cost-based optimizer assumes no correlation between two columns (until 11g), and this has the effect of erroneously reducing the cardinality of a row source. Incorrect cardinality estimates are one of many root causes for SQL performance issues. Consider the following example.

This code creates a table and populates data with four columns:

create table t_vc as
select mod(n, 100) n1, mod(n, 100) n2 ,
mod(n, 50) n3 , mod(n, 20) n4
from (select level n from dual connect by level <= 10001);

The first two columns, n1 and n2, have a strong correlation: n1 is always equals to n2.

Let’s collect statistics with histograms on all columns.

begin
dbms_stats.gather_Table_stats( user, 'T_VC', estimate_percent => null, method_opt => 'for all columns size 254');
end;
/

Let’s explain the plan for the query, specifying one predicate, exactly 100 rows with a value of 10.

explain plan for select count(*) from t_vc where n1=10;

----------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T_VC |   100 |   300 |     9   (0)| 00:00:01 |
----------------------------------------------------

Yes, the optimizer is able to estimate the cardinality as 100 from row source T_VC at step 2.

Now, let’s add another predicate, n2=10. Since there is a strong correlation between n1 and n2, adding a predicate such as n2=10 should not alter row source cardinality, right?

explain plan for select count(*) from t_vc where n1=10 and n2=10;

select * from table(dbms_xplan.display);

----------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |

----------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_VC |     1 |     6 |     9   (0)| 00:00:01 |
----------------------------------------------------

Nope, the optimizer cardinality is far off and 100 times lower than the correct cardinality.

Notice the rows column at step 2 above and the CBO estimate is just 1. Essentially, the CBO estimate boils down to
a simplified formula:

# of rows~= total # of rows * (1/NDV for n1) * (1/NDV for n2)
= 10000 * (1/100) * (1/100) =1 row.

where selectivity is (1/NDV) in simple case, without considering histograms. It is far from the truth. There are at least 100 rows with n1=10 and n2=10. In complex SQL, this incorrect cardinality estimation will lead to an inefficient access plan.

Extended Stats

Oracle 11g introduces extended stats to relieve some pain. In 11g, extended stats can be added between columns, enabling the CBO to consider correlation between these column values.

SELECT dbms_stats.create_extended_stats(
ownname=>user, tabname => 'T_VC',
extension => '(n1, n2)' ) AS n1_n2_correlation
FROM dual;
N1_n2_correlation
---------------------------------------------------
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS

Letâ€™s collect stats again on this table and check the SQL plan.

exec dbms_stats.gather_Table_stats( user, 'T_VC', estimate_percent => null, method_opt => 'for all columns size 254');

explain plan for select count(*) from t_vc where n1=10 and n2=10;

----------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time               |
----------------------------------------------------
| 0  | SELECT STATEMENT |        | 100 | 1200 | 9 (0)| 00:00:01          |
|* 1 | TABLE ACCESS FULL| T_VC   | 100 | 1200 | 9 (0)| 00:00:01
----------------------------------------------------

At last, in 11g, the optimizer has truly understood the correlation.

Under the Hood

1. Adding an extended stats adds a new virtual column to the table. Here is the line from sqltrace.

alter table "CBQT"."T_VC" add (SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
       as (sys_op_combined_hash(n1, n2)) virtual BY USER for statistics);

The virtual column name is cryptic — it seems to have been derived from table_name, column_name combinations. This is why we reanalyzed the table.

2. A new deterministic hash function, sys_op_combined_hash, is called by optimizer to populate this virtual column values. This deterministic function returns same value for unique combination of arguments passed.

col h1 format 99999999999999999999999999
select sys_op_combined_hash  (1,1) h1 from dual;
H1
----------------------------
7026129190895635777

select sys_op_combined_hash  (1,2) h1 from dual;
H1
----------------------------
298332787864732998

Collecting histograms on all columns collects histograms on this virtual column also.

Trace Lines

Using the above histogram, the CBO is able to find that there is a strong correlation between these two columns. This is visible in the 10053 output.

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T_VC[T_VC]
  Column (#1):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  Column (#2):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  Column (#5):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2     CorStregth: 100.00
  ColGroup Usage:: PredCnt: 2  Matches Full: #0  Partial:  Sel: 0.0100
  Table: T_VC  Alias: T_VC
    Card: Original: 10001.000000  Rounded: 100  Computed: 100.00  Non Adjusted: 100.00
  Access Path: TableScan
    Cost:  9.11  Resp: 9.11  Degree: 0
      Cost_io: 9.00  Cost_cpu: 2404620
      Resp_io: 9.00  Resp_cpu: 2404620
  Best:: AccessPath: TableScan
         Cost: 9.11  Degree: 1  Resp: 9.11  Card: 100.00  Bytes: 0

Notice the ColGroup line and CorStrength field above. It is set to 100. CorStrength is calculated using histograms for the virtual column and final cardinality estimates are multiplied by CorStrength.

# of rows~= total # of rows * (1/NDV for n1) * (1/NDV for n2)*corStrength
= 10000 * (1/100) * (1/100)*100 =100 rows.

The cardinality estimates match reality, at last. In the next installment, I will discuss this further.

Okay, it’s time to introduce myself. I specialize in Oracle performance tuning, Oracle internals, and the E-Business Suite. I have over 15 years of experience as an Oracle DBA. I am regular presenter in major conferences such as HOTSOS, UKOUG, RMOUG, etc., and I am also an Oak Table member. Some of my papers can be found on my personal blog: my papers and presentations.

Insight and analysis of technology and business strategy

Multi-Column Correlation and Extended Stats in Oracle 11g

Extended Stats

Under the Hood

Trace Lines

Top Categories

Tell us how we can help!