Multi-Column Correlation and Extended Stats in Oracle 11g

Posted in: Technical Track

We all have encountered this situation many times before: the cost-based optimizer assumes no correlation between two columns (until 11g), and this has the effect of erroneously reducing the cardinality of a row source. Incorrect cardinality estimates are one of many root causes for SQL performance issues. Consider the following example.

This code creates a table and populates data with four columns:

create table t_vc as
select mod(n, 100) n1, mod(n, 100) n2 ,
mod(n, 50) n3 , mod(n, 20) n4
from (select level n from dual connect by level <= 10001);

The first two columns, n1 and n2, have a strong correlation: n1 is always equals to n2.

Let’s collect statistics with histograms on all columns.

begin
dbms_stats.gather_Table_stats( user, 'T_VC', estimate_percent => null, method_opt => 'for all columns size 254');
end;
/

Let’s explain the plan for the query, specifying one predicate, exactly 100 rows with a value of 10.

explain plan for select count(*) from t_vc where n1=10;

----------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     3 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     3 |            |          |
|*  2 |   TABLE ACCESS FULL| T_VC |   100 |   300 |     9   (0)| 00:00:01 |
----------------------------------------------------

Yes, the optimizer is able to estimate the cardinality as 100 from row source T_VC at step 2.

Now, let’s add another predicate, n2=10. Since there is a strong correlation between n1 and n2, adding a predicate such as n2=10 should not alter row source cardinality, right?

explain plan for select count(*) from t_vc where n1=10 and n2=10;

select * from table(dbms_xplan.display);

----------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |

----------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     6 |     9   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_VC |     1 |     6 |     9   (0)| 00:00:01 |
----------------------------------------------------

Nope, the optimizer cardinality is far off and 100 times lower than the correct cardinality.

Notice the rows column at step 2 above and the CBO estimate is just 1. Essentially, the CBO estimate boils down to
a simplified formula:

# of rows~= total # of rows * (1/NDV for n1) * (1/NDV for n2)
= 10000 * (1/100) * (1/100) =1 row.

where selectivity is (1/NDV) in simple case, without considering histograms. It is far from the truth. There are at least 100 rows with n1=10 and n2=10. In complex SQL, this incorrect cardinality estimation will lead to an inefficient access plan.

Extended Stats

Oracle 11g introduces extended stats to relieve some pain. In 11g, extended stats can be added between columns, enabling the CBO to consider correlation between these column values.

SELECT dbms_stats.create_extended_stats(
ownname=>user, tabname => 'T_VC',
extension => '(n1, n2)' ) AS n1_n2_correlation
FROM dual;
N1_n2_correlation
---------------------------------------------------
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS

Let’s collect stats again on this table and check the SQL plan.

exec dbms_stats.gather_Table_stats( user, 'T_VC', estimate_percent => null, method_opt => 'for all columns size 254');

explain plan for select count(*) from t_vc where n1=10 and n2=10;

----------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time               |
----------------------------------------------------
| 0  | SELECT STATEMENT |        | 100 | 1200 | 9 (0)| 00:00:01          |
|* 1 | TABLE ACCESS FULL| T_VC   | 100 | 1200 | 9 (0)| 00:00:01
----------------------------------------------------

At last, in 11g, the optimizer has truly understood the correlation.

Under the Hood

1. Adding an extended stats adds a new virtual column to the table. Here is the line from sqltrace.

alter table "CBQT"."T_VC" add (SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
       as (sys_op_combined_hash(n1, n2)) virtual BY USER for statistics);

The virtual column name is cryptic — it seems to have been derived from table_name, column_name combinations. This is why we reanalyzed the table.

2. A new deterministic hash function, sys_op_combined_hash, is called by optimizer to populate this virtual column values. This deterministic function returns same value for unique combination of arguments passed.

col h1 format 99999999999999999999999999
select sys_op_combined_hash  (1,1) h1 from dual;
H1
----------------------------
7026129190895635777

select sys_op_combined_hash  (1,2) h1 from dual;
H1
----------------------------
298332787864732998

Collecting histograms on all columns collects histograms on this virtual column also.

Trace Lines

Using the above histogram, the CBO is able to find that there is a strong correlation between these two columns. This is visible in the 10053 output.

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T_VC[T_VC]
  Column (#1):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  Column (#2):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  Column (#5):
    NewDensity:0.005000, OldDensity:0.000050 BktCnt:10001, PopBktCnt:10001, PopValCnt:100, NDV:100
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2     CorStregth: 100.00
  ColGroup Usage:: PredCnt: 2  Matches Full: #0  Partial:  Sel: 0.0100
  Table: T_VC  Alias: T_VC
    Card: Original: 10001.000000  Rounded: 100  Computed: 100.00  Non Adjusted: 100.00
  Access Path: TableScan
    Cost:  9.11  Resp: 9.11  Degree: 0
      Cost_io: 9.00  Cost_cpu: 2404620
      Resp_io: 9.00  Resp_cpu: 2404620
  Best:: AccessPath: TableScan
         Cost: 9.11  Degree: 1  Resp: 9.11  Card: 100.00  Bytes: 0

Notice the ColGroup line and CorStrength field above. It is set to 100. CorStrength is calculated using histograms for the virtual column and final cardinality estimates are multiplied by CorStrength.

# of rows~= total # of rows * (1/NDV for n1) * (1/NDV for n2)*corStrength
= 10000 * (1/100) * (1/100)*100 =100 rows.

The cardinality estimates match reality, at last. In the next installment, I will discuss this further.

Okay, it’s time to introduce myself. I specialize in Oracle performance tuning, Oracle internals, and the E-Business Suite. I have over 15 years of experience as an Oracle DBA. I am regular presenter in major conferences such as HOTSOS, UKOUG, RMOUG, etc., and I am also an Oak Table member. Some of my papers can be found on my personal blog: my papers and presentations.

email
Want to talk with an expert? Schedule a call with our team to get the conversation started.

8 Comments. Leave new

Log Buffer #92: a Carnival of the Vanities for DBAs
April 11, 2008 11:45 am

[…] A new blogger made his first appearance here this week: Riyaj Shamsudeen, with his item on multi-column correlation and extended stats in Oracle 11g. […]

Reply
Alex Fatkulin
April 15, 2008 3:47 pm

Nice post!

I would like to add that in 10G you still can battle correlation problems using higher levels of dynamic sampling (to certain degree).

In 10.2.0.3 the correct cardinality was reported on level 4:

SQL> select /*+ dynamic_sampling(4) */ count(*) from t_vc where n1=10 and n2=10;

Execution Plan
———————————————————-
Plan hash value: 2872596891

—————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————
| 0 | SELECT STATEMENT | | 1 | 6 | 8 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
|* 2 | TABLE ACCESS FULL| T_VC | 100 | 600 | 8 (0)| 00:00:01 |
—————————————————————————

Predicate Information (identified by operation id):
—————————————————

2 – filter(“N1″=10 AND “N2″=10)

Note
—–
– dynamic sampling used for this statement

Reply
Oracle 11g: Multi-Column Correlation Without Extended Stats
November 24, 2008 3:55 pm

[…] database one of the problems I faced with 10g—one on those problems Riyaj described in his Multi-Column Correlation and Extended Stats in Oracle 11g post. And the fun part is that I wasn’t able to reproduce it. Yet just setting […]

Reply
Correlation, nocorrelation and extended stats « Oracle database internals by Riyaj
December 18, 2008 7:05 pm

[…] tuning — orainternals @ 12:05 am I blogged about extended stats in my earlier blog, extended stats, and also documented that as an investigation in Investigations: extended stats and multi-column […]

Reply

thanks , useful info presented succinctly

Reply

Nice Work

Reply

I’m not sure it takes CorStregth into account while calculating cardinality.
Or at least formula should be more complicated.

Let’s create a table and check a couple of queries.

create table t(id number, n number);
insert into t select mod(rownum,10), mod(rownum,5) from dual connect by level 'FOR ALL COLUMNS SIZE 254');


select count(*) from t where id = 1 and n = 1;
select count(*) from t where id = 1 and n = 2;

According to 10053 output, “CorStregth” remains the same in both cases and formula is something like “total # of rows * Sel”.


Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.0500
Table: T Alias: T
Card: Original: 1000.000000 Rounded: 50 Computed: 50.00 Non Adjusted: 50.00

Col#: 1 2 CorStregth: 5.00
ColGroup Usage:: PredCnt: 2 Matches Full: #1 Partial: Sel: 0.1000
Table: T Alias: T
Card: Original: 1000.000000 Rounded: 100 Computed: 100.00 Non Adjusted: 100.00

It’s not clear how Sel is derived though.

Reply

For some reason code was corrupted.

create table t(id number, n number);
insert into t select mod(rownum,10), mod(rownum,5) from dual connect by level 'FOR ALL COLUMNS SIZE 254');

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *