Oracle Parallel Query Hints – Part 1: PQ_DISTRIBUTE

Posted in: Oracle, Technical Track

One of the most powerful features of the Oracle database is its ability to leverage multiple cores spread across many RAC servers to execute a single query. Oracle’s optimizer tries its best to execute queries in the most efficient manner, but often the best distribution of data is not obvious from the statistics alone – and some hints are needed to help out.

The most important hint is PQ_DISTRIBUTE, but it's also the trickiest to use, as it depends on so many other hints. The Oracle documentation offers the bare minimum on usage, and there are no "Oracle Support" articles that help. For something so complex and important, it's a rather big miss from Oracle: there should be examples, use cases, etc. But then again, Microsoft's SQL Server is not any better at describing the control knobs for parallel query execution. So we have to rely on the community for those.

Let's start with the basics (from the docs):

There are two distinct use cases that are covered by the same hint. I am sure there's a reason for that, but it would have been much clearer for everyone if there were two different hints.

Use case #1 (“distribution” case): To control the reshuffling of data during parallel DML and direct path loads.
Use case #2 (“outer_distribution, inner_distribution”): To control the reshuffling of data for table joins.

Those two are very distinct cases, even though they both involve re-shuffling data between parallel query processes.
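Before diving into the plans, the two distribution styles a join can use are worth sketching. This is a toy Python model, not Oracle internals; the DOP and row counts are made-up illustration values.

```python
# Toy model of the two main row-distribution styles a PQ join can use.
# Illustration only (not Oracle internals); DOP and row counts are made up.
DOP = 8

def hash_distribute(rows, key, dop=DOP):
    """HASH: each row goes to exactly one PQ slave, picked by the join key."""
    slaves = [[] for _ in range(dop)]
    for row in rows:
        slaves[hash(row[key]) % dop].append(row)
    return slaves

def broadcast_distribute(rows, dop=DOP):
    """BROADCAST: every PQ slave gets a full copy of the row source."""
    return [list(rows) for _ in range(dop)]

rows = [{"id": i} for i in range(100)]
hashed = hash_distribute(rows, "id")
bcast = broadcast_distribute(rows)

# HASH ships each row once (100 total); BROADCAST ships DOP copies (800).
print(sum(len(s) for s in hashed), sum(len(s) for s in bcast))  # prints 100 800
```

The whole point of the hint is choosing which of these two styles each join input gets.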

I will cover Case #2 in this blog, as it’s easier to demonstrate and test. The DML cases will be in a future blog in this series.

First, the documentation is very ambiguous about what counts as the "outer" and what counts as the "inner" distribution. Basically, the "inner" table is the one mentioned in the PQ_DISTRIBUTE hint, and the "outer" is the "current result set" built so far.

In this case – one query joining two tables that are identical in content – neither is an obvious broadcast candidate, so the default is hash distribution for the join, which unfortunately results in a BUFFERED join. This means the result set will be "staged" before it is returned to the application, causing significant extra work.


select /*+PARALLEL(8) */* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id


----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |            |    100K|   105M|   638   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                 |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10002   |    100K|   105M|   638   (0)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED           |            |    100K|   105M|   638   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE                  |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH               | :TQ10000   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR         |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL| TLARGE     |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |     PX RECEIVE                  |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |      PX SEND HASH               | :TQ10001   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  10 |       PX BLOCK ITERATOR         |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  11 |        TABLE ACCESS STORAGE FULL| TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------
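To see why BUFFERED matters, here is a toy contrast between a streaming hash join and a buffered one that must stage the entire result before any row reaches the consumer. This is a conceptual Python sketch, not Oracle's implementation.

```python
# Conceptual sketch (not Oracle internals) of streaming vs BUFFERED joins.
def streaming_join(build, probe, key):
    built = {}
    for r in build:                      # build the in-memory hash table
        built.setdefault(r[key], []).append(r)
    for r in probe:                      # rows flow to the consumer as found
        for m in built.get(r[key], []):
            yield {**m, **r}

def buffered_join(build, probe, key):
    # A buffered join cannot emit rows as they are produced; the whole
    # result is staged first -- that staging is the extra work.
    staged = list(streaming_join(build, probe, key))
    yield from staged

t1 = [{"id": i} for i in range(3)]
t2 = [{"id": i} for i in range(3)]
# Same rows come out either way; only when they come out differs.
assert list(streaming_join(t1, t2, "id")) == list(buffered_join(t1, t2, "id"))
```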

Even if we add a filter condition to one of the tables that significantly reduces the row count (mod5_id=1 cuts the data by 5x), Oracle still uses the HASH HASH distribution.

The NO_PX_JOIN_FILTER hint will be explained in a future blog of the series. Here it only helps to simplify the examples.

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) */* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
where t1.mod5_id=1

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |            |  20000 |    21M|   638   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                 |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10002   |  20000 |    21M|   638   (0)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED           |            |  20000 |    21M|   638   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE                  |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH               | :TQ10000   |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR         |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|*  7 |        TABLE ACCESS STORAGE FULL| TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |     PX RECEIVE                  |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |      PX SEND HASH               | :TQ10001   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  10 |       PX BLOCK ITERATOR         |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  11 |        TABLE ACCESS STORAGE FULL| TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------

And now let's switch the distribution method to BROADCAST.
Note 1: the NO_PX_JOIN_FILTER hint is again there only to simplify the example and will be discussed further in the series.
Note 2: the NO_PQ_REPLICATE hint is used to demonstrate what is happening behind the scenes, and is explained at the end of the blog.

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T2) PQ_DISTRIBUTE(T2 BROADCAST NONE)*/* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
where t1.mod5_id=1
;

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |            |  20000 |    21M|   638   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                 |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10001   |  20000 |    21M|   638   (0)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                    |            |  20000 |    21M|   638   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                  |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST          | :TQ10000   |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR         |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|*  7 |        TABLE ACCESS STORAGE FULL| TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR           |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|   9 |      TABLE ACCESS STORAGE FULL  | TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------

Notice the BUFFERED hash join is gone, and now we have a broadcast operator. This plan is much more efficient, but consumes a bit more memory to execute. Each of the 8 parallel processes has its own copy of the TLARGE table, but only AFTER filtering on mod5_id. You can see this is estimated at 10 MB * 8 processes = 80 MB total.
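The memory estimate is just the per-copy size multiplied by the DOP; a quick back-of-the-envelope check using the values from the plan above:

```python
# Back-of-the-envelope check of the broadcast memory cost, using the
# optimizer estimates from the plan above (E-Bytes of the filtered TLARGE).
dop = 8
filtered_tlarge_mb = 10              # ~10 MB after the mod5_id = 1 filter
total_mb = filtered_tlarge_mb * dop  # one full copy per PQ process
print(total_mb)                      # prints 80
```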

Back on topic – the PQ_DISTRIBUTE hint here says the following: when you join the T2 table, take the current result set and broadcast it to all the processes that are about to read T2.

If you want to switch things around and have the TLARGE_TWO table broadcast instead, producing a similar plan, you need a lot more hints.

The query looks like this:

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T1) PQ_DISTRIBUTE(T1 BROADCAST NONE) LEADING(t2 t1)*/* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
where t1.mod5_id=1
;
----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |            |  20000 |    21M|   638   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                 |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10001   |  20000 |    21M|   638   (0)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                    |            |  20000 |    21M|   638   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                  |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST          | :TQ10000   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR         |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL| TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR           |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|*  9 |      TABLE ACCESS STORAGE FULL  | TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------

Explanation:

1. You need to change the join order so that T1 is joined SECOND. If it is first, then T1 is never "joined" to anything – it is the starting row source.
2. You need to change the PQ_DISTRIBUTE hint so that it carries instructions on what to do when joining T1 to the current result set.

Now, interestingly, you can do the exact same thing in a slightly different fashion. See the query below:

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T2) PQ_DISTRIBUTE(T2 NONE BROADCAST ) LEADING(t1 t2) SWAP_JOIN_INPUTS(t2)*/* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
where t1.mod5_id=1
;

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |            |  20000 |    21M|   638   (0)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                 |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10001   |  20000 |    21M|   638   (0)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                    |            |  20000 |    21M|   638   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                  |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND BROADCAST          | :TQ10000   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR         |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL| TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |     PX BLOCK ITERATOR           |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|*  9 |      TABLE ACCESS STORAGE FULL  | TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------

You will notice that the plan is identical. However, the way we reach it is very different and very important to understand. This one goes like this:

1. The join order remains the same – T2 is joined to the solo result set of T1.
2. The PQ_DISTRIBUTE hint now has "NONE BROADCAST", which says: when you join T2, do nothing with the current result set, but broadcast T2 to every process.
3. SWAP_JOIN_INPUTS instructs the optimizer to reverse which table is hashed into memory and which table is then scanned to probe the in-memory table.

What's important to understand is that with two tables the end result is the same for these different approaches, because "the current result set" is just the other table. But with three or more tables, "the current result set" can itself be two tables already joined.
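The SWAP_JOIN_INPUTS mechanics can be sketched with a toy build/probe hash join: swapping the inputs changes which side is held in memory, not the rows that come out. Illustrative Python, not Oracle internals.

```python
# Toy build/probe hash join: SWAP_JOIN_INPUTS only flips which input is
# hashed into memory (build side) and which is streamed (probe side).
# The joined rows are the same either way. Illustration, not Oracle code.
def hash_join(build, probe, key):
    built = {}
    for row in build:                    # build side: held in memory
        built.setdefault(row[key], []).append(row)
    out = []
    for row in probe:                    # probe side: streamed through
        for match in built.get(row[key], []):
            out.append((match[key],))
    return out

t1 = [{"id": i} for i in range(5)]
t2 = [{"id": i} for i in range(5)]
# Swapped build/probe inputs produce the same result set.
assert sorted(hash_join(t1, t2, "id")) == sorted(hash_join(t2, t1, "id"))
```

Which side you want in memory is a cost decision: you normally build on the smaller input.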

Let’s illustrate with a query:

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T2) */* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
join tsmall ts on ts.id = t1.id
where t1.mod5_id=1
;

-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |            |  10000 |    15M|   671   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                  |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10003   |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED            |            |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|*  4 |     HASH JOIN                    |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|   5 |      PX RECEIVE                  |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|   6 |       PX SEND HASH               | :TQ10000   |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   7 |        PX BLOCK ITERATOR         |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| TSMALL     |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   9 |      PX RECEIVE                  |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  10 |       PX SEND HASH               | :TQ10001   |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  11 |        PX BLOCK ITERATOR         |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|* 12 |         TABLE ACCESS STORAGE FULL| TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  13 |     PX RECEIVE                   |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |      PX SEND HASH                | :TQ10002   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | P->P | HASH       |
|  15 |       PX BLOCK ITERATOR          |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWC |            |
|  16 |        TABLE ACCESS STORAGE FULL | TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------

With the defaults, Oracle chooses HASH HASH for each join. We have only one BUFFERED hash join because the join conditions are on the same column; if they were even slightly different, there would be two BUFFERED hash joins and two re-shuffles.

I’ve fabricated this by just adding a “+0” to the join condition:

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T2) */* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
join tsmall ts on ts.id = t1.id+0
where t1.mod5_id=1
;

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |            |  10000 |    15M|   671   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                    |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10004   |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,04 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED              |            |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|   4 |     PX RECEIVE                     |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,04 | PCWP |            |
|   5 |      PX SEND HASH                  | :TQ10002   |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,02 | P->P | HASH       |
|*  6 |       HASH JOIN BUFFERED           |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   7 |        PX RECEIVE                  |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   8 |         PX SEND HASH               | :TQ10000   |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   9 |          PX BLOCK ITERATOR         |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS STORAGE FULL| TSMALL     |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |        PX RECEIVE                  |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  12 |         PX SEND HASH               | :TQ10001   |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  13 |          PX BLOCK ITERATOR         |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|* 14 |           TABLE ACCESS STORAGE FULL| TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  15 |     PX RECEIVE                     |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,04 | PCWP |            |
|  16 |      PX SEND HASH                  | :TQ10003   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,03 | P->P | HASH       |
|  17 |       PX BLOCK ITERATOR            |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,03 | PCWC |            |
|  18 |        TABLE ACCESS STORAGE FULL   | TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,03 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------

The extra re-shuffling is at step 5 – where the PX SEND HASH is after the HASH JOIN BUFFERED (going up the tree).

The reason there is no extra re-shuffling in the original case is that "the current result set" is already distributed to each PQ process by the join key, so only "new" tables being joined need to be re-distributed. Let's continue to explore the original 3-table case.
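This rests on hash distribution being deterministic: the same key value always maps to the same slave, so rows of a newly joined table land exactly where their join partners already sit. A toy check (made-up DOP and key range):

```python
# Hash distribution is deterministic: the same join-key value always maps
# to the same PQ slave. Once the current result set is hash-distributed on
# the key, only the NEW table's rows need shipping. Toy values only.
DOP = 8
slave_of = lambda key: hash(key) % DOP

# Current result set, already hash-distributed across the slaves:
slaves = [set() for _ in range(DOP)]
for k in range(50):
    slaves[slave_of(k)].add(k)

# Distributing the NEW table by the same key sends every row straight to
# the slave that already holds its join partners -- no re-shuffle needed.
assert all(k in slaves[slave_of(k)] for k in range(50))
```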

Let’s try to introduce BROADCAST there. The JOIN order is: TS, T1, T2

The logical thing to do is to formulate PQ_DISTRIBUTE as follows: when you join T1, please broadcast it. Unfortunately, this results in the following:

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T1) NO_PQ_REPLICATE(T2) PQ_DISTRIBUTE(T1 NONE BROADCAST) */* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
join tsmall ts on ts.id = t1.id
where t1.mod5_id=1
;

-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |            |  10000 |    15M|   671   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                    |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10003   |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED              |            |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|   4 |     PX RECEIVE                     |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|   5 |      PX SEND HASH                  | :TQ10001   |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,01 | P->P | HASH       |
|*  6 |       HASH JOIN BUFFERED           |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX BLOCK ITERATOR           |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL  | TSMALL     |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   9 |        PX RECEIVE                  |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  10 |         PX SEND BROADCAST          | :TQ10000   |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|  11 |          PX BLOCK ITERATOR         |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|* 12 |           TABLE ACCESS STORAGE FULL| TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  13 |     PX RECEIVE                     |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |      PX SEND HASH                  | :TQ10002   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | P->P | HASH       |
|  15 |       PX BLOCK ITERATOR            |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWC |            |
|  16 |        TABLE ACCESS STORAGE FULL   | TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------

This is a rather strange plan, and what it actually does in steps 6-11 (the TS and T1 join steps) is very inefficient. It's worth explaining in order to better understand parallel execution.

The join order is TS, T1, T2

1. The query starts by reading TS and preparing it to join to T1. One set of PQ processes reads the table in chunks (Step 7: PX BLOCK ITERATOR) and builds an in-memory hash table keyed on the join column.
2. A different set of PQ processes reads the T1 table in chunks (Step 11: PX BLOCK ITERATOR) and broadcasts the rows to every slave in the first set.
3. As the slaves in the first set receive records, they probe them against the in-memory hash table and stage the matches (Step 6: HASH JOIN BUFFERED).

And so on.

You can already see this is not the outcome we wanted. In the previous two-table examples, we actually used PQ_DISTRIBUTE(T1 BROADCAST NONE) – with much better results. That formulation is conceptually wrong, as the hint actually says "take the current result set and broadcast it to all processes", but it happens to work for two-table joins.

So let's try that:

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T1) NO_PQ_REPLICATE(T2) PQ_DISTRIBUTE(T1 BROADCAST NONE ) */* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
join tsmall ts on ts.id = t1.id
where t1.mod5_id=1
;
-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |            |  10000 |    15M|   671   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                    |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10003   |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,03 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED              |            |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|   4 |     PX RECEIVE                     |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,03 | PCWP |            |
|   5 |      PX SEND HASH                  | :TQ10001   |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,01 | P->P | HASH       |
|*  6 |       HASH JOIN                    |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX RECEIVE                  |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   8 |         PX SEND BROADCAST          | :TQ10000   |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   9 |          PX BLOCK ITERATOR         |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS STORAGE FULL| TSMALL     |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |        PX BLOCK ITERATOR           |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|* 12 |         TABLE ACCESS STORAGE FULL  | TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  13 |     PX RECEIVE                     |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,03 | PCWP |            |
|  14 |      PX SEND HASH                  | :TQ10002   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | P->P | HASH       |
|  15 |       PX BLOCK ITERATOR            |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWC |            |
|  16 |        TABLE ACCESS STORAGE FULL   | TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------

Instant success – we have the plan we want for the join of TS and T1. Now let's apply the same approach to T2 (TLARGE_TWO):

select /*+PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T1) NO_PQ_REPLICATE(T2) PQ_DISTRIBUTE(T1 BROADCAST NONE ) PQ_DISTRIBUTE(T2 BROADCAST NONE ) */* 
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
join tsmall ts on ts.id = t1.id
where t1.mod5_id=1
;
-------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                          | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
-------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                   |            |  10000 |    15M|   671   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                    |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)              | :TQ10002   |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                       |            |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE                     |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   5 |      PX SEND BROADCAST             | :TQ10001   |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,01 | P->P | BROADCAST  |
|*  6 |       HASH JOIN                    |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   7 |        PX RECEIVE                  |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   8 |         PX SEND BROADCAST          | :TQ10000   |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   9 |          PX BLOCK ITERATOR         |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|  10 |           TABLE ACCESS STORAGE FULL| TSMALL     |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|  11 |        PX BLOCK ITERATOR           |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|* 12 |         TABLE ACCESS STORAGE FULL  | TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  13 |     PX BLOCK ITERATOR              |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWC |            |
|  14 |      TABLE ACCESS STORAGE FULL     | TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
-------------------------------------------------------------------------------------------------------------------------------

It appears we have excellent results – no HASH JOIN BUFFERED, and we have the BROADCAST – but unfortunately it's far from perfect. In this case the result of the join between TS (TSMALL) and T1 (TLARGE) is broadcast to every process (Step 5: PX SEND BROADCAST). If the join produces many records, a lot of data will be broadcast. And with each additional table, if the number of records grows, the amount of data broadcast just keeps increasing.
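The growth warned about here is easy to quantify with a rough traffic model built from the plan's own estimates (illustration only):

```python
# Rough traffic model for step 5 above: broadcasting the intermediate
# TS x T1 result copies every row to every slave, whereas a HASH
# distribution sends each row exactly once. E-Rows and DOP from the plan.
dop = 8
intermediate_rows = 10_000               # E-Rows of the TS x T1 join
broadcast_rows_sent = intermediate_rows * dop
hash_rows_sent = intermediate_rows
print(broadcast_rows_sent, hash_rows_sent)   # prints 80000 10000
```

Every further join that inflates the intermediate row count multiplies this broadcast traffic again.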

Now, if we use the OTHER method – with SWAP_JOIN_INPUTS and the PROPER use of the PQ_DISTRIBUTE hint – the query looks like this:

select /*+ PARALLEL(8) NO_PX_JOIN_FILTER(T2) NO_PQ_REPLICATE(T1) NO_PQ_REPLICATE(T2)
           LEADING(TS, T1, T2)
           PQ_DISTRIBUTE(T1 NONE BROADCAST) SWAP_JOIN_INPUTS(T1)
           PQ_DISTRIBUTE(T2 NONE BROADCAST) SWAP_JOIN_INPUTS(T2)
       */ *
from tlarge t1
join tlarge_two t2 on t2.id = t1.id
join tsmall ts on ts.id = t1.id
where t1.mod5_id = 1
;
-----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name       | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
-----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |            |  10000 |    15M|   671   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                  |            |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10002   |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                     |            |  10000 |    15M|   671   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE                   |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   5 |      PX SEND BROADCAST           | :TQ10000   |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|   6 |       PX BLOCK ITERATOR          |            |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL | TLARGE_TWO |    100K|    52M|   319   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|*  8 |     HASH JOIN                    |            |  10000 |    10M|   352   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |      PX RECEIVE                  |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|  10 |       PX SEND BROADCAST          | :TQ10001   |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | P->P | BROADCAST  |
|  11 |        PX BLOCK ITERATOR         |            |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|* 12 |         TABLE ACCESS STORAGE FULL| TLARGE     |  20000 |    10M|   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|  13 |      PX BLOCK ITERATOR           |            |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,02 | PCWC |            |
|  14 |       TABLE ACCESS STORAGE FULL  | TSMALL     |  10000 |  5390K|    32   (0)| 00:00:01 |  Q1,02 | PCWP |            |
-----------------------------------------------------------------------------------------------------------------------------
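As a side note, the plans in this post can be pulled straight from the shared pool right after running each query. A minimal sketch – the format string here is just one reasonable choice:

```sql
-- Fetch the execution plan of the last statement run in this session,
-- including the TQ / IN-OUT / PQ Distrib columns shown above.
select *
from table(dbms_xplan.display_cursor(null, null, 'TYPICAL +PARALLEL'));
```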

Now this plan looks much better. Also notice how I arranged the hints for each table and forced a specific join order – I don't want the optimizer to make any changes to the join order.

Let me run through the steps.

1. The “A” set of PQ processes starts by reading the T2 (TLARGE_TWO) table in chunks (Plan Step 6: PX BLOCK ITERATOR) and sends all records to THE OTHER set of PQ processes. This is correct – we start by reading the LAST table to be joined. You can tell which tables are processed first by looking at the in-memory table queue names (:TQ10000 is first, :TQ10001 second, etc.).
2. The “B” set of PQ processes receives the data (Plan Step 4: PX RECEIVE) and builds an in-memory hash table.
3. The same “A” set of PQ processes from Step 1 – the ones that read T2 – are now free, so they read the T1 (TLARGE) table in chunks (Plan Step 11: PX BLOCK ITERATOR) and broadcast it to the SAME “B” set of processes as in Step 2, which builds a second in-memory hash table for T1. The “B” set now has both T2 and T1 in memory.
4. The “B” set of PQ processes then reads the TS (TSMALL) table in chunks (Plan Step 13: PX BLOCK ITERATOR), probes in turn the T1 (Plan Step 8: HASH JOIN) and T2 (Plan Step 3: HASH JOIN) hash tables, and returns the output to the end user.
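One way to verify which PQ process set sent and received what is the V$PQ_TQSTAT view – an optional check, assuming you query it in the same session immediately after the statement completes:

```sql
-- Rows and bytes moved through each table queue (:TQ10000, :TQ10001, ...)
-- per producer/consumer PX process, for the last parallel statement.
select dfo_number, tq_id, server_type, process, num_rows, bytes
from v$pq_tqstat
order by dfo_number, tq_id, server_type desc, process;
```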

The important thing here is that the result set is not broadcast over and over; instead, each table being added to the result set is broadcast once.

The PQ_REPLICATE hint (and NO_PQ_REPLICATE) are discussed in Part 2 of the series.
The NO_PX_JOIN_FILTER hint will be discussed in a future blog.

In Conclusion
The PQ_DISTRIBUTE hint and HASH JOINs can be tricky to use, as they depend very much on the join order and the SWAP_JOIN_INPUTS functionality. There is a “correct” and an “incorrect” way to do broadcasts with hash joins, and the incorrect way is only equivalent to the correct one when just 2 tables are involved.

I hope this was clear, and I urge you to experiment – inflate the test tables further by increasing the record counts and observe the resource usage. Obviously the TSMALL table should be the one hashed rather than joined last, but the idea here is to illustrate how to control the execution to match exactly what you believe is the best approach.

All 3 test tables are built the same way, just with different numbers of rows, via this script:

create table tlarge_two as
select rownum id,
       mod(rownum,5) mod5_id,
       mod(rownum,5000) mod5000_id,
       sysdate dt_fixed,
       sysdate - rownum/24/60 dt_dec,
       sysdate + rownum/24/60 dt_pos,
       sysdate + ora_hash(rownum,65,535)/24 dt_rand,
       sysdate + mod(rownum,10) dt_mod10,
       rpad('x',500,'x') filler
from (select rownum r from dual connect by level <= 10000) r1,
     (select rownum r from dual connect by level <= 10) r2
;

TLARGE and TLARGE_TWO are built with the script above.
TSMALL: replace “level <= 10” with “level <= 1”.
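Spelled out, the TSMALL version of the script looks like this – same columns, with the multiplier row source reduced to a single row:

```sql
-- TSMALL: identical structure, 10,000 rows instead of 100,000.
create table tsmall as
select rownum id,
       mod(rownum,5) mod5_id,
       mod(rownum,5000) mod5000_id,
       sysdate dt_fixed,
       sysdate - rownum/24/60 dt_dec,
       sysdate + rownum/24/60 dt_pos,
       sysdate + ora_hash(rownum,65,535)/24 dt_rand,
       sysdate + mod(rownum,10) dt_mod10,
       rpad('x',500,'x') filler
from (select rownum r from dual connect by level <= 10000) r1,
     (select rownum r from dual connect by level <= 1) r2
;
```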



About the Author

An Oracle ACE with a deep understanding of databases, application memory, and input/output interactions, Christo is an expert at optimizing the performance of the most complex infrastructures. Methodical and efficiency-oriented, he equates the role of an ATCG Principal Consultant in many ways to that of a data analyst: both require a rigorous sifting-through of information to identify solutions to often large and complex problems. A dynamic speaker, Christo has delivered presentations at the IOUG, the UKOUG, the Rocky Mountain Oracle Users Group, Oracle Open World, and other industry conferences.
