Oracle Parallel Query Hints Reference – Part 7: PQ_DISTRIBUTE and Partitioned Tables

Posted in: Technical Track

In Part 1 of the series, we covered in detail how the PQ_DISTRIBUTE can change the data distribution method across parallel query workers and the importance of avoiding BUFFERED hash joins. Through the remaining parts of the series, I showed examples of how partitioning and parallel queries work.

In this blog post, I will cover how partitioning can help when joining two large tables – especially in cases where no filtering predicate can reduce one of the input data sets.

Let’s take the original example from Part 1:

select /*+PARALLEL(8) */*
from tlarge t1
join tlarge_two t2 on t2.id=t1.id
 
------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name       | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |            |   100K|   105M|       |  1308   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR            |            |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)      | :TQ10002   |   100K|   105M|       |  1308   (1)| 00:00:01 |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN BUFFERED      |            |   100K|   105M|  6904K|  1308   (1)| 00:00:01 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE             |            |   100K|    52M|       |   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH          | :TQ10000   |   100K|    52M|       |   319   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   6 |        PX BLOCK ITERATOR   |            |   100K|    52M|       |   319   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |         TABLE ACCESS FULL  | TLARGE     |   100K|    52M|       |   319   (0)| 00:00:01 |  Q1,00 | PCWP |            |
|   8 |     PX RECEIVE             |            |   100K|    52M|       |   319   (0)| 00:00:01 |  Q1,02 | PCWP |            |
|   9 |      PX SEND HASH          | :TQ10001   |   100K|    52M|       |   319   (0)| 00:00:01 |  Q1,01 | P->P | HASH       |
|  10 |       PX BLOCK ITERATOR    |            |   100K|    52M|       |   319   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  11 |        TABLE ACCESS FULL   | TLARGE_TWO |   100K|    52M|       |   319   (0)| 00:00:01 |  Q1,01 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------

.
As you can see, we have a HASH JOIN BUFFERED operation – which means that the rows will be staged ON DISK before being passed on to the next operation. Depending on the size of the data, this can be quite disastrous, especially considering all the problems Oracle has with writing to TEMP space.

Fortunately, parallel queries work well with partitioned tables, and in the case when one of the tables is partitioned on the join key, the additional row re-shuffling can be eliminated. Consider the tlarge_p table which is a copy of the tlarge table, but hash partitioned by ID with 16 partitions:

select /*+PARALLEL(8) */* 
from tlarge t1
join tlarge_p t2 on t2.id=t1.id

--------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |   100K|   105M|       |  1313   (0)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                  |          |       |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10001 |   100K|   105M|       |  1313   (0)| 00:00:01 |       |       |  Q1,01 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                     |          |   100K|   105M|  6904K|  1313   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   4 |     PART JOIN FILTER CREATE      | :BF0000  |   100K|    52M|       |   319   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX RECEIVE                  |          |   100K|    52M|       |   319   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   6 |       PX SEND PARTITION (KEY)    | :TQ10000 |   100K|    52M|       |   319   (0)| 00:00:01 |       |       |  Q1,00 | P->P | PART (KEY) |
|   7 |        PX BLOCK ITERATOR         |          |   100K|    52M|       |   319   (0)| 00:00:01 |       |       |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS FULL        | TLARGE   |   100K|    52M|       |   319   (0)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   9 |     PX PARTITION HASH JOIN-FILTER|          |   100K|    52M|       |   325   (0)| 00:00:01 |:BF0000|:BF0000|  Q1,01 | PCWC |            |
|  10 |      TABLE ACCESS FULL           | TLARGE_P |   100K|    52M|       |   325   (0)| 00:00:01 |:BF0000|:BF0000|  Q1,01 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------------------


Now the plan looks very different. There are a number of things I want to highlight with this plan.

The HASH JOIN BUFFERED is gone – meaning that if possible, Oracle will execute this join entirely in memory. This is a major optimization.
The number of data shuffling steps is reduced. One of the PX SEND HASH operators is gone – meaning that only one of the table has its data shuffled to a set of PQ workers.
The other PX SEND HASH is replaced by PX SEND PARTITION (KEY) – meaning that data will be distributed according to table partition keys.
There is a new step in #4 – PART JOIN FILTER CREATE. This is a very interesting step, full explanation to follow.
The access pattern changes for one of tables from PX BLOCK ITERATOR (rowid range reading) to PX PARTITION HASH with a JOIN-FILTER – meaning that each PQ worker will be process an entire partition instead of a ROWID block range. JOIN-FILTER is also a very interesting feature and will be explained a bit further down.

The distribution hint for the above plan is: PQ_DISTRIBUTE(T2 PARTITION NONE)

This part can be confusing, but what the hint instructs the query optimizer to do is: When you join T2 table, take the current result set (which in this case is T1 table) and send it to the parallel workers according to the partition boundaries of T2. Then Read T2 table for the join (no re-distribution of T2).

In the case of two tables join, the hint can be expressed in two ways, achieving the same result. But when the tables joined are three or more, then the proper function of the PQ_DISTRIBUTE becomes very important. Examples of this and more are in Part 1 of the series.

This plan is much more efficient overall and will make a significant difference in performance. Typically, this can be used in data warehouse environments where one of the dimension tables is particularly large. Then the FACT table can be hash partitioned on that large dimension key. But in most cases, you would be able to partition the dimension table, as well, which would make the plan look like this:

select /*+PARALLEL(8) */* 
from tlarge_p t1
join tlarge_two_p t2 on t2.id=t1.id

---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation               | Name         | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |              |   100K|   105M|       |  1319   (1)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR         |              |       |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10000     |   100K|   105M|       |  1319   (1)| 00:00:01 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX PARTITION HASH ALL|              |   100K|   105M|       |  1319   (1)| 00:00:01 |     1 |    16 |  Q1,00 | PCWC |            |
|*  4 |     HASH JOIN           |              |   100K|   105M|  6904K|  1319   (1)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   5 |      TABLE ACCESS FULL  | TLARGE_P     |   100K|    52M|       |   325   (0)| 00:00:01 |     1 |    16 |  Q1,00 | PCWP |            |
|   6 |      TABLE ACCESS FULL  | TLARGE_TWO_P |   100K|    52M|       |   325   (0)| 00:00:01 |     1 |    16 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------------------


[/sq]

This plan has ZERO data re-shuffling. Each PQ Worker is assigned a pair of partitions and produces a joined result set. The most efficient way to do parallel joins.

Now let’s mix things up a bit - let’s add a 3rd table - tsmall:


select /*+PARALLEL(8) */* 
from tlarge t1
join tlarge_two_p t2 on t2.id=t1.id
join tsmall ts on ts.id = t1.id

----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |              | 10000 |    15M|   677   (1)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                  |              |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10003     | 10000 |    15M|   677   (1)| 00:00:01 |       |       |  Q1,03 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                     |              | 10000 |    15M|   677   (1)| 00:00:01 |       |       |  Q1,03 | PCWP |            |
|   4 |     PART JOIN FILTER CREATE      | :BF0000      | 10000 |    10M|   352   (1)| 00:00:01 |       |       |  Q1,03 | PCWP |            |
|   5 |      PX RECEIVE                  |              | 10000 |    10M|   352   (1)| 00:00:01 |       |       |  Q1,03 | PCWP |            |
|   6 |       PX SEND PARTITION (KEY)    | :TQ10002     | 10000 |    10M|   352   (1)| 00:00:01 |       |       |  Q1,02 | P->P | PART (KEY) |
|*  7 |        HASH JOIN BUFFERED        |              | 10000 |    10M|   352   (1)| 00:00:01 |       |       |  Q1,02 | PCWP |            |
|   8 |         PX RECEIVE               |              | 10000 |  5390K|    32   (0)| 00:00:01 |       |       |  Q1,02 | PCWP |            |
|   9 |          PX SEND HASH            | :TQ10000     | 10000 |  5390K|    32   (0)| 00:00:01 |       |       |  Q1,00 | P->P | HASH       |
|  10 |           PX BLOCK ITERATOR      |              | 10000 |  5390K|    32   (0)| 00:00:01 |       |       |  Q1,00 | PCWC |            |
|  11 |            TABLE ACCESS FULL     | TSMALL       | 10000 |  5390K|    32   (0)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|  12 |         PX RECEIVE               |              |   100K|    52M|   319   (0)| 00:00:01 |       |       |  Q1,02 | PCWP |            |
|  13 |          PX SEND HASH            | :TQ10001     |   100K|    52M|   319   (0)| 00:00:01 |       |       |  Q1,01 | P->P | HASH       |
|  14 |           PX BLOCK ITERATOR      |              |   100K|    52M|   319   (0)| 00:00:01 |       |       |  Q1,01 | PCWC |            |
|  15 |            TABLE ACCESS FULL     | TLARGE       |   100K|    52M|   319   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|  16 |     PX PARTITION HASH JOIN-FILTER|              |   100K|    52M|   325   (0)| 00:00:01 |:BF0000|:BF0000|  Q1,03 | PCWC |            |
|  17 |      TABLE ACCESS FULL           | TLARGE_TWO_P |   100K|    52M|   325   (0)| 00:00:01 |:BF0000|:BF0000|  Q1,03 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------------------


The addition of the small table has now “broken” our execution plan and it is using a BUFFERED HASH JOIN. This is the most classical example – where you have a good plan, but adding one small join breaks the plan to something incredibly slow due to the added on disk stage step.

Attempting to use PQ_DISTRIBUTE(ts BROADCAST NONE) or PQ_DISTRIBUTE(ts NONE BROADCAST) will not result in any change of the plan, because the TS is not considered joined as it is the starting table.

Armed with the knowledge from the Part 1 blog, fully understanding the hint, we can now write the necessary hints to eliminate the buffered join:

select /*+PARALLEL(8) LEADING (TS, T1, T2) PQ_DISTRIBUTE(t1 BROADCAST NONE) NO_PQ_REPLICATE(t1)*/* 
from tlarge t1
join tlarge_two_p t2 on t2.id=t1.id
join tsmall ts on ts.id = t1.id
;

----------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |              | 10000 |    15M|   677   (1)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                  |              |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10002     | 10000 |    15M|   677   (1)| 00:00:01 |       |       |  Q1,02 | P->S | QC (RAND)  |
|*  3 |    HASH JOIN                     |              | 10000 |    15M|   677   (1)| 00:00:01 |       |       |  Q1,02 | PCWP |            |
|   4 |     PART JOIN FILTER CREATE      | :BF0000      | 10000 |    10M|   352   (1)| 00:00:01 |       |       |  Q1,02 | PCWP |            |
|   5 |      PX RECEIVE                  |              | 10000 |    10M|   352   (1)| 00:00:01 |       |       |  Q1,02 | PCWP |            |
|   6 |       PX SEND PARTITION (KEY)    | :TQ10001     | 10000 |    10M|   352   (1)| 00:00:01 |       |       |  Q1,01 | P->P | PART (KEY) |
|*  7 |        HASH JOIN                 |              | 10000 |    10M|   352   (1)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   8 |         PX RECEIVE               |              | 10000 |  5390K|    32   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   9 |          PX SEND BROADCAST       | :TQ10000     | 10000 |  5390K|    32   (0)| 00:00:01 |       |       |  Q1,00 | P->P | BROADCAST  |
|  10 |           PX BLOCK ITERATOR      |              | 10000 |  5390K|    32   (0)| 00:00:01 |       |       |  Q1,00 | PCWC |            |
|  11 |            TABLE ACCESS FULL     | TSMALL       | 10000 |  5390K|    32   (0)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|  12 |         PX BLOCK ITERATOR        |              |   100K|    52M|   319   (0)| 00:00:01 |       |       |  Q1,01 | PCWC |            |
|  13 |          TABLE ACCESS FULL       | TLARGE       |   100K|    52M|   319   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|  14 |     PX PARTITION HASH JOIN-FILTER|              |   100K|    52M|   325   (0)| 00:00:01 |:BF0000|:BF0000|  Q1,02 | PCWC |            |
|  15 |      TABLE ACCESS FULL           | TLARGE_TWO_P |   100K|    52M|   325   (0)| 00:00:01 |:BF0000|:BF0000|  Q1,02 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------------------

NOTE: NO_PQ_REPLICATE hint added to illustrate clearly what is going on.

Here’s how it’s done:
We include LEADING(TS, T1, T2) hint to ensure the join order is fixed.
With a fixed join order, we know T1 will be joined to TS so we can instruct via PQ_DISTRIBUTE(t1 BROADCAST NONE) that “when we join T1 table, take the current result set (which is currently only ts) and broadcast it to all PQ workers. Then each worker read the t1 table without any redistribution”.

When it comes to joining the T2 table (which is partitioned) then we distribute the result set (which at this point includes TS and T1 joined) via the partition key and let the PQ workers join partition by partition for T2.

About PART JOIN FILTER CREATE and PX PARTITION HASH JOIN-FILTER.

This is probably one of my all-time favorite optimizations. It is incredibly powerful for ETL like data loads and is one feature that very few people know about.

When Oracle builds the in-memory HASH table as part of a hash join, Oracle uses special BLOOM FILTERS that can be used to test the presence, or more specifically the absence, of a record from a particular partition. What this means is that once the hash table is built, Oracle will know which partitions will definitely NOT have the records it just read and will skip reading these partitions in the PX PARTITION HASH JOIN-FILTER. That’s why for Pstart (partition start) and Pstop (partition stop) we have :BF0000.

This also works on RANGE partitions or combination of RANGE-HASH partitions. Why is this very important?

Let’s take one very common use case: You have a very large FACT table called ORDERS. You have it RANGE partitioned by CREATED_DATE and HASH sub-partitioned by ORDER_ID. You receive a daily file with either new orders or modified orders that is relatively large (say 100 million records).

Generally speaking, there are two options on how to merge the new records in.
1. Use an index on something like ORDER_ID and use NESTED LOOPS to perform 100 million lookups.
2. Use a HASH join and read the entire ORDERS table.

Most people go with Option #1 because they have a relatively slow IO subsystem or it’s the only option they believe they have, and this ends up taking many hours to process.

But by using the PART JOIN FILTER and PX PARTITION HASH JOIN-FILTER operations, the filter on CREATED_DATE will automatically filter out old partitions from the scanning and will only read partitions which have dates that are present in the input file / table. So if you only have orders from the last week, then only partitions from the last week will be read.

Conclusions

The PQ_DISTRIBUTE hint is very poorly documented in Oracle documentation. By understanding how the hint works, especially with three or more tables, we can control exactly how the data will be joined, which unlocks additional optimizations such as PARTTION JOIN-FILTERs. This allows us to build very fast queries, especially in ETL context.

Oracle’s Parallel Queries can span multiple servers via RAC, which gives true horizontal scaling capabilities. Without proper control of data distribution, we may end up in situations where we have the hardware and software available, but it just won’t work at full potential due to a suboptimal execution plan.

email

Interested in working with Christo? Schedule a tech call.

About the Author

An Oracle ACE with a deep understanding of databases, application memory, and input/output interactions, Christo is an expert at optimizing the performance of the most complex infrastructures. Methodical and efficiency-oriented, he equates the role of an ATCG Principal Consultant in many ways to that of a data analyst: both require a rigorous sifting-through of information to identify solutions to often large and complex problems. A dynamic speaker, Christo has delivered presentations at the IOUG, the UKOUG, the Rocky Mountain Oracle Users Group, Oracle Open World, and other industry conferences.

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *