Oracle parallel query hints reference – part 5: PQ_DISTRIBUTE_WINDOW


Welcome to part 5 of the series.

First, some background. If you want to skip straight to the reference part, scroll down.

This hint was introduced relatively recently, in Oracle 12.1, where for the first time SQL windowing functions received some proper parallel features. Analytic functions were introduced in Oracle 9i, released in 2001, which already supported parallelism. This is the first time in 17 years that Oracle improved analytic functions to work better with parallelism. Pretty poor product management, considering Exadata has been out for 10 years, but what do I know about priorities.

In earlier Oracle versions, windowing supported parallelism, but when it came to distributing the work across parallel workers, only the SEND RANGE method was supported, i.e. a distributed sort. This often skewed the data sent to the parallel workers, affecting performance significantly, especially when the input was already somewhat ordered, as often happens due to the natural order of data loads, or by design for improved data locality.

When that happened, 80% of the parallel workers would finish quickly, having processed only 10% of the data between them, while one worker handled the bulk of it and ran forever.

Oracle 10g introduced the SQL MODEL clause (link to 12.2 MODEL docs), which allowed some pretty advanced manipulation of the result set, similar to Microsoft Excel formulas. The MODEL language allowed analytic-style functions to be used, and a very powerful feature was the introduction of PARTITION BY in the definition of the MODEL. This split the data BY HASH amongst different parallel workers, allowing efficient usage of analytic functions, but was rather inconvenient, as it had to be within the MODEL clause.
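As a sketch of the idea (the table and column names here are hypothetical, not from my test setup), a MODEL clause with PARTITION BY looks roughly like this:

```sql
-- Hypothetical example: PARTITION BY in the MODEL clause lets the partitions
-- be hash-distributed across parallel workers. Table/columns are illustrative.
select /*+ PARALLEL(4) */ region_id, rn, amount, running_total
  from sales
 model
   partition by (region_id)
   dimension by (row_number() over (partition by region_id order by id) as rn)
   measures (amount, 0 as running_total)
   rules (
     running_total[any] = sum(amount)[rn <= cv(rn)]  -- running sum per partition
   );
```

Each hash partition of region_id can then be evaluated by its own PX worker, which is exactly the behavior plain analytic functions lacked.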

Even in 12.2, analytic functions are still a bit handicapped: they are not PARTITION aware. Here’s an example using the tlarge_p table (hash partitioned on ID) from my previous blogs in the series:

Serial plan:

select count(*) over (partition by id ) as rnk,t.*  from tlarge_p t;
-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name     | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time   | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |          |  88447 |    28M|       |  8042   (1)| 00:00:01 |       |       |
|   1 |  PARTITION HASH ALL         |          |  88447 |    28M|       |  8042   (1)| 00:00:01 |     1 |    16 |
|   2 |   WINDOW SORT               |          |  88447 |    28M|    31M|  8042   (1)| 00:00:01 |       |       |
|   3 |    TABLE ACCESS STORAGE FULL| TLARGE_P |  88447 |    28M|       |  2341   (1)| 00:00:01 |     1 |    16 |
-----------------------------------------------------------------------------------------------------------------

Parallel plan:

select /*+ PARALLEL(4)*/count(*) over (partition by id ) as rnk,t.*  from tlarge_p t;
--------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |  88447 |    28M|       |   651   (1)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                 |          |        |       |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10001 |  88447 |    28M|       |   651   (1)| 00:00:01 |       |       |  Q1,01 | P->S | QC (RAND)  |
|   3 |    WINDOW SORT                  |          |  88447 |    28M|    31M|   651   (1)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                  |          |  88447 |    28M|       |   650   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH               | :TQ10000 |  88447 |    28M|       |   650   (0)| 00:00:01 |       |       |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR         |          |  88447 |    28M|       |   650   (0)| 00:00:01 |     1 |    16 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL| TLARGE_P |  88447 |    28M|       |   650   (0)| 00:00:01 |     1 |    16 |  Q1,00 | PCWP |            |
--------------------------------------------------------------------------------------------------------------------------------------------------

The difference between the two plans is that the serial plan performs a WINDOW SORT for each partition: the WINDOW SORT is under (before) the PARTITION HASH ALL operator, so it sorts much less data and needs much less memory for the sort. The parallel plan sorts the ENTIRE result set. Basically, there is no equivalent of a partition-wise join for window functions in the parallel plan, even though partition-wise joins themselves work in both serial and parallel plans.

NOTE: Oracle 18c supports PARTITION-aware parallel analytics. See Antognini’s blog post. A new hint was added: USE_PARTITION_WISE_WIF.
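I haven’t tested this on 18c myself, but based on the hint name, the usage should be along these lines (treat it as an assumption, not a verified example):

```sql
-- Assumed 18c usage (untested): request partition-wise evaluation of the
-- window function on the hash-partitioned table tlarge_p.
select /*+ PARALLEL(4) USE_PARTITION_WISE_WIF */
       count(*) over (partition by id) as rnk, t.*
  from tlarge_p t;
```

If it works like partition-wise joins, each PX worker would sort only its own partitions, avoiding the full-result-set WINDOW SORT shown above.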

But enough background. Here’s what I’ve discovered about the PQ_DISTRIBUTE_WINDOW hint. Note that as of the writing of this blog, the PQ_DISTRIBUTE_WINDOW hint is still undocumented, and there are no references to it in MOS (My Oracle Support).

Reference for PQ_DISTRIBUTE_WINDOW:

Usage: PQ_DISTRIBUTE_WINDOW (@query_block data_distribution_method)

So far I’ve discovered three distribution methods:

Method 1 – HASH distribution then SORT. The input data set is hashed based on the partition key, and each worker performs a WINDOW SORT for its own data set. This is new in 12.1 and operates in the same manner as the SQL MODEL with PARTITION BY. Example:

select /*+ PARALLEL(4) PQ_DISTRIBUTE_WINDOW (@SEL$1 1) */count(*) over (partition by id ) as rnk,t.*  from tlarge t;
----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |    100K|    52M|       |   640   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                 |          |        |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10001 |    100K|    52M|       |   640   (1)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    WINDOW SORT                  |          |    100K|    52M|    55M|   640   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                  |          |    100K|    52M|       |   638   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH               | :TQ10000 |    100K|    52M|       |   638   (0)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR         |          |    100K|    52M|       |   638   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL| TLARGE   |    100K|    52M|       |   638   (0)| 00:00:01 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------

This approach is suitable when the keys (the PARTITION BY clause) are relatively equally sized, and won’t cause any one worker to receive a much larger data set to operate on. You can see the same hint, PQ_DISTRIBUTE_WINDOW(@SEL$1 1), recorded in the plan outline.
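To see how the optimizer records this choice, you can display the outline section of the cursor’s plan:

```sql
-- Display the plan of the last executed statement with its outline;
-- the OUTLINE section will contain the PQ_DISTRIBUTE_WINDOW hint chosen.
select * from table(dbms_xplan.display_cursor(format => '+OUTLINE'));
```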

Method 2 – SORT then HASH distribution. The input data set is sorted by each worker that reads the data, then it is sent via HASH distribution for consolidation, using the new WINDOW CONSOLIDATOR BUFFER step. This operates similarly to GBY_PUSHDOWN, where data is aggregated twice: once in each worker’s private result set and once in the “final” result set. Such pre-aggregation often reduces the amount of data exchanged between workers dramatically. In the case of WINDOW SORT, sort cost grows faster than linearly with the size of the data, so sorting smaller result sets is generally more efficient. Example:

select /*+ PARALLEL(4) PQ_DISTRIBUTE_WINDOW (@SEL$1 2) */count(*) over ( partition by mod5_id  ) as rnk,t.*  from tlarge t;
---------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |          |    100K|    52M|   640   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                  |          |        |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)            | :TQ10001 |    100K|    52M|   640   (1)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    WINDOW CONSOLIDATOR BUFFER    |          |    100K|    52M|   640   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                   |          |    100K|    52M|   640   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH                | :TQ10000 |    100K|    52M|   640   (1)| 00:00:01 |  Q1,00 | P->P | HASH       |
|   6 |       WINDOW SORT                |          |    100K|    52M|   640   (1)| 00:00:01 |  Q1,00 | PCWP |            |
|   7 |        PX BLOCK ITERATOR         |          |    100K|    52M|   638   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   8 |         TABLE ACCESS STORAGE FULL| TLARGE   |    100K|    52M|   638   (0)| 00:00:01 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------

This method is slightly more CPU-intensive, as data is sorted once in the worker that reads it, then re-shuffled and consolidated in another parallel worker. This can require twice the amount of memory for sorting. This method is suitable when the keys (the PARTITION BY clause) could be skewed, and it’s best to “distribute” the skew on the reader side as opposed to the final receiving side of the PQ workers.

With the same query, you can freely hint Method 1 or Method 2 by changing the hint argument to 1 or 2 accordingly.

Method 3 – SEND RANGE then SORT. This is the “classical” operator, with the same functionality since Oracle 9i. In my testing, it’s the “fallback” option when the partition keys and order keys are not the same. Data is sent from PQ readers to PQ receivers based on computed “split” points, and each PQ receiver sorts its share of the data. Example:

select /*+ PARALLEL(4) PQ_DISTRIBUTE_WINDOW (@SEL$1 3) */count(*) over ( partition by mod5_id  order by id) as rnk,t.*  from tlarge t;
----------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| E-Time   |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |    100K|    52M|       |   640   (1)| 00:00:01 |        |      |            |
|   1 |  PX COORDINATOR                 |          |        |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10001 |    100K|    52M|       |   640   (1)| 00:00:01 |  Q1,01 | P->S | QC (RAND)  |
|   3 |    WINDOW SORT                  |          |    100K|    52M|    55M|   640   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                  |          |    100K|    52M|       |   638   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|   5 |      PX SEND RANGE              | :TQ10000 |    100K|    52M|       |   638   (0)| 00:00:01 |  Q1,00 | P->P | RANGE      |
|   6 |       PX BLOCK ITERATOR         |          |    100K|    52M|       |   638   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL| TLARGE   |    100K|    52M|       |   638   (0)| 00:00:01 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------

In the above query, you cannot hint a switch to Method 1 or Method 2. The ORDER BY clause disables the pure multi-set aggregation and converts it to a running total, which requires the classical SEND RANGE distribution.

That’s it for now. Feel free to comment if you discover more methods and more scenarios for the enhanced analytical functions data distributions.



About the Author

An Oracle ACE with a deep understanding of databases, application memory, and input/output interactions, Christo is an expert at optimizing the performance of the most complex infrastructures. Methodical and efficiency-oriented, he equates the role of an ATCG Principal Consultant in many ways to that of a data analyst: both require a rigorous sifting-through of information to identify solutions to often large and complex problems. A dynamic speaker, Christo has delivered presentations at the IOUG, the UKOUG, the Rocky Mountain Oracle Users Group, Oracle Open World, and other industry conferences.
