Oracle Parallel Query Hints Reference – Part 6: USE_PARTITION_WISE_GBY

Posted in: Oracle, Technical Track

Welcome to Part 6 of the series.

The USE_PARTITION_WISE_GBY hint is brand new, introduced in Oracle 12.2. I personally welcome it, as I've cursed many times in the past over the lack of control for this aspect of parallelism and partitioning.

Purpose: Controls whether to assign PARTITIONS (PX PARTITION HASH) or BLOCK RANGES (PX BLOCK) to Parallel workers.

In the absence of a hint, Oracle decides whether to use partition or block-range granules based on the degree of parallelism and the number of partitions. This can lead to dramatic differences in performance, as the partition method can often leverage additional optimizations such as partition-wise joins (partial or full), partition-wise group by, or, in general, better data distribution. The two most important are partition-wise joins, as they often eliminate the need for buffering the join result set, followed by partition-wise group by (and select distinct), as those reduce the amount of data that needs to be sorted. And sort cost grows faster than linearly with the size of the input.
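That last point is easy to illustrate with a simple comparison-sort cost model (n * log2(n); a back-of-the-envelope sketch, not an Oracle costing formula): splitting one large sort into 16 independent per-partition sorts reduces the total number of comparisons.

```python
import math

def sort_cost(n: int) -> float:
    """Comparison-sort cost model: roughly n * log2(n) comparisons."""
    return n * math.log2(n) if n > 1 else 0.0

rows = 1_000_000
one_big_sort = sort_cost(rows)              # group by over the whole set at once
per_partition = 16 * sort_cost(rows // 16)  # 16 independent partition-wise sorts
print(f"whole-set cost / partition-wise cost = {one_big_sort / per_partition:.2f}")
```

The ratio only grows as the table grows, which is why partition-wise group by pays off most on large data sets.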

If you have a table with 32 partitions and request a parallelism of 64, Oracle will choose PX BLOCK in order to maximize parallelism. In the case of HASH partitions, you can easily adjust the requested parallelism to match the number of partitions, but in the case of RANGE partitions, a fixed date predicate can sometimes change the number of partitions involved and defeat your intentions.

In the past, I've had to use tricks such as setting a hidden parameter via the OPT_PARAM hint to force the assignment of 1 worker per partition: /*+ opt_param('_px_partition_scan_threshold', 1) */. With the introduction of this hint, we have much more reliable control.

Examples:
NOTE: The NO_GBY_PUSHDOWN hint is there only to make the plan slightly easier to read.

select /*+PARALLEL(16) NO_GBY_PUSHDOWN */ id, count(*) from tlarge_p t group by id;
------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |          |  88447 |  1122K|   164   (2)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR                 |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)           | :TQ10001 |  88447 |  1122K|   164   (2)| 00:00:01 |       |       |  Q1,01 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY                |          |  88447 |  1122K|   164   (2)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   4 |     PX RECEIVE                  |          |  88447 |  1122K|   162   (0)| 00:00:01 |       |       |  Q1,01 | PCWP |            |
|   5 |      PX SEND HASH               | :TQ10000 |  88447 |  1122K|   162   (0)| 00:00:01 |       |       |  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR         |          |  88447 |  1122K|   162   (0)| 00:00:01 |     1 |    16 |  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS STORAGE FULL| TLARGE_P |  88447 |  1122K|   162   (0)| 00:00:01 |     1 |    16 |  Q1,00 | PCWP |            |
------------------------------------------------------------------------------------------------------------------------------------------

The tlarge_p table has 16 partitions, but even when I request parallelism 16, the Optimizer chooses PX BLOCK ITERATOR, and the HASH GROUP BY in Step #3 performs a group by of the entire result set in one shot. The work performed is as follows:

– Assign a block range to PQ Set A
– PQ Set A reads a block range, hashes the ID, and sends the record to PQ Set B
– PQ Set B receives a record and performs a HASH GROUP BY
– Once all PQ Set A workers are complete, PQ Set B returns the results to the QC
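The re-shuffle step can be modeled outside the database. Below is a purely conceptual Python sketch (not Oracle code) of the two-stage PX BLOCK plan: rows arrive in block-range order, each row is hashed on ID to pick its consumer worker, and each consumer aggregates only the rows routed to it:

```python
from collections import Counter, defaultdict

def block_range_group_by(rows, num_workers):
    """Model of the PX BLOCK plan: rows arrive in arbitrary block-range
    order, so every row must be re-shuffled (hashed on ID) to the
    consumer worker responsible for that ID before aggregation."""
    # PX SEND HASH: each producer hashes the ID and sends the row onward
    inboxes = defaultdict(list)               # one inbox per PQ Set B worker
    for row_id in rows:
        inboxes[hash(row_id) % num_workers].append(row_id)
    # HASH GROUP BY: each consumer aggregates only the IDs routed to it
    result = Counter()
    for worker, ids in inboxes.items():
        result.update(Counter(ids))           # per-worker aggregation
    return dict(result)

rows = [1, 2, 1, 3, 2, 1]                     # IDs read from block ranges
print(block_range_group_by(rows, num_workers=4))  # {1: 3, 2: 2, 3: 1}
```

Every single row crosses the producer-to-consumer boundary before any aggregation happens, which is exactly the P->P HASH re-distribution visible in the plan.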

select /*+PARALLEL(8) NO_GBY_PUSHDOWN */ id, count(*) from tlarge_p t group by id;
----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |  88447 |  1122K|   326   (1)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR               |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)         | :TQ10000 |  88447 |  1122K|   326   (1)| 00:00:01 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX PARTITION HASH ALL      |          |  88447 |  1122K|   326   (1)| 00:00:01 |     1 |    16 |  Q1,00 | PCWC |            |
|   4 |     HASH GROUP BY             |          |  88447 |  1122K|   326   (1)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   5 |      TABLE ACCESS STORAGE FULL| TLARGE_P |  88447 |  1122K|   325   (0)| 00:00:01 |     1 |    16 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------------

With parallelism 8, I get PX PARTITION HASH ALL. Notice the HASH GROUP BY dropped to Step 4, and there is no longer a final re-shuffle. Those are significant savings in extra processing. Each PQ worker picks up a partition, aggregates it, and returns the result set. One step. The work is as follows:

– Assign a partition to PQ Set A
– PQ Set A reads a partition and aggregates the data by id
– Once all PQ Set A workers are complete, PQ Set A returns the results to the QC (no re-shuffle)
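The same kind of conceptual sketch for the partition-wise plan (again a model, not Oracle code): because tlarge_p is hash-partitioned on ID, every value of ID lives in exactly one partition, so each worker aggregates its own partition and nothing needs to be re-distributed:

```python
from collections import Counter

def partition_wise_group_by(partitions):
    """Model of the PX PARTITION HASH ALL plan: all rows for a given ID
    live in exactly one partition, so each worker aggregates its
    partition independently and no inter-worker re-shuffle is needed."""
    results = {}
    for part in partitions:               # one PQ Set A worker per partition
        results.update(Counter(part))     # local HASH GROUP BY, then straight to QC
    return results

# Hash partitioning guarantees each ID appears in only one partition.
partitions = [[1, 1, 3, 3, 1], [2, 2, 4]]
print(partition_wise_group_by(partitions))
```

Compared with the block-range model above, no row ever leaves the worker that read it, which is why the plan needs only a single table queue.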

Now, armed with the new USE_PARTITION_WISE_GBY hint, I can force the partition-wise operation even with 32 workers:

select /*+PARALLEL(32) USE_PARTITION_WISE_GBY  NO_GBY_PUSHDOWN */ id, count(*) from tlarge_p t group by id;

----------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name     | E-Rows |E-Bytes| Cost (%CPU)| E-Time   | Pstart| Pstop |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |          |  88447 |  1122K|    82   (2)| 00:00:01 |       |       |        |      |            |
|   1 |  PX COORDINATOR               |          |        |       |            |          |       |       |        |      |            |
|   2 |   PX SEND QC (RANDOM)         | :TQ10000 |  88447 |  1122K|    82   (2)| 00:00:01 |       |       |  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX PARTITION HASH ALL      |          |  88447 |  1122K|    82   (2)| 00:00:01 |     1 |    16 |  Q1,00 | PCWC |            |
|   4 |     HASH GROUP BY             |          |  88447 |  1122K|    82   (2)| 00:00:01 |       |       |  Q1,00 | PCWP |            |
|   5 |      TABLE ACCESS STORAGE FULL| TLARGE_P |  88447 |  1122K|    81   (0)| 00:00:01 |     1 |    16 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------------------------------

Given that I only have 16 partitions, the query will run at an effective parallelism of 16, and the other 16 workers will remain idle, but with high efficiency and no data re-shuffling.

Oracle 18.1 finally introduces partition-wise support for window functions such as ROW_NUMBER() and RANK() in parallel queries, along with the USE_PARTITION_WISE_WIF hint. Serial queries have supported partition-wise windowing since Oracle 9i (released in 2001); adding parallel support only took 17 years. See Antognini's blog for more details (link).



About the Author

An Oracle ACE with a deep understanding of databases, application memory, and input/output interactions, Christo is an expert at optimizing the performance of the most complex infrastructures. Methodical and efficiency-oriented, he equates the role of an ATCG Principal Consultant in many ways to that of a data analyst: both require a rigorous sifting-through of information to identify solutions to often large and complex problems. A dynamic speaker, Christo has delivered presentations at the IOUG, the UKOUG, the Rocky Mountain Oracle Users Group, Oracle Open World, and other industry conferences.
