Ever wondered why recreating certain indexes takes forever, even when you do so in parallel? Ever wondered why certain PQ queries just don’t run that fast?
Here’s a serious performance bug that’s been in Oracle for a while, and finally there are hints of it been fixed, but only partially.
The bug happens when performing sorting operations in parallel, and the source data is already well sorted. The “ranger” doesn’t do a good job of properly assigning row ranges to sorter processes, and ~90% of the rows end up being sent to the same parallel process, regardless of the level of parallelism. So even if you have 256 CPUs, only about 10% performance improvement is achieve, instead of a factor of your parallelism when running the query in parallel.
For example, if the non parallel sort/index creation took 45 minutes, running with parallel 32 will take 41 minutes, instead of the possible 1.4 minutes (assuming you have sufficient horsepower).
When running a “Sort” operation in parallel, there are 2 sets of parallel processes. Producers and consumers. The number of producer/consumer pairs depends on your parallelism settings. So in the case of parallel 32 there are 32 producers and 32 consumers. This is well illustrated in the Oracle documentation HERE (Oracle web based documentation down, will update later). As each ‘producer’ is reading data it is sending it to the appropriate consumer for that “range”. For example, consumer 1 takes A-B, consumer 2 takes C-F, consumer 3 G-L and etc. The exact split is dynamically calculated by the “Ranger” process. Unfortunately it doesn’t work so good with sorted data.
The same applies to index creation. Index creation is basically a big sort, followed by writing out the result set into a B-Tree structure. Index creation suffers from the exact same ranging issues, at least until 11.2.
Here’s an example:
-- Create mini-sample table create table mytest_s as select rownum r from dual connect by level <=400000; -- Fetch only 1 row, no need to fetch all begin for c in ( select /*+PARALLEL(t,4)*/ * from mytest_s t order by 1) loop exit; end loop; end; / select dfo_number "d", tq_id as "t", server_type, num_rows,rpad('x',round(num_rows*10/nullif(max(num_rows) over (partition by dfo_number, tq_id, server_type),0)),'x') as "pr", round(bytes/1024/1024) mb, process, instance i,round(ratio_to_report (num_rows) over (partition by dfo_number, tq_id, server_type)*100) as "%", open_time, avg_latency, waits, timeouts,round(bytes/nullif(num_rows,0)) as "b/r" from v$pq_tqstat order by dfo_number, tq_id, server_type desc, process; d t SERVER_TYPE NUM_ROWS pr MB PROCESS I % OPEN_TIME AVG_LATENCY WAITS TIMEOUTS b/r - - ----------- ---------- ----------- --- ------- - ---- ---------- ----------- ------- ---------- ---- 1 0 Ranger 372 xxxxxxxxxx 0 QC 1 100 0 0 0 0 11 1 0 Producer 126144 xxxxxxxxxx 1 P012 1 32 0 0 19 2 6 1 0 Producer 47304 xxxx 0 P013 1 12 0 0 8 1 6 1 0 Producer 110376 xxxxxxxxx 1 P014 1 28 0 0 17 2 6 1 0 Producer 116176 xxxxxxxxx 1 P015 1 29 0 0 16 0 6 1 0 Consumer 7885 0 P008 1 2 0 0 6 1 5 1 0 Consumer 7884 0 P009 1 2 0 0 6 1 6 1 0 Consumer 7884 0 P010 1 2 0 0 7 2 6 1 0 Consumer 376347 xxxxxxxxxx 2 P011 1 94 0 0 16 4 6 1 1 Producer 5536 xxxxxxxxxx 0 P008 1 55 0 0 577 568 3 1 1 Producer 4508 xxxxxxxx 0 P011 1 45 0 0 587 571 4 1 1 Consumer 100 xxxxxxxxxx 0 QC 1 100 0 0 1 0 161
As you can see from this test case, the sorter processes (4 consumers) had a very uneven split, with 94% of the rows been sent to only one consumer. I tested this case with parallel 64, and in that case 90% gets sent to 1 consumer, with the other 10% evenly distributed on the remaining ones.
This essentially reduces your execution time by at most 10%.
A very similar thing happens if you create an index. I did, however, catch an anomaly in the test case. With parallel 4, the distribution is 66/34/0/0, while with parallel 8, it’s 100/0/0/0/0/0/0/0. I.e., terrible as all the work will be performed by 1 process, absolutely the same as Serial, only a little worse as there will be inter-process communication.
create index on mytest_s (r) parallel 4; select dfo_number "d", tq_id as "t", server_type, num_rows,rpad('x',round(num_rows*10/nullif(max(num_rows) over (partition by dfo_number, tq_id, server_type),0)),'x') as "pr", round(bytes/1024/1024) mb, process, instance i,round(ratio_to_report (num_rows) over (partition by dfo_number, tq_id, server_type)*100) as "%", open_time, avg_latency, waits, timeouts,round(bytes/nullif(num_rows,0)) as "b/r" from v$pq_tqstat order by dfo_number, tq_id, server_type desc, process; SERVER_TYPE NUM_ROWS pr MB PROCESS I % OPEN_TIME AVG_LATENCY WAITS TIMEOUTS b/r ----------- ---------- ----------- -- -------- - --- ---------- ----------- ---------- ---------- ---------- Ranger 12 xxxxxxxxxx 0 QC 1 100 0 0 1 0 3974 Producer 132601 xxxxxxxxxx 2 P004 1 33 0 0 19 1 18 Producer 95265 xxxxxxx 2 P005 1 24 0 0 14 2 18 Producer 95265 xxxxxxx 2 P006 1 24 0 0 14 1 18 Producer 79497 xxxxxx 1 P007 1 20 0 0 11 0 18 Consumer 262308 xxxxxxxxxx 4 P000 1 66 0 0 76 73 18 Consumer 164 0 P001 1 0 0 0 77 74 19 Consumer 164 0 P002 1 0 0 0 76 73 19 Consumer 137364 xxxxx 2 P003 1 34 0 0 76 73 18 Producer 1 xxxxxxxxxx 0 P000 1 25 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P001 1 25 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P002 1 25 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P003 1 25 0 0 0 0 322 Consumer 4 xxxxxxxxxx 0 QC 1 100 0 0 1 0 322 create index on mytest_s (r) parallel 8; SERVER_TYPE NUM_ROWS pr MB PROCESS I % OPEN_TIME AVG_LATENCY WAITS TIMEOUTS b/r ----------- ---------- ------------- -- -------- - --- ---------- ----------- ---------- ---------- ---------- Ranger 0 0 QC 1 0 0 6 2 Producer 47304 xxxxxxxx 1 P008 1 12 0 0 14 2 18 Producer 51246 xxxxxxxxx 1 P009 1 13 0 0 15 1 18 Producer 47304 xxxxxxxx 1 P010 1 12 0 0 15 2 18 Producer 45220 xxxxxxxx 1 P011 1 11 0 0 15 2 18 Producer 59130 xxxxxxxxxx 1 P012 1 15 0 0 17 3 18 Producer 35478 xxxxxx 1 P013 1 9 0 0 10 3 18 Producer 55188 xxxxxxxxx 1 P014 1 14 0 0 16 1 18 Producer 59130 xxxxxxxxxx 1 P015 1 15 0 0 17 1 18 Consumer 400000 xxxxxxxxxx 7 P000 1 100 0 0 52 49 18 Consumer 0 0 P001 1 0 0 0 52 49 Consumer 0 0 P002 1 0 0 0 52 49 Consumer 0 0 P003 1 0 0 0 52 49 Consumer 0 0 P004 1 0 0 0 52 49 Consumer 0 0 P005 1 0 0 0 52 49 Consumer 0 0 P006 1 0 0 0 52 49 Consumer 0 0 P007 1 0 0 0 52 49 Producer 1 xxxxxxxxxx 0 P000 1 13 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P001 1 13 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P002 1 13 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P003 1 13 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P004 1 13 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P005 1 13 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P006 1 13 0 0 0 0 322 Producer 1 xxxxxxxxxx 0 P007 1 13 0 0 0 0 322 Consumer 8 xxxxxxxxxx 0 QC 1 100 0 0 2 1 322
To further explore the implications of this bug, I created a more elaborate test case. I created several types of data, and tested ordering against each “class.”
create table mytest as select rownum pk, trunc(dbms_random.value(0,400000)) rnd, floor(rownum/1000) type_1000, floor(rownum/5) type_5, mod(rownum,1000) type_1000mod,mod(rownum,5) type_5mod,mod(rownum,10009) type_10000mod, sysdate-rownum/100 as dt, rpad('x',10,'x') pad from dual connect by level <=400000;
To spare you some of the testing, here are my results. Query ran:
select /*+PARALLEL(t,[DEGREE])*/* from mytest order by [ORDER BY];
Degree | Order by | Distribution(%) |
---|---|---|
4 | pk | 94/2/2/2 |
64 | pk | 93/0… |
4 | rnd | 28/21/24/27 |
64 | rnd | 2/1/2/1… |
4 | type_1000 | 94/2/2/2 |
4 | type_5 | 94/2/2/2 |
64 | type_5 | 93/0/0…. |
4 | type_5mod | 40/20/20/20 |
8 | type_5mod | 20/20/0/20/20/0/20/0 |
8 | type_1000mod | 17/13/8/6/4/4/4/43 |
64 | type_1000mod | 4/3/1/4/2/1…. |
4 | dt | 96/2/2/0 |
64 | dt | 93/0/0/0… |
4 | type_5,pk | 2/2/2/94 |
4 | type_5,rnd | 2/2/2/94 |
4 | type_5_mod,pk | 20/20/20/39 |
64 | type_5_mod,pk | 19/19/19/19/19/0/0… |
4 | type_5_mod,rnd | 26/24/26/24 |
A few quick conclusions:
- The first column of order matters for distribution.
- If the Order by column has repeated values, PQ sort will be limited by number of distinct values, but only if they are not grouped together.
- Index creation on time series (log table, stock table) is slow, “type” indexes are slow.
- If you reorder the keys of an index, you may affect the time it takes to create it.
- Follow up on the preceding, especially true if you try to put low cardinality columns first to improve compression.
One could dispute in which cases data in the table is ordered, and it’s amazing how many cases this is:
- Time series data — ever-growing data. The PK is ordered; the “insert date” is somewhat ordered.
- Data warehouses — bulk load files are often ordered via some conditions.
- Sometimes it is good to reorder a table, to improve data locality and compression in data warehouses. This can, however, have negative effects on index build time.
- Sometimes one of the intermediate steps will return an ordered set for the final processing.
One example of the last type, is analytics. But that’s for a separate blog post.
And finally, to end on an optimistic note, it appears that 11.2 has the index creation issue resolved, but the order by in queries is still bad.
12 Comments. Leave new
Thanks for the sharing.
Do you(OP) have a metalink note to meet this phenomenon?
Dion, I am not sure what exactly you mean by “meet”, could you elaborate?
Christo.
Sorry for wrong word. :(
To rephrase my question –
“Is there any metalink document that is registered as a bug or something on this phenomenon?”
I am not aware of any such document, but I haven’t been very thorough in my search.
As I said, this is semi-fixed in 11g Release 2 (11.2) so Oracle knows about it.
I will make separate post with example at some point.
Hi Christo,
an interesting observation. Have you experimented with a bigger table – for ex., 20-40M rows? Since PX is made for that type of volumes, not 400K, it is reasonable to test with appropriate data volumes, especially with such high DOPs.
I’ve tried to repeat your case – but unfortunately for some reason v$pq_tqstat didn’t show statistics for PX slaves as a Consumer (only Producers). That was 10.2.0.4 & 10.2.0.4.2.
What Oracle version have you used?
Actually that’s how I discovered this, by having big table’s parallel operations take forever.
I’ve tested this in 10.2.0.4, 11.1.0.7 and 11.2 base.
I will try my example with larger tables to confirm it scales the same as when I discovered this.
[…] 5-Bug on Running Parallel Query with Order by on already sorted data causes problem? Christo Kutrovsky-Oracle Parallel Query Sorting and Index Creation Performance Problems […]
Did you find any metalink documentation since this post was made? we encounter the same problem but can’t find any official documentation of the problem.
Hi Zohar,
I don’t have any metalink documents to give you. I have been able to workaround the issue in some cases, but it’s mostly not going anywhere.
I do have a blog almost ready with a larger dataset, but it only confirms the issue further. i.e. not much change.
The reason you are getting uneven distribution of rows across producer processes is, that part is running in serial.
you did not specify parallel clause for table, while you specify parallel clause for index.
try
create table big_emp parallel (degree 4)
as select * from emp;
CREATE INDEX emp_ix ON emp (emp_id)
PARALLEL (DEGREE 4);
It would be interesting to see the results.
On queries like the one in my example it is sufficient to have a parallel hint to have the query run fully in parallel.
My entire test case is documented, I invite you to study the explain plans and re-run my test case.
Yes, confirmed. it was a bug. I check our all databases 10g and 11.1.6. the problem is, when query go parallel but the data is unevenly distributed across the slave processes which increases the query execution time. in case if all the 100% of data is being assigned to a slave process. the time of serial processing is nearly equal to time of parallel processing. in fact then parallel processing take more time.
cannot found this problem with 11.2.
I am expecting that you write something about order by clause in 11.2 and lets open it for discussion
Thanks.