I got to troubleshoot an amazing situation a few weeks ago. I think we can all agree that, assuming we are on a supported configuration of a production release of Oracle, it is essentially inconceivable that allowing a single query to run on your system could flip another query's plan and cause major performance issues (and, in this case, even downtime!).
There have been plenty of posts about bind peeking. Alex Gorbachev wrote about it last year, and so did Jonathan Lewis. It's a well-known issue. However, what hasn't been written about is when it can be expected to strike and cause you headaches.
A single query can change the plans of a number of other queries, but only sometimes.
One day you come to work (or get woken up) and you have a badly performing system. You look into it, and it comes down to one query. Nothing has changed, nothing has been modified.
I found the query with no problems. I did an explain plan; it came up with an index range scan. Given the execution time, however, it couldn't be a range scan: it was taking way too much time. Looking at v$sql_plan revealed a full table scan. At this point I was almost sure what the problem was.
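(For reference, the kind of check I mean looks roughly like this, assuming 10gR2 where dbms_xplan.display_cursor is available; the statement text, sql_id, and child number below are only placeholders for your own problem query.)
-- what EXPLAIN PLAN predicts:
explain plan for select * from t where status = :b1;
select * from table(dbms_xplan.display);
-- what the cursor in the shared pool is actually running with:
select sql_id, child_number from v$sql where sql_text like 'select * from t where status%';
select * from table(dbms_xplan.display_cursor('<sql_id>', <child_number>));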
How did I fix it? I did an "alter system flush shared_pool;". The query got reparsed, a new child cursor was created, and all new executions of the query used the expected and much faster index range scan.
Easy fix, but what happened? Nothing had been changed for months!
Sometimes it’s coincidence. The query just happens to run at the right time, or some job is delayed and queries run in a different order and the wrong plan is selected. Those are generally easy to catch as they happen early in the life of a system.
Sometimes it’s load. You decide to run an additional application, which causes more SQL statements to be aged out and re-parsed, creating a higher chance of parsing with “bad” values. Although the additional load will be blamed for the “slow” performance, the root cause will eventually be discovered. In a way, it will be expected that something will go bad when you increase the load.
And sometimes it’s a single ad hoc query with a new predicate that starts the slowly-ticking time bomb.
Here is how it happens . . .
When you first create a new table, it has no statistics. The default 10g statistics job picks it up during the night and analyzes the table. The next day, all queries against it start using the CBO (Cost Based Optimizer), and column usage statistics start being gathered.
What are column usage statistics? They are statistics collected in the system table sys.col_usage$. According to various sources, this information is flushed to it every 15 or so minutes. col_usage$ has various columns with counts:
SQL> desc sys.col_usage$;
Name Type Nullable Default Comments
----------------- ------ -------- ------- --------
OBJ# NUMBER Y
INTCOL# NUMBER Y
EQUALITY_PREDS NUMBER Y
EQUIJOIN_PREDS NUMBER Y
NONEQUIJOIN_PREDS NUMBER Y
RANGE_PREDS NUMBER Y
LIKE_PREDS NUMBER Y
NULL_PREDS NUMBER Y
TIMESTAMP DATE Y
Each of these columns will have a count for the number of times this type of predicate has been used. I don’t fully understand when the timestamp is updated but it is related to when the row was added or updated (but not all updates according to my tests).
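If you want to see what has been recorded for one of your own tables, a query along these lines works (run with DBA privileges; the join to obj$ and col$ just decodes obj# and intcol#, and 'T' is an example table name). As far as I can tell, dbms_stats.flush_database_monitoring_info will also push the in-memory usage data down to col_usage$, so you don't have to wait for the periodic flush:
exec dbms_stats.flush_database_monitoring_info;
select o.name table_name, c.name column_name,
       u.equality_preds, u.equijoin_preds, u.nonequijoin_preds,
       u.range_preds, u.like_preds, u.null_preds, u.timestamp
  from sys.col_usage$ u, sys.obj$ o, sys.col$ c
 where u.obj#    = o.obj#
   and u.obj#    = c.obj#
   and u.intcol# = c.intcol#
   and o.name    = 'T';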
Once you have some column usage statistics and your table has been modified sufficiently to be a candidate for new statistics, it will be analyzed during the night. Based on internal rules, histograms will be gathered for certain columns. During my testing, I've observed that only columns with cardinality greater than 1.5 will get histograms, and they also need at least equality predicates recorded; possibly other predicates (such as range or LIKE) trigger histogram collection as well.
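A simple way to check whether a column actually got a histogram after the nightly run is to count its bucket endpoints; more than two rows per column generally means a real histogram ('T' is again just an example table name):
select column_name, count(*) endpoints
  from user_tab_histograms
 where table_name = 'T'
 group by column_name;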
Then, the next time your query is parsed, it will use the new statistics and possibly come up with a new execution plan. For that decision, the optimizer will use the bind values of only the first execution of the query.
Here’s the sequence of events.
New table -> nightly job analyzes it, no histograms -> new queries populate col_usage$ -> table changes sufficiently to be a candidate for analysis -> table is re-analyzed, this time with histograms -> as queries are aged out of the shared pool, they are re-parsed with histograms and bind variable peeking.
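The step that ties col_usage$ to the histograms is the SIZE AUTO setting, which is what the 10g nightly job uses by default. Run by hand, it is roughly:
exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'T', cascade=>true, method_opt=>'FOR ALL COLUMNS SIZE AUTO');
SIZE AUTO bases the histogram decision both on the data skew and on the predicate history recorded in col_usage$.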
Where’s the devil here?
A single ad hoc query can drastically change the column usage statistics of a table. This may trigger the gathering of histograms (whenever the next analyze runs), and this, in turn, might arm a time bomb where the bind values used at parse time determine the execution plan of the query.
It's a long sequence of events, and you may argue it's very unlikely ever to happen. Consider this, though: every ad hoc query only ever adds to the column usage statistics.
One of the major problems is that there may be weeks between each step: from the time someone runs an ad hoc query, to the time the table has changed sufficiently to be re-analyzed, to the time the right load/delay/randomness happens and a really inefficient query plan is selected.
What is the solution? Disable bind peeking.
It is a feature I’ve never liked. To me, bind peeking is a workaround to enable histogram usage with bind variables. In my experience, this feature has caused me more headaches than saved-the-day moments.
This issue is somewhat addressed in 11g with the addition of multiple execution plans for the same statement based on different bind values. I haven't tested it yet, so I can't say how well it works.
For now, I just turn off bind peeking.
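For reference, the switch is the hidden parameter that also shows up in the comments below; since it is an underscore parameter, check with Oracle Support before setting it permanently:
-- per session, for testing:
alter session set "_optim_peek_user_binds" = false;
-- instance-wide; with scope=spfile it takes effect after the next restart:
alter system set "_optim_peek_user_binds" = false scope=spfile;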
21 Comments
Nice post, Christo. Turning off bind peeking comes with its own side effects, like suboptimal execution plans. A query running with a nice index range scan may start to use an ugly full table scan, or vice versa. You may need to add hints to some of the SQLs. It is a double-edged sword, I think.
I prefer keeping it enabled and changing the code to use literals (if there are not many different values queried, of course) when one wants to guarantee the best plan.
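(For the skewed-column example discussed further down the thread, that simply means writing the rare value as a literal, for instance:)
select * from t where status = 1;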
Hi Christo,
nice post! Indeed we too are having a lot of issues related to bind variable peeking and plans that change without apparent reasons. I am a bit reluctant to turn off bind peeking completely, though. Instead we make use of SQL hints and stored outlines to solve the more stubborn cases.
Cheers,
Luca
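For readers who haven't used stored outlines, a minimal sketch of the approach Luca describes might look like this; the outline name is made up, the statement text must match the application's exactly, and the outline should be created while the statement is getting the plan you want to keep:
create or replace outline status_scan_outline
  for category default
  on select * from t where status = :b1;
-- then tell the optimizer to honor stored outlines:
alter system set use_stored_outlines = true;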
Yas,
Do you have any test cases? I am curious what's happening in your case. Turning off bind peeking means that the value will not be considered when parsing the SQL. If the value is considered and has an effect on the execution plan, then chances are that for a different value it won't be optimal.
If I have to choose between hunting and hinting SQL, never knowing when something will 'get' me, and having the entire system running 10% slower all the time, guess which one I will pick?
Predictable performance is what most businesses want. Bind peeking prior to 11g is not helping.
Great post Christo, I've turned bind variable peeking off in a number of places for the very same reasons. Predictability is normally more important to my clients than absolutely optimal performance. We need a stable platform to troubleshoot issues from, and bind variable peeking does not allow for this. Rule #1 for production systems is manageability. I've even had to resort to deleting column stats and then locking the statistics in some cases to force Oracle not to use them.
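In 10g that combination looks roughly like this (table and column names are placeholders):
-- drop the histogram-bearing column statistics ...
exec dbms_stats.delete_column_stats(ownname=>user, tabname=>'T', colname=>'STATUS');
-- ... and keep the nightly job from putting them back
exec dbms_stats.lock_table_stats(ownname=>user, tabname=>'T');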
The simplest case possible:
SQL> create table t as select rownum id, 0 status from all_objects;
Table created.
SQL> insert into t values (1,1);
1 row created.
SQL> commit;
Commit complete.
SQL> create index tind on t(status);
Index created.
SQL> exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'T',cascade=>true,estimate_percent=>100,method_opt=>'for columns status size 10');
PL/SQL procedure successfully completed.
I have a table with a skewed data distribution in one column, and I also have a histogram on that column. My application is trying to find the rows with status=1, and my query has been running well using an index range scan with no problems. Then someone comes along and decides to disable bind peeking. My application now drives the users crazy. Here is how:
SQL> var b1 number;
SQL> exec :b1 := 1
PL/SQL procedure successfully completed.
SQL> set autot trace stat
SQL> select * from t where status=:b1;
Statistics
———————————————————-
56 recursive calls
0 db block gets
11 consistent gets
0 physical reads
0 redo size
468 bytes sent via SQL*Net to client
381 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> r
1* select * from t where status=:b1
Statistics
———————————————————-
0 recursive calls
0 db block gets
3 consistent gets
0 physical reads
0 redo size
468 bytes sent via SQL*Net to client
381 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> alter session set "_optim_peek_user_binds"=false;
Session altered.
SQL> alter system flush shared_pool;
System altered.
SQL> select * from t where status=:b1;
Statistics
———————————————————-
1086 recursive calls
0 db block gets
262 consistent gets
0 physical reads
0 redo size
464 bytes sent via SQL*Net to client
381 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
24 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> r
1* select * from t where status=:b1
Statistics
———————————————————-
0 recursive calls
0 db block gets
76 consistent gets
0 physical reads
0 redo size
464 bytes sent via SQL*Net to client
381 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
Now my query is doing 76 consistent gets instead of 3, because it started using a full table scan.
What I should have done in the first place for a query like this was not to use a bind variable but to use 1 as a literal; that way I can guarantee the best plan.
If you are disabling bind peeking in a production system to cure some SQLs, you may put yourself in a worse position and find out that some more critical applications have started performing badly.
If you are disabling bind peeking in a test system, finding the affected SQLs after the change, and tuning them, why not keep it enabled and find the SQLs affected by it in the same test system?
The ideal solution for me is to find the candidate SQL statements (the ones that are vulnerable to bind peeking) and start using literals for those, starting with the test system.
You are right that the default values for automatic statistics gathering in 10g can change some SQLs over time. We can simply not use the default value for method_opt, lock the histograms, and change them when needed.
By the way, I do not like bind peeking in 10g either.
Sorry for the formatting of my previous comment, I used the pre tags but as I see they are not working.
Interesting position of yours. I think that disabling bind peeking may be a solution, as it is in your situation, but I think it is not THE solution. Your analysis of sys.col_usage$ is interesting.
We had a situation where I think disabling bind peeking would not have been so good. The problem was a bad DB design, but bind peeking can work around it. A table has some columns; three of these columns are filled depending on the customer, so for a specific customer only one of these three columns has values, and those values are selective. For example, the table:
TEST (
pkid number,
a number,
b number,
c number,
descr varchar2(100));
In a customer DB, columns b and c are always null and column a instead is filled. Table TEST has millions of records, and we have a query like this:
select * from TEST where
a= :a or b=:B or c=:c
For specific values of :a, :b, :c the query returns about 10 records. Without bind peeking, the optimizer will choose a full table scan, which in our case would saturate the I/O system.
This is a very particular case, but I think bind peeking does a good job here.
Thanks,
Cristian
[…] another local post, Christo Kutrovsky writes that, on 10G you can only pick any two of bind peeking, ad hoc queries, and stable performance. “There have been plenty of posts about bind peeking. . . . It’s a well known […]
Yas,
What would you prefer: a system that runs at 90% performance consistently, or a system that runs at 99% performance but where occasionally a job never finishes or a web page takes 30 seconds?
In your example, I am sure you know what happens when you re-execute the same query with :b1 = 0.
I was asking for a test case where having bind peeking on is consistently superior to having it off. In your case, for 1 of the 40,000 values your query will run faster. For all other values, it will be significantly slower.
test> select * from t where status=:b1;
Elapsed: 00:00:00.02
Statistics
———————————————————-
1 recursive calls
0 db block gets
3 consistent gets
0 physical reads
0 redo size
576 bytes sent via SQL*Net to client
492 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
test> exec :b1 := 0;
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.01
test> select * from t where status=:b1;
40781 rows selected.
Elapsed: 00:00:00.12
Statistics
———————————————————-
0 recursive calls
0 db block gets
5580 consistent gets
0 physical reads
0 redo size
895309 bytes sent via SQL*Net to client
30390 bytes received via SQL*Net from client
2720 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
40781 rows processed
test> alter session set "_optim_peek_user_binds"=false;
Session altered.
Elapsed: 00:00:00.01
test> select * from t where status=:b1;
40781 rows selected.
Elapsed: 00:00:00.12
Statistics
———————————————————-
1 recursive calls
0 db block gets
2795 consistent gets
0 physical reads
0 redo size
895309 bytes sent via SQL*Net to client
30390 bytes received via SQL*Net from client
2720 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
40781 rows processed
Cristian Cudizio:
In 11g, where bind peeking works based on the value and creates multiple plans, it is probably doing a good job. (I haven't tested it yet.)
But in 10g it does a terrible job, unless you are always binding the same value. And if you are always binding the same value, why bind? (Assuming it's static and no SQL injection can happen.)
Thanks for your answer. Really, I was too lazy to post the complete example, and I made an error: the unused columns are filled with a dummy value (-1).
In my case bind peeking does a good job because the bind values are "homogeneous" with those used at peeking time: they behave as if they were the same value, even though they are not the same value. On the other hand, in my terrible situation, if that were not so, performance would never be good. The histograms say that the peeked value is highly selective; with an unknown value (as without bind peeking), the optimizer guesses that the value may be -1, which accounts for about 2/3 of the records, and decides a full scan is better. The query select * from TEST where a = :a or b = :b or c = :c is called a lot of times, so binding is better; the values passed for :a, :b, and :c are never the dummy -1, so it is better to use the indexes. The first time the query is parsed, bind peeking plus histograms let the CBO calculate a better execution plan based on the indexes on the three columns. Without bind peeking, Oracle does not know that I am passing a very selective value and chooses to do a full scan.
Maybe this is bad design, but with the CBO and bind peeking it works well, as it worked well with the RBO.
I hope I've been able to explain my example.
Thanks again for your attention.
Regards,
Cristian
cristiancudizio:
I see your point. See, that's the problem with using dummy values. Had you used NULLs, you wouldn't have this problem. Not to mention a bit of space saved in the table, and a lot of space saved in the index.
What you can do in such cases is patch up the statistics and have your a/b/c columns look more selective, but leave the histograms as they are.
Whenever you use the fixed value (-1), don't bind, and the histograms will give the right plan. Whenever you bind, you will always use the index.
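A sketch of what patching up the statistics could look like, using dbms_stats.set_column_stats; the table/column names and the factor of ten are purely illustrative, and how far to push the numbers is a judgment call:
declare
  l_distcnt number;
  l_density number;
  l_nullcnt number;
  l_srec    dbms_stats.statrec;
  l_avgclen number;
begin
  -- read the current column statistics
  dbms_stats.get_column_stats(ownname=>user, tabname=>'TEST', colname=>'A',
    distcnt=>l_distcnt, density=>l_density, nullcnt=>l_nullcnt,
    srec=>l_srec, avgclen=>l_avgclen);
  -- make the column look ten times more selective than the data suggests
  dbms_stats.set_column_stats(ownname=>user, tabname=>'TEST', colname=>'A',
    distcnt=>l_distcnt*10, density=>1/(l_distcnt*10));
end;
/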
cristiancudizio:
I did a bit more experimenting and just realized: you must have quite the skew in order to have 3 equality predicates and still get a FTS.
Anyways, you can avoid this by doing (with peeking turned off):
where a = :a
and a <> -1
where -1 is your popular value.
That way you are telling Oracle "I have an unknown value, but it's not a popular one". You could do the opposite by using between. Those are also workarounds in a way, but I think it's significantly more reliable like that.
This will have the same effect as if your 'default' value had been null to begin with. It changes your logic a bit, but... if you ever expect to get -1 as an argument, then it will be quite slow for that one 'call'.
And if it happens to be parsed with -1's, then everything will be slow ;)
Good answer, I like it a lot. I've tested it and it works well.
Compliments again.
Cheers,
Cristian
Christo, it is definitely correct that using the other value for the bind parameter will make it much worse than disabling bind peeking. But what I suggested was not to bind it at all. As I said, "I prefer keeping it enabled and changing the code to use literals (if there are not many different values queried of course) when one wants to guarantee the best plan."
I think we agree that for such a query using literals is the best solution.
By the way, it would really help if you could make a change to the blog to e-mail follow-up comments when someone comments on a post.
Yas, I will look into the email follow up option.
[…] in this case the paragraph on the RBO, together with some posts such as this one on Doug Burns's blog, then Christo Kutrovsky's on the Pythian blog, followed by a post by Alex Faktulin on histograms, made me reflect and change a […]
Friends,
How about trying the following approach?
1. Leave bind peeking on.
2. Turn off the 10g auto stats collection job (it does a terrible job anyway).
3. Get a list of tables and columns whose data is truly skewed, and create a script to collect histogram stats on them as follows (you can have one line for each table/column in the script): exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'T',cascade=>true,estimate_percent=>100,method_opt=>'for columns status size 10');
4. Create another script to collect stats on all other tables as appropriate, but make sure you put size 1 (this will not collect histograms: the old behavior). A sample script can be as follows (make sure you do not include in this script tables that you have in the other one):
exec dbms_stats.gather_table_stats(ownname=>user,tabname=>'T',cascade=>true,estimate_percent=>100,method_opt=>'for all columns size 1');
I believe bind peeking will become beneficial with this approach. Please post the results after your testing.
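For reference, step 2 is usually done like this in 10g (run as SYS; GATHER_STATS_JOB is the standard name of the automatic job):
exec dbms_scheduler.disable('GATHER_STATS_JOB');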
[…] “Optimize for” hint – there is a hint that overrides bind peeking (or parameter sniffing, as I think SQL Server calls it). You can say options (optimize for (@bind1 = ‘LONDON’ @bind2 unknown)) to tell the optimizer to optimize as if bind variable (parameter) bind1 was set to ‘LONDON’ and bind2 was not peekable. What a great way to deal with the problems of instability caused by bind variable peeking. […]