I got to troubleshoot an amazing situation a few weeks ago. I think we can all agree that, assuming we are on a supported configuration of a production release of Oracle, it is essentially inconceivable that allowing a single query to run on your system could flip another query’s plans and cause major performance issues (and in this case, even downtime!)
There have been plenty of posts about bind peeking. Alex Gorbachev wrote about it last year, and so did Jonathan Lewis. It’s a well-known issue. However, what hasn’t been written about is when it can be expected to strike and cause you headaches.
A single query can change the plans of a number of other queries, but only sometimes.
One day you come to work (or get woken up) and you have a badly performing system. You look into it, and it all comes down to one query. Nothing has changed; nothing has been modified.
I found the query with no problem. An explain plan came up with an index range scan. Given the execution time, however, it couldn’t be a range scan; it was taking way too much time. Looking at v$sql_plan revealed a full table scan. At this point I was almost sure what the problem was.
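To confirm what a statement is actually doing, it helps to compare the EXPLAIN PLAN output against the plan sitting in the cursor cache. A sketch of that check (the SQL text filter and the &sql_id substitution are placeholders for your own statement):

```sql
-- What EXPLAIN PLAN *predicts* is not always what is executed.
-- Find the cursor for the suspect statement (filter is a placeholder):
SELECT sql_id, child_number, plan_hash_value, executions
FROM   v$sql
WHERE  sql_text LIKE 'SELECT /* suspect query */%';

-- Show the plan the cursor is actually using:
SELECT * FROM TABLE(dbms_xplan.display_cursor('&sql_id', NULL));
```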
How did I fix it? I ran “alter system flush shared_pool;”. The query got reparsed, a new child cursor was created, and all new executions of the query used the expected, much faster index range scan.
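For reference, the heavy-handed fix is a one-liner; on later releases there is also DBMS_SHARED_POOL.PURGE to evict just the one offending cursor (the address/hash values below are placeholders taken from v$sqlarea):

```sql
-- Reparse everything (also throws away every other cached cursor):
ALTER SYSTEM FLUSH SHARED_POOL;

-- More surgical: purge a single cursor by its ADDRESS and HASH_VALUE
-- from v$sqlarea ('C' means a cursor). Availability depends on
-- version/patch level, so test before relying on it.
EXEC dbms_shared_pool.purge('&address,&hash_value', 'C');
```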
Easy fix, but what happened? Nothing had been changed for months!
Sometimes it’s coincidence. The query just happens to run at the right time, or some job is delayed and queries run in a different order and the wrong plan is selected. Those are generally easy to catch as they happen early in the life of a system.
Sometimes it’s load. You decide to run an additional application, which causes more SQL statements to be aged out and re-parsed, creating a higher chance of parsing with “bad” values. Although the additional load will be blamed for the “slow” performance, the root cause will eventually be discovered. In a way, it will be expected that something will go bad when you increase the load.
And sometimes it’s a single ad hoc query with a new predicate that starts the slowly-ticking time bomb.
Here is how it happens . . .
When you first create a new table, it has no statistics. The default 10g statistics-gathering job picks it up overnight and analyzes the table. From the next day, queries against it use the CBO (Cost-Based Optimizer), and column usage statistics start to be gathered.
What are column usage statistics? They are statistics collected in the system table sys.col_usage$. According to various sources, this information is flushed to the table every 15 or so minutes.
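You can peek at what has been recorded for a given table by joining col_usage$ to the dictionary base tables. A sketch (MY_TABLE is a placeholder; this requires access to the SYS-owned tables):

```sql
-- Predicate-usage counters recorded for each column of a table
SELECT c.name AS column_name,
       u.equality_preds, u.equijoin_preds, u.nonequijoin_preds,
       u.range_preds, u.like_preds, u.null_preds,
       u.timestamp
FROM   sys.col_usage$ u
       JOIN sys.obj$ o ON o.obj#    = u.obj#
       JOIN sys.col$ c ON c.obj#    = u.obj#
                      AND c.intcol# = u.intcol#
WHERE  o.name = 'MY_TABLE';
```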
col_usage$ has various columns with counts:
SQL> desc sys.col_usage$
 Name              Type   Nullable
 ----------------- ------ --------
 OBJ#              NUMBER Y
 INTCOL#           NUMBER Y
 EQUALITY_PREDS    NUMBER Y
 EQUIJOIN_PREDS    NUMBER Y
 NONEQUIJOIN_PREDS NUMBER Y
 RANGE_PREDS       NUMBER Y
 LIKE_PREDS        NUMBER Y
 NULL_PREDS        NUMBER Y
 TIMESTAMP         DATE   Y
Each of these columns holds a count of the number of times that type of predicate has been used against the column. I don’t fully understand when TIMESTAMP is updated, but it is related to when the row was added or updated (though not on every update, according to my tests).
Once you have some column usage statistics and your table has been modified sufficiently to be a candidate for fresh statistics, it will be analyzed during the night. Based on internal rules, histograms will be gathered for certain columns. During my testing, I’ve observed that only columns with cardinality greater than 1.5 get histograms. They also need to have at least equality predicates recorded; possibly other predicates (like range or LIKE) trigger histogram collection as well.
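The mechanism behind this is the SIZE AUTO setting of METHOD_OPT, which the default job uses; it decides per column, from col_usage$ and the data distribution, whether a histogram is worthwhile. A sketch of exercising it by hand (schema and table names are placeholders):

```sql
BEGIN
  dbms_stats.gather_table_stats(
    ownname    => 'SCOTT',
    tabname    => 'MY_TABLE',
    method_opt => 'FOR ALL COLUMNS SIZE AUTO');  -- let Oracle decide
END;
/

-- Check which columns ended up with histograms
SELECT column_name, histogram, num_buckets
FROM   user_tab_columns
WHERE  table_name = 'MY_TABLE';
```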
Then, the next time your query is hard-parsed, it will use the new statistics and possibly select a different (one hopes better) execution plan. For that decision, the optimizer peeks at the bind values of only the first execution of the query.
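You can watch this happen with a contrived example (the table, column, and values are invented): whichever bind value is supplied at the first, hard-parse execution fixes the plan for everyone who shares the cursor afterwards.

```sql
VARIABLE status VARCHAR2(10)

-- Hard parse happens here; the optimizer peeks at 'RARE' and,
-- with a histogram on STATUS, may choose an index range scan.
EXEC :status := 'RARE';
SELECT COUNT(*) FROM my_table WHERE status = :status;

-- This execution reuses the same cursor and the same plan,
-- even though 'COMMON' might be better served by a full scan.
EXEC :status := 'COMMON';
SELECT COUNT(*) FROM my_table WHERE status = :status;

-- Show the plan shared by both executions
SELECT * FROM TABLE(dbms_xplan.display_cursor(NULL, NULL));
```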
Here’s the sequence of events.
New table -> nightly job analyzes it, no histograms -> new queries populate col_usage$ -> table changes sufficiently to be a candidate for re-analysis -> table is re-analyzed, this time with histograms -> as queries age out of the shared pool, they are re-parsed with histograms and bind variable peeking.
Where’s the devil here?
A single ad hoc query can drastically change the column usage statistics of a table. This may trigger the gathering of histograms (whenever the next analyze runs), and this, in turn, may arm a time bomb: whichever bind values happen to be used at parse time will determine the execution plan of the query.
It’s a long sequence of events, and you may argue it’s very unlikely ever to happen. Consider this, though: every ad hoc query will only add to the column usage statistics.
One of the major problems is that weeks may pass between each step: from the time someone runs an ad hoc query, to the time the table has changed sufficiently to be re-analyzed, to the time the right load/delay/randomness happens and a really inefficient query plan is selected.
What is the solution? Disable bind peeking.
It is a feature I’ve never liked. To me, bind peeking is a workaround to enable histogram usage with bind variables, and in my experience it has caused me more headaches than saved-the-day moments.
This issue is somewhat addressed in 11g with adaptive cursor sharing, which allows multiple execution plans for a statement based on different bind values. I haven’t tested it yet, so I can’t say how well it works.
For now, I just turn off bind peeking.
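For completeness, turning it off means setting a hidden (underscore) parameter, which is normally something to clear with Oracle Support first:

```sql
-- Hidden parameter: disables bind peeking instance-wide.
-- Underscore parameters should be set with Oracle Support's guidance.
ALTER SYSTEM SET "_optim_peek_user_binds" = FALSE SCOPE = BOTH;
```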