Select Statement Generating Redo and Other Mysteries of Exadata

Posted in: Technical Track

Like many good stories, this one also started with an innocent question from a customer:

“I often check “SELECT COUNT(*) FROM TABLE;” on our old cluster and the new Exadata.
The Exadata is slower than our old cluster by few minutes.
Exadata is not good at count(*) ? or other problem?”

To which I replied:
“Exadata should be AMAZING at count(*)…
so, we definitely need to check why it took so long…
which query?”

Exadata should do count(*) either from memory or by using smart scans, and since my friend is counting fairly large tables that won’t all fit into memory, Exadata should be significantly faster than their old system.

So, I started by trying to reproduce the issue. The customer sent me his slow query, and I ran it. The same query that took him 3 minutes to run, took me 2 seconds. Clearly we were not doing the same thing, and disk access was then our main suspect.

To see what his executions indicate, I asked the customer to run the same script adding “SET AUTOTRACE ON” at the top and send me the script and the results.
“SET AUTOTRACE ON” is far from perfect, but an easy way to get an overview of the issue without annoying the customer into non-compliance.

Here’s the statistics section for one of the queries:

0  				recursive calls
0  				db block gets
54903  			        consistent gets
54899 			        physical reads
2855188  		        redo size
552  			        bytes sent via SQL*Net to client
520  			        bytes received via SQL*Net from client
2  				SQL*Net roundtrips to/from client
0  				sorts (memory)
0  				sorts (disk)
1  				rows processed

There was a lot of disk access, but I was distracted by the huge redo size. Why was a select statement generating so much redo?

My best guess was delayed block cleanout . A quick twitter survey added another vote for cleanouts and a suggestion to check for audits. Checking for audits was easy, we had no audit on selects. But how do I confirm whether or not I’m doing block cleanouts?

The tweeps suggested: “First trace it. Then you have concrete evidence.” Trace is often the performance tuning tool of choice, since it shows very detailed information about the queries that run and their wait events. But trace is not the perfect tool for everything – “delayed block cleanout” is neither a wait event nor a query, so this information will not show up in a trace file.
Oracle keeps track of cleanouts using “delayed block cleanout” statistics. V$SESSTAT has this information, but it is cumulative. I will have to take a snapshot of V$SESSTAT before and after each query. If only there was a tool that would make it easier…

Snapper to the rescue!

I ran Snapper for snapshots every 5 seconds during the time the scripts ran. Why not use Snapper’s multi-snapshot ability? Because it only displays the information after the run, and my ADHD mind wanted constant feedback like I got from top and vmstat.

 sqlplus / as sysdba @snapper ash,stats 5 1 2021

And I soon caught the following evidence:

    SID, USERNAME  , TYPE, STATISTIC                                                 ,     HDELTA, HDELTA/SEC,    %TIME, GRAPH
   2021, SYS       , STAT, physical read requests optimized                          ,      8.53k,      1.71k,
   2021, SYS       , STAT, physical read total bytes optimized                       ,     69.91M,     13.98M,
   2021, SYS       , STAT, redo entries                                              ,      8.53k,      1.71k,
   2021, SYS       , STAT, redo size                                                 ,     443.8k,     88.76k,
   2021, SYS       , STAT, redo entries for lost write detection                     ,      8.53k,      1.71k,
   2021, SYS       , STAT, redo size for lost write detection                        ,    443.61k,     88.72k,
   2021, SYS       , TIME, DB CPU                                                    ,    658.9ms,   131.78ms,    13.2%, |@@        |
   2021, SYS       , TIME, sql execute elapsed time                                  ,      4.01s,   801.59ms,    80.2%, |@@@@@@@@  |
   2021, SYS       , TIME, DB time                                                   ,      4.01s,   801.59ms,    80.2%, |@@@@@@@@  |
   2021, SYS       , WAIT, cell single block physical read                           ,      4.22s,   843.51ms,    84.4%, |@@@@@@@@@ |

Leaving out a lot of boring stuff, we can immediately see that we do not have any delayed block cleanout. Instead we have redo…for lost write detection.
What is this lost write detection? A quick Google query led me to a parameter that was added in version 10g: DB_LOST_WRITE_PROTECT . The documentation clearly said:
“When the parameter is set to TYPICAL on the primary database, the instance logs buffer cache reads for read-write tablespaces in the redo log, which is necessary for detection of lost writes.”

In the old cluster, the parameter was unset and defaulted to NONE. In Exadata X2-2, the default is TYPICAL. Maybe Oracle knows something that we don’t about write safety on Exadata? Regardless, when we set the parameter to NONE, the results were very different:

0  			recursive calls
0  			db block gets
12153  		        consistent gets
10808  		        physical reads
0  			redo size
552  		        bytes sent via SQL*Net to client
520  		        bytes received via SQL*Net from client
2  			SQL*Net roundtrips to/from client
0  			sorts (memory)
0  			sorts (disk)
1  			rows processed

We won! Wait…did we?

Understanding and getting rid of the mysterious REDO is great. But in follow up tests, we saw that every time the query had to touch disk, Exadata was still significantly slower than the cranky old SAN we used in the old cluster.

Let’s go back to Snapper.

Snapper output showed that 84.4% of the time was spent on “cell single block physical read”, and the statistics show that it is all “optimized” – meaning read from Exadata’s flash cache. It’s nice that the reads are fast, but why do we use “cell single block physical read”?

I turned to the execution plans, incidentally the one piece of information Snapper needed for troubleshooting that Snapper doesn’t show you.

Here’s the plan in the old cluster:

| Id  | Operation              | Name                    | Rows  | Cost (%CPU)| Time     | Pstart| Pstop |
|   0 | SELECT STATEMENT       |                         |     1 |     5   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE        |                         |     1 |            |          |       |       |
|   2 |   PARTITION LIST ALL   |                         |  1562 |     5   (0)| 00:00:01 |     1 |    12 |
|   3 |    INDEX FAST FULL SCAN| IDX_XXXX                |  1562 |     5   (0)| 00:00:01 |     1 |    12 |

And in the new:

| Id  | Operation           | Name               | Rows  | Cost (%CPU)| Time | Pstart| Pstop |
|   0 | SELECT STATEMENT    |                    |     1 |     0   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE     |                    |     1 |            |          |       |       |
|   2 |   PARTITION LIST ALL|                    |  1562 |     0   (0)| 00:00:01 |     1 |    12 |
|   3 |    INDEX FULL SCAN  | IDX_XXX            |  1562 |     0   (0)| 00:00:01 |     1 |    12 |

Do you see the difference? Are you surprised to learn that Exadata will apply smart scans on INDEX FAST FULL SCAN, but not on INDEX FULL SCAN?

Why did Oracle choose INDEX FULL SCAN instead of FAST FULL SCAN?

There could be many reasons: Maybe the index is not unique or allows nulls. Or maybe our optimizer parameters make single block access too attractive. Or maybe we have bad statistics.

Since it’s a short list, I eliminated the options one by one. Primary key exists, and optimizer parameters were all default. Table statistics was fine, so I was almost fooled. But index statistics indicated zero rows!

After gathering statistics, things looked better:

| Id  | Operation                      | Name                    | Rows  | Cost(%CPU)| Time     | Pstart| Pstop  |
|   0 | SELECT STATEMENT               |                         |     1 |     5   (0)| 00:00:01 |       |       |
|   1 |  SORT AGGREGATE                |                         |     1 |            |          |       |       |
|   2 |   PARTITION LIST ALL           |                         |  1562 |     5   (0)| 00:00:01 |     1 |    12 |
|   3 |    INDEX STORAGE FAST FULL SCAN| IDX_XXX		 |  1562 |     5   (0)| 00:00:01 |     1 |    12 |

The customer confirmed: “Shocking speed lol”

Lessons Learned:

  1. Avoid premature guesses – You learn much more by getting information and analysing the facts.
  2. Snapper shows both wait events and session statistics, allowing deeper troubleshooting than just tracing.
  3. Snapper is also easy to use, making it my preferred performance troubleshooting tool.
  4. Snapper would be perfect if it also retrieved query plans.
  5. 90% of performance problems are bad statistics, even if the symptoms look more exotic at first.
  6. Oracle keeps adding new behaviors, and no one can know when something new will show up. Thank you for keeping my job exciting.


Want to talk with an expert? Schedule a call with our team to get the conversation started.

10 Comments. Leave new

Select statement generating redo – and lost write detection | Tanel Poder's blog: IT & Mobile for Geeks and Pros
October 31, 2012 7:09 am

[…] Shapira has written a nice summary of a problem case where the classic wait interface based troubleshooting method is not always […]


Hi Gwen,

When you say “In Exadata X2-2 the default is TYPICAL” – DB_BLOCK_CHECKING is a database-level parameter so how does the database being on an X2-2 influence the default?

Oracle Sun Database Machine Setup/Configuration Best Practices [ID 1274318.1] suggests a value of FALSE, without limiting it to a particular Exadata revision.
However, it does also say “Refer to MOS Note 1302539.1.”. Which in turns seems to recommend, a roundabout way, setting it to FULL.
I raised an SR about this seemingly confusing advice and the end result was – and I’m paraphrasing here – “Set it to match your requirements and test the performance impact”.


Hi Tim,

First, note that DB_LOST_WRITE_PROTECT is different feature from DB_BLOCK_CHECKING. Only one of them causes extra redo :)

Exadata has its own DBCA templates and DB_LOST_WRITE_PROTECT=TYPICAL is in there. The templates don’t always match the MOS notes 100%, I noticed.

Set to match requirements is always a good idea!
I’d say, if you don’t have performance problems, leaving it at TYPICAL is just fine.

Link sql_id and oracle stats « bdt's oracle blog
November 8, 2012 2:28 pm

[…] You could also use the well known snapper script as Gwen Shapira who used it on a real practical case in this post. […]


Three things strike me as odd. Shouldn’t dynamic sampling help address the lack of index statistics…assuming that is “on” of course?

Looking at the plans…their cost and time columns seem almost identical. A 5% CPU usage measurement shouldn’t kill the performance of the query?

Are indexes relevant in the context of Exadata and this kind of query? Flash cache everything especially for a query doing a full table scan, then let’r rip…who needs indexes?

…just curious.


Dynamic Sampling was not activated. I don’t remember if it was completely disabled, or whether the combination of table statistics existing and index statistics missing caused it not to be activated.

I know it didn’t kick in because when it does, it appears in the “notes” section of the plan. But I didn’t check why it didn’t kick in.

The test itself consisted of large number of queries running in succession. None of the queries had a huge difference between Exadata time and 10.2 cluster times, but it accumulated into something noticeable. Combine this with the expectation that Exadata will leave the old cluster in the dust, and you got very puzzled DBAs.

Note that after we collected stats and got the fast full scan, it was offloaded to storage cells and now the Exadata was significantly faster than the old cluster. Just as we expected.

The questions of whether indexes are relevant, how Flash Cache is related (hint: for full scan, it actually isn’t) – its a complex topic and deserves it own (rather long) blog post. The answer meanwhile is “it depends” – but the application I was testing there, absolutely needed indexes.


To put a finer point on items 5 and 6 of “lessons learned”, I would suggest that some aspects of the Exadata optimizer are still immature and very much a work in progress. As you allude to it’s often exciting when dealing with Oracle in this context and your mileage will vary, in unexpected ways, when trying to drive performance through Exadata.

When I consulted on a large conversion moving from v6 (of the rules based engine) to v7 (of the new “intelligently” optimized Oracle engine…rules based SQL was now no longer needed…) it was a disaster. Upon migration the performance of the database immediately and unexpectedly went into the dumpster. We quickly realized that we had better turn off this newly introduced Optimizer thing. It took the optimizer team at Oracle about 24 months to iron things out and in v7.3 the optimizer was finally stable and useable.

To a somewhat lesser extent, I think we’re seeing the same thing manifest here. The hardware is way ahead of the software. Exadata is a vastly different architecture and the Oracle optimizer, arguably one of the most complex pieces of the engine, must be “newly” ported to this hardware. That porting process takes time because of the inherent complexity of the optimizer itself and the added nuances of the Exadata architecture. Thus, I think, sub optimal plans sometimes manifest when it’s not clear why or that they should appear this way at all….exciting indeed.


Completely agree, Paul.

Except to point out that in this particular case, Exadata is not at fault. Every 11gR2 upgrade and failure to collect stats would have resulted in the same problem. Exadata just allowed an extra speed boost by offloading the fast-scan plan.


Leave a Reply

Your email address will not be published. Required fields are marked *