I made this blog title intentionally provocative. However, it isn’t far from the truth. Don’t believe me? Continue reading. :)
Emergency call
I came up with the graph below to demonstrate the problem. Yesterday I was called in on an emergency to help a desperate client get their Oracle E-Business Suite system’s performance under control. After several hours of research, we found the root cause of the problem. The graph shows direct reads as a percentage of all physical reads. You could say that looking at aggregated data and percentages is bad style and a waste of time. Well, I will leave that for you to decide. I think the graph demonstrates the problem very well, and that is all that counts.
You do not need to be a scientist to tell when the system was upgraded to version 11.2.0.3. It is kind of obvious from the graph, right? Still, it took us a bit of time to find where the problem was. To save you some time, I put together a SQL script that you can use to build a graph just like this one.
Did you migrate your Oracle database to 11.2.0.2 or a higher version recently? “Are you ready to see how deep the rabbit hole goes?” (c) The Matrix. Here you go. :)
direct_reads_11g_problem.sql
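The script itself is linked above. As a rough sketch of the idea behind it (not the original script; it assumes a single instance and Diagnostics Pack licensing, and the column aliases simply echo the ones I mention in the comments), the percentage can be derived from AWR like this:

-- Sketch only: per-snapshot deltas of 'physical reads direct'
-- as a percentage of 'physical reads'. Counters are cumulative,
-- so deltas spanning an instance restart are garbage.
select h_date, pyh_reads, dir_reads,
       round(100 * dir_reads / nullif(pyh_reads, 0), 1) r_pct
from  (select s.begin_interval_time                                h_date,
              phy.value - lag(phy.value) over (order by s.snap_id) pyh_reads,
              dir.value - lag(dir.value) over (order by s.snap_id) dir_reads
       from   dba_hist_snapshot s
              join dba_hist_sysstat phy
                on  phy.snap_id = s.snap_id and phy.dbid = s.dbid
                and phy.instance_number = s.instance_number
                and phy.stat_name = 'physical reads'
              join dba_hist_sysstat dir
                on  dir.snap_id = s.snap_id and dir.dbid = s.dbid
                and dir.instance_number = s.instance_number
                and dir.stat_name = 'physical reads direct')
order by h_date;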
If you see a significant change in the pattern after an upgrade to 11G, you may be interested in the next set of statistics.
The blue line represents the Physical Reads statistic. The red line is Direct Reads. Did you notice that the physical read volume increased from 100MB/s to 400MB/s? Well, my client didn’t just notice it. An expensive storage array’s resources got saturated and the whole user community was paralyzed for several days. For a good portion of the first day, the client worked with the storage vendor, as all other Oracle IOs (e.g. log file sync, db file sequential read, db file scattered read, etc.) became very slow. Many hours of work with Oracle Support followed. After running out of ideas, the client called us. If we had had the SQL handy, we would have avoided a day of troubleshooting and focused our efforts on how to fix the issue. Keep the script handy. Who knows, it may save you some troubleshooting effort.
Another recently migrated client
Following up on the emergency call’s results, and out of curiosity, I decided to check another client’s environment that was recently migrated to 11.2.0.3. Have a look. I don’t think I need to say much.
And the Direct Reads contribution to total Physical Reads graph for the same database looks like the following:
Quick fix
We are still discussing how to resolve the issue for good. The following two actions have been implemented as a quick fix to give a bit of relief to the system’s user community.
- Disable the new Direct Reads enhancement introduced in 11.2.0.2. Google the event number and you will find several related blog posts, including Tanel’s blog post here.
SQL> alter system set events '10949 trace name context forever';
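Note that this form affects only the running instance. If the workaround has to survive a restart, the event can also be recorded in the spfile (a sketch; adjust to your change-control process):

SQL> alter system set event='10949 trace name context forever' scope=spfile;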
- Switch off Automatic Memory Management [see my comment below]. IMHO, it should be switched off for any serious Oracle database. It introduces more problems than it adds value.
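For reference, fully switching off automatic SGA management and sizing the main pools by hand looks roughly like this (a sketch; the 38G/5G values are the ones from this client’s case, quoted in the comments below, and an instance restart is required):

SQL> alter system set sga_target=0 scope=spfile;
SQL> alter system set db_cache_size=38G scope=spfile;
SQL> alter system set shared_pool_size=5G scope=spfile;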
High 'direct path read' waits in 11g [ID 793845.1]
NOTES:
- Be aware that the script is AWR-based. Therefore, the database needs a Diagnostics Pack licence in order to use it
- If you don’t have the Diagnostics Pack licence, just take the idea and convert the script to STATSPACK. (Don’t forget to share it with me and others!) :)
- Please do not hesitate to drop me a message if you spot any mistakes (I know I do those all the time) or have related comments.
Let me know if the script I have introduced to you in this blog post helped you to spot something interesting in your system.
======================================================
Added by Yury on 2012.11.24. As a result of internal discussion with Pythian folks (special thanks to Marc Billette):
Additional Comments related to this blog post
- % of DPR doesn’t mean anything. A high number could be a good or bad thing for different systems. The answer, as always: It depends. :)
- However, in my clients’ case, the DPR increase was accompanied by SIGNIFICANT physical IO increase (see my second graph).
- As the change in behavior happened right after the 11G upgrade, it was enough to point us to the temporary fix.
- A single serial DPR is more efficient (cheap from a resource point of view) than a traditional Oracle physical read (db file sequential read or db file scattered read).
- Given that a process MUST read X physical blocks and none of the table’s blocks are cached at the moment, DPR will be faster and will use fewer resources on the server.
- However, DPR doesn’t cache anything. Therefore, no other process can share the work that the process did.
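- To see how much of a particular session’s physical I/O is direct, the session-level counters can be compared (a sketch; 123 is a placeholder SID):

select n.name, s.value
from   v$sesstat s
       join v$statname n on n.statistic# = s.statistic#
where  s.sid = 123  -- placeholder SID
and    n.name in ('physical reads', 'physical reads direct');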
On Exadata
- Exadata is designed to do a lot of DPR, to offload data processing to the cell hosts.
- In fact, I think the new serial direct reads decision mechanism was implemented because of, and for, Exadata (just guessing here).
- High DPR is just an indication that an environment doesn’t use the buffer cache. That’s all. Is it good or bad? As always, it depends. :)
Comments
Also a great example of the benefits of visualizing your performance metrics.
+1 to disabling automatic memory management. Other than very small, inconsequential, test databases, I have yet to see one instance where it is worth even considering!
The thing that has to be thought through is this: automating this-or-that does not immediately imply that the target will be ideally tuned!
“Automatic management” is *NOT* a synonym for “automatically tuned”! Far from it.
Automation takes some time to recognize a change in conditions that warrants a change in parameters. In turn, that change of parameters may involve extensive reorganization of the memory and its consequent use by other sections. That change may in turn cross other boundaries that trigger further automation changes, with the consequence that the system may easily end up chasing its tail, trying to self-tune for a situation that *might* have been relevant hours ago!
Add to that an application that works on daily cycles, like e-business suite, and you’ve got a test case for wild oscillations.
In electronics, this situation is called a self-oscillation: a condition where control feedback in a closed system reaches and affects the input processing waaaaay after it should have done so. And the system promptly enters into potentially destructive oscillation. I say potentially destructive because that might be the intention: oscillators and wave generators work exactly on such – self-oscillation through in-phase feedback, also called positive feedback.
As opposed to out-of-phase feedback which acts to control a system and not let it oscillate outside of a small range.
Unfortunately, most current “automated” software setups seem to suffer from their developers’ total lack of knowledge of these basic engineering principles. ANY closed system is susceptible to oscillation through feedback, and software is certainly no exception.
I have yet to see ANY of my 11.2.0.3 databases have ANY I/O problems, quite the contrary in fact. But then again, there is not a single one of them that is using automatic memory management.
As rightly mentioned by Martin Berger (@martinberx) on Twitter https://bit.ly/RT5f1l, AMM wasn’t really disabled.
We just set minimum limits of shared_pool_size=5G and db_cache_size=38G; as sga_target was still set to a higher value, AMM was still enabled.
On the other hand, it looks like we could have implemented just one of the two changes to fix the issue, as:
— setting the event disables the Direct Reads feature introduced in 11.2.0.2
— increasing db_cache_size raises the default threshold value (2% of the buffer cache) for small tables
Well, in this case the team didn’t want to take any chances, and therefore both changes found their way into prod.
Yury
PS Now we are advising/planning to disable AMM in this environment for good in a few weeks’ time :)
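PPS For the curious: the threshold in question is the hidden parameter _small_table_threshold. As SYS it can be inspected like this (undocumented x$ views; a sketch, use at your own risk):

select i.ksppinm name, v.ksppstvl value
from   x$ksppi i, x$ksppcv v
where  i.indx = v.indx
and    i.ksppinm = '_small_table_threshold';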
Hi Yury,
This post was informative, great troubleshooting. Just a few additions:
ASMM enabled in a large Oracle Apps environment indicates that the ERP runs as a vanilla installation; on top of it there would be tonnes of application customizations that will confuse ASMM or mislead it with an ever-changing workload. It is never recommended in prod. You should take the memory advisories’ suggestions and set the SGA manually; note that a non-zero sga_target will override the manual memory settings, so sga_target=0 will be good :)
Wherever there are such vanilla instances running, the below is one common mistake that goes unnoticed:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In most cases there would be GATHER_STATS_JOB (the default in 10g and above) enabled instead of FND_STATS; if that is running on a vanilla installation, the performance degrades again.
If the above is the case, then it should be disabled using the command below:
exec DBMS_SCHEDULER.DISABLE('GATHER_STATS_JOB');
Below is the concurrent program that should be enabled instead:
Name of the Concurrent / DBMS job: Gather Schema Statistics
Executable Name/short Name: FNDGSCST
Schedule / Submit Interval: From start of prior run.
Parameter Reference:
Schema Name=ALL
Estimate Percent=10
Degree=3
Backup Flag=NoBackup
History Mode=None
Gather Option=Gather
Keep other parameters as defaulted.
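The same job can also be kicked off directly from SQL*Plus as the APPS user. A sketch mirroring the parameters above ('ALL' schemas, 10% estimate, degree 3); verify the FND_STATS.GATHER_SCHEMA_STATS signature against your EBS release:

SQL> exec FND_STATS.GATHER_SCHEMA_STATS('ALL', 10, 3);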
Good point, Ajith, regarding the vanilla EBS instance vs. EBS with customizations. I will use it in future discussions with my clients. Thank you for the hint.
Hi, thanks for sharing. Can you please share how you created these graphs out of the SQL you shared?
Take the h_date column as the horizontal axis and pyh_reads/dir_reads OR R_PCT OR pyh_mb_s/dir_mb_s as the source for the vertical axis. You should be good figuring out the rest.
Nice article, thank you for your effort.
Maybe not necessary to disable AMM, Yury. Pegging the minimal amount of db cache and the minimal amount of shared memory will ensure things don’t oscillate violently between too-wide values. You’re simply forcing the AMM not to go berserk and cause other issues. “Putting on the brakes”, so to speak.
I think you guys went the right way with these changes, although likely I wouldn’t have entirely disabled the direct reads – they might serve a good purpose in some cases. Putting the “brakes” on AMM should be enough to keep things under control. Definitely keep us posted on the results once you turn AMM off completely and any observations after that.
I agree, “not necessary to disable AMM”; setting minimum values for the key memory pools can help. Also, always set up the KEEP or RECYCLE pool to an appropriate size; they will not be adjusted by AMM, and not all memory pools are automatically resized by AMM anyway.
What I actually hate about AMM is that I can’t lock the SGA to prevent it from being paged out to disk; in rare situations that could be handy. Honestly, I never use this in production, but I like to try it in QA, for instance.
I like to enable AMM; the potential performance gains from AMM are significant, but everything depends on the workload. I recently disabled AMM on one instance, simply because the workload contained a huge number of sort- and read-intensive operations, causing memory to shift between SGA and PGA very frequently, with a single session taking a huge amount of memory. I got fed up with this and disabled AMM.
In my opinion, it is not a bad idea to enable AMM, as the potential performance gain from AMM is significant, but monitoring and setting minimum sizes for certain memory areas are extremely important.
By the way, compatibility could be an issue, like AMM and Linux HugePages, …
So, lots of pluses and minuses; nothing is perfect, but I don’t agree that AMM is useless.
Andjelko Miovcic
It’s a feature that I’m unconvinced about.
I understand the benefit that it’s meant to deliver, but in reality I’ve seen it cause more problems than it resolves (but that’s because we get called about the problems, right?).
+1 for turning off AMM regardless. It probably works well on perfect systems but who’d know?
+1 for increasing the size of the buffer cache where possible. That’s my advice.
The last significant upgrade to 11gR2 that I was involved in included multiple testing cycles and eventually led to a 4x increase in the buffer cache – not just because of direct path reads – but that increase meant direct path was generally fine when used.
This should really be picked up by proper testing cycles in advance of migration.
Similar story from Coskan:
https://coskan.wordpress.com/2011/08/15/what-happens-when-you-did-not-do-a-load-test-before-release-upgrades/
Of course, the other factor of the feature is the “adaptive” part of “adaptive serial direct reads” where the behaviour can adapt to what is actually cached.
I blogged observations on this in a prod system where the plan didn’t change, the buffer cache size didn’t change, just the amount of data cached (probably) changed:
https://orastory.wordpress.com/2012/06/19/flippin-witch-hunt-adaptive-direct-path-read/
Thanks, Dominic, for reading, commenting, and sharing good information sources with me and the rest of the community. I have added your and Coskan’s blogs to my Google Reader.
I’ve seen this happen in two large databases that have been upgraded to 11GR2.
What seems to happen is that memory is very heavily allocated to the PGA and you end up with a very small SGA as a result. Hence you end up with ever increasing Direct Reads.
You can set SGA_TARGET and PGA_AGGREGATE_TARGET (which then act as minimum values) so that you leave only a small portion of your MEMORY_TARGET for the database to manage automatically.
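As a sketch of that approach (placeholder values; with MEMORY_TARGET set, these act as floors rather than hard sizes):

SQL> alter system set sga_target=8G scope=both;
SQL> alter system set pga_aggregate_target=4G scope=both;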
I have also seen a suggestion (which I have not tested yet) that setting up Dead Connection Detection will mitigate this problem. The reasoning is that orphaned sessions that consumed a lot of PGA memory are not being cleaned up, so you end up with more memory being allocated to the PGA over time, unnecessarily.
Hey Paul. As I work with EBS environments, and those tend to be relatively important to the clients (read: they get a bit more DBA time than others), I don’t use MEMORY_TARGET at all and probably will not in the near future. One of the disadvantages of MEMORY_TARGET is that you can’t use Huge Pages in such a configuration. IMHO, for SGAs over 5-10GB this starts to impact performance significantly.
I never saw anything with reads; however, I did find that Oracle took way more memory than it was assigned in 11.2+. Even with memory_target set to around 500M, the processes for the database would spill well over that value.
Turning AMM off didn’t fix the issue.
This was on AIX 6.
https://blog.vishalgupta.com/2011/08/19/direct-path-reads-11g-changed-behaviour/
In our customer’s place the Oracle version is 11.2.0.2. The error ‘program unit not found’ frequently appears when our APIs are invoked. We observed that some objects are not compiled, and when invoked they get compiled automatically. There has been no change to user objects. What could be the reason? Please help.
The issue with KEEP pool leading to direct I/O was killing the performance for us.
Due to this bug, tables with size >10% of the cache size were being treated as ‘large tables’ for reads, and this resulted in execution of the new SERIAL_DIRECT_READ path in 11g.
KEEP BUFFER POOL Does Not Work for Large Objects on 11g [ID 1081553.1]
Hi Yury,
Thanks for your article! I was curious to try your case on my freshly migrated (11.2.0.3) environment. I saw a completely opposite picture: direct path reads fell by a factor of 3. All memory parameters are tuned manually, though. That could be the reason.
This makes for an interesting read.
I’m wondering if the effect of the “direct path” reads is more profound on systems where the O/S file system mount point options have been “tuned” to minimise O/S file system cache usage.
That would provide another penalty in performance. Not only is the work not shared at the DB level cache, but it wouldn’t be shared at the O/S file system cache either.
Maybe Oracle whitepapers should be rewritten to *not* suggest adjusting the O/S file system mount point settings in relation to the file system cache if Oracle is going to use more “direct path” reads.
Hi Yury,
We also encountered this issue. An application has a medium-size table (800 MB) that gets accessed through a full table scan approx 1,000 times per day (an application design issue). With db_cache_size set to 7 GB, this full table scan performs zero disk reads and takes 0.5–0.6 seconds. As records are added, the table grows, and at some point it exceeded 5 * _small_table_threshold and switched to direct reads. So overnight, performance deteriorated from 0.5 seconds to 30 seconds per execution. This feels even worse than the post-upgrade problem: at least one expects problems after an upgrade, but who expects problems as a result of a table growing from 800 MB to 810 MB? If Oracle felt the need to “promote” direct reads, it should have been done gradually, without changing default behaviour.
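A stopgap in a case like this is the same event shown in the post, scoped to just the affected sessions, e.g. via a logon trigger (a sketch; it forces buffered reads only for those sessions):

SQL> alter session set events '10949 trace name context forever';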
Regarding disabling AMM: there are some reasons for keeping automatic memory management enabled. Consider Oracle 10.2, which has ASMM; it is enabled when SGA_TARGET > 0.
Reasons for keeping ASMM enabled:
– Ability to add memory to the SGA dynamically. On some platforms, like AIX, if SGA_TARGET < SGA_MAX_SIZE, Oracle allocates memory according to SGA_TARGET.
Then, if memory is dynamically added to the LPAR, we can dynamically add memory to the SGA.
– Flexibility: a DBA may configure the SGA for a given workload, but workloads tend to change. In a perfect world the DBA will have an opportunity to re-configure the SGA, but that doesn't always happen. For example, we have databases with a large number of connections (5,000). These connections go through shared server and use the Large Pool. Sometimes there is a spike in the number of connections; with ASMM enabled, the Large Pool grows accordingly.
– Memory leaks inside the SGA: Oracle has bugs that cause memory leaks inside the SGA, typically in the shared pool (in one sub-pool of the shared pool). For example, 9.2 had a leak that was triggered by TRUNCATE or DROP operations and caused growth of the 'miscellaneous' sub-heap of the pool. 11.1 has a leak that causes growth of CCursor. 11.2 has a leak that causes growth of the PRTMV sub-heap.
By enabling automatic re-configuration of SGA we give ourselves more time for dealing with these leaks.
Did anyone come across “direct path write” waits after a DB upgrade to 11.2.0.4? It is killing our database performance: huge I/O waits, and our response times have increased a lot.
Please refer to MOS “ASMM versus AMM and LINUX x86-64 Hugepages Support (Doc ID 1134002.1)” posted on 19-Dec-2014 for clarification.
Thanks, friend, that solved a real problem… Oracle was spinning on SSD disks for nothing. That was the trick, and it immediately solved the problem.
Hello Yuri,
I have a table in KEEP pool – the table is partitioned by a “month” field – so we usually access only the “current month” data. The KEEP pool is also sized accordingly to be able to accommodate one month’s data.
Now there is a SQL that accesses all months’ data, and I want it to do direct path reads, but it does “db file scattered read” instead – probably due to the KEEP pool setting. Short of assigning each partition individually to the KEEP pool (and updating that every month), is there any way to FORCE direct path reads for a table in the KEEP pool?