What's eating my database? Where to focus performance efforts using Oracle AWR

Jared Still

September 6, 2016

No doubt you have at times been on the DBA side of this conversation, or one much like it. Help Desk: We have reports that the database is slow DBA: Really? OK, tell me when it was slow and I will investigate. Help Desk: Well, we don't really know just when that was. Several people have called and reported that at times it has been nearly unresponsive. The problem is not consistent and they don't know what time that was. DBA: Umm... OK, I will see what I can do. Or perhaps you take care of databases for third parties, and you receive similar reports, again with no exact times, or any indication of just how 'slow' the database was. At times like this you just need to find the periods of highest activity and start looking there. But, how to determine the periods of highest activity? Traditional methods would include looking at CPU, load and memory statistics from sar or other sources. Perhaps a better method however is just to let the Oracle database tell you where to concentrate your efforts. The data is all available in AWR, all that is necessary is to go find it. The gold standard for determining database activity is the Average Active Session (AAS), a metric that appeared when Oracle introduced ASH and AWR as performance monitoring tools. AAS measures just how active the database is , and is defined as "number of sessions that are either working or waiting" at a point in time. Armed with that information you can now dig in to AWR and find the most active times from an AAS perspective Initial efforts to find the top 10 most active periods in a database looked like this, where the following SQL fragment is used rank AWR snapshots by the maximum AAS value, and return the 10 most active snapshots. When doing performance analysis via AWR it is a good idea to preserve the areas of interest via AWR baselines. Doing so allows querying AWR snapshots that might otherwise be deleted due to AWR retention times. [code language="sql"] --aas-1.sql with aas as ( select snap_id , value , dbid , instance_number , rownum my_rownum from ( select distinct snap_id, dbid, instance_number, max(value) over (partition by snap_id order by snap_id) value from dba_hist_sysmetric_history where metric_name = 'Average Active Sessions' order by value desc ) ), top10aas as ( select sn.begin_interval_time begin_time , aas.instance_number , aas.snap_id , aas.value , aas.dbid from aas join dba_hist_snapshot sn on sn.snap_id = aas.snap_id and sn.instance_number = aas.instance_number where my_rownum <= 10 order by value desc ) select * from top10aas [/code] That seemed to work well until some apparently spurious values showed up. A server having 32 cores was seen to have an AAS of 3000+ during a period of time in one AWR snapshot. As there was no other evidence (CPU, Memory, IO) to support this, it seemed it was likely in error. So rather than rely on AAS, a different metric was chosen. The number of sessions per snapshot that are on CPU, or waiting on some non-idle event are now being counted as db_time. The ratio of db_time:elapsed_time is now being considered to determined the most active periods. [code language="sql"] --aas-2.sql with aas as ( select snap_id, value, dbid, instance_number, rownum my_rownum from ( select snap_id , round(db_time / elapsed_time,1) value , dbid , instance_number from ( select distinct h.snap_id , h.instance_number , h.dbid , count(*) over (partition by h.snap_id, h.instance_number) * 10 db_time , (extract( day from (s.end_interval_time - s.begin_interval_time) )*24*60*60)+ (extract( hour from (s.end_interval_time - s.begin_interval_time) )*60*60)+ (extract( minute from (s.end_interval_time - s.begin_interval_time) )*60)+ (extract( second from (s.end_interval_time - s.begin_interval_time))) elapsed_time from dba_hist_active_sess_history h join dba_hist_snapshot s on s.snap_id = h.snap_id and s.instance_number = h.instance_number where ( wait_class is null -- on CPU or wait_class != 'Idle' -- ignore idle events ) ) order by 2 desc ) ), top10aas as ( select sn.begin_interval_time begin_time , aas.instance_number , aas.snap_id , aas.value , aas.dbid from aas join dba_hist_snapshot sn on sn.snap_id = aas.snap_id and sn.instance_number = aas.instance_number where my_rownum <= 10 order by value desc ) select * from top10aas [/code] In hindsight it may be a good idea to rename sub-factored queries from top10aas and similar to topactivity, or something like that. On the other hand, if no renaming takes place, the methods can be switched back and forth without affecting any dependent scripts. Here are examples from an active test database. Some of the same periods appear in each report, albeit in different order. The first example is using the query on dba_hist_sysmetric_history: [code language="text"] SQL> @aas-1 BEGIN_TIME INSTANCE_NUMBER SNAP_ID VALUE DBID ------------------------------ --------------- ---------- ---------- ---------- 15-JUL-16 09.00.43.624 AM 1 18087 294.076413 4107786006 21-JUL-16 11.00.59.752 AM 1 18233 126.210386 4107786006 09-AUG-16 05.00.40.556 AM 1 18683 77.0418398 4107786006 06-JUL-16 05.00.33.517 AM 1 17867 75.36815 4107786006 24-JUL-16 09.00.35.396 AM 1 18303 75.131809 4107786006 06-AUG-16 04.00.42.262 AM 1 18610 56.5065604 4107786006 17-AUG-16 05.00.20.844 AM 1 18875 54.1554518 4107786006 26-AUG-16 04.00.23.145 AM 1 19090 53.2953573 4107786006 09-JUL-16 05.00.37.146 AM 1 17939 51.2031095 4107786006 15-JUL-16 04.00.28.880 AM 1 18082 50.1482053 4107786006 10 rows selected. [/code] Next are the results of the query that is considering db_time: [code language="text"] SQL> @aas-2 BEGIN_TIME INSTANCE_NUMBER SNAP_ID VALUE DBID ------------------------------ --------------- ---------- ---------- ---------- 12-AUG-16 01.00.52.943 AM 1 18751 27.7 4107786006 28-JUL-16 01.00.28.734 AM 1 18391 26.4 4107786006 05-AUG-16 01.00.43.076 AM 1 18583 25.4 4107786006 19-AUG-16 01.00.51.324 AM 1 18919 23.9 4107786006 09-AUG-16 01.00.25.700 AM 1 18679 23 4107786006 15-JUL-16 01.00.15.981 AM 1 18079 22.5 4107786006 08-JUL-16 01.00.52.888 AM 1 17911 21.4 4107786006 26-AUG-16 01.00.09.871 AM 1 19087 21.4 4107786006 29-JUL-16 01.00.31.166 AM 1 18415 21.2 4107786006 24-JUL-16 10.00.44.598 AM 1 18304 20.2 4107786006 10 rows selected. [/code] You probably noticed there is not a 1:1 match of periods chosen between these two methods. Due to the difference in these metrics, there will never be a 1:1 match between the two methods, however both methods are valid and will find periods of high activity. Now that you can identify the top 10 activity periods, what is the next step? Perhaps you would like to run AWR/ADDM reports for these times, or drill down with some custom queries. However you proceed, it may be a good idea to create an AWR Baseline for each of the snapshots. Don't worry about cleaning up those baselines later, as you will see they can be set to automatically expire at a later date. The following SQL fragment is based on aas-2.sql as seen previously. This SQL fragment will be called by create-baselines.sql. Putting the top10 logic in the SQL fragment allows modifying how the top10 list is generated without making any changes to create-baselines.sql. [code language="SQL"] with aas as ( select begin_snap_id, end_snap_id, value, dbid, instance_number, rownum my_rownum from ( select begin_snap_id , lead(begin_snap_id,1) over (partition by dbid, instance_number order by begin_snap_id) end_snap_id , round(db_time / elapsed_time,1) value , dbid , instance_number from ( select distinct h.snap_id begin_snap_id , h.instance_number , h.dbid , count(*) over (partition by h.snap_id, h.instance_number) * 10 db_time , (extract( day from (s.end_interval_time - s.begin_interval_time) )*24*60*60)+ (extract( hour from (s.end_interval_time - s.begin_interval_time) )*60*60)+ (extract( minute from (s.end_interval_time - s.begin_interval_time) )*60)+ (extract( second from (s.end_interval_time - s.begin_interval_time))) elapsed_time from dba_hist_active_sess_history h join dba_hist_snapshot s on s.snap_id = h.snap_id and s.instance_number = h.instance_number and ( h.wait_class is null -- on CPU or h.wait_class != 'Idle' -- wait events - ignore idle events ) -- these predicates are useful if you have some idea of the date range -- -- most recent 5 days -- and s.end_interval_time > systimestamp - 5 -- -- or maybe a range -- and s.end_interval_time between timestamp '2016-08-01 00:00:01' and timestamp '2016-08-02 23:59:59' ) order by value desc ) ), top10 as ( select sn.begin_interval_time begin_time , aas.instance_number , aas.begin_snap_id , aas.end_snap_id , aas.value , aas.dbid from aas join dba_hist_snapshot sn on sn.snap_id = aas.begin_snap_id and sn.dbid = aas.dbid and sn.instance_number = aas.instance_number where my_rownum <= 10 order by value desc ) [/code] Now for the script to create the baselines. First we will see if there are already any baselines: [code language="SQL"] SQL> @show-awr-baselines select baseline_name , baseline_type , start_snap_id , to_char(start_snap_time,'yyyy-mm-dd hh24:mi:ss') start_snap_time , to_char(creation_time,'yyyy-mm-dd hh24:mi:ss') creation_time , to_char(creation_time+expiration,'yyyy-mm-dd hh24:mi:ss') expiration from dba_hist_baseline order by creation_time; START BASELINE_NAME BASELINE TYPE SNAP ID START_SNAP_TIME CREATION_TIME EXPIRATION -------------------------------------------------- --------------- --------- -------------------- ------------------- -------------------- SYSTEM_MOVING_WINDOW MOVING_WINDOW 32588 2016-08-22 13:00:15 2011-12-19 16:45:11 1 row selected. [/code] No baselines currently. Now run create-baselines.sql. First the code, and then the results [code language="SQL"] set serveroutput on size unlimited var n_expire_days number -- change to something sensible for real use exec :n_expire_days := 1 declare v_baseline_pfx varchar2(30) := 'dw'; -- tag as Data Warehouse v_baseline_name varchar2(128); i_expire_days integer := :n_expire_days; e_baseline_exists exception; pragma exception_init(e_baseline_exists, -13528); procedure p ( p_in varchar2) is begin dbms_output.put(p_in); end; procedure pl ( p_in varchar2) is begin p(p_in); dbms_output.put_line(null); end; begin dbms_output.put_line(lpad('=',30,'=')); for aasrec in ( @@top10 select begin_time, instance_number, begin_snap_id, end_snap_id, value, dbid from top10 ) loop pl(' begin_time: ' || aasrec.begin_time); pl('begin snap_id: ' || aasrec.begin_snap_id); pl(' end snap_id: ' || aasrec.end_snap_id); pl(' Metric Value: ' || aasrec.value); -- create the baselines -- catch errors if already exists begin v_baseline_name := v_baseline_pfx || '_' || to_char(aasrec.begin_snap_id) || '_' || to_char(aasrec.begin_time,'yyyymmdd-hh24:mi:ss'); pl('Baseline Name: ' || v_baseline_name); --/* dbms_workload_repository.create_baseline( start_snap_id => aasrec.begin_snap_id, end_snap_id => aasrec.end_snap_id, baseline_name => v_baseline_name, dbid => aasrec.dbid, expiration => i_expire_days ); --*/ exception when e_baseline_exists then pl('!!Baseline ' || v_baseline_name || ' already exists'); when others then raise; end; pl(lpad('=',30,'=')); end loop; end; / [/code] And now the output of the PL/SQL procedure: [code language="text"] Elapsed: 00:00:00.01 ============================== begin_time: 27-JUL-16 08.00.56.132 PM begin snap_id: 31972 end snap_id: 31973 Metric Value: .5 Baseline Name: dw_31972_20160727-20:00:56 ============================== begin_time: 23-JUL-16 08.00.10.334 AM begin snap_id: 31888 end snap_id: 31889 Metric Value: .4 Baseline Name: dw_31888_20160723-08:00:10 ============================== begin_time: 15-AUG-16 08.00.18.055 PM begin snap_id: 32428 end snap_id: 32429 Metric Value: .4 Baseline Name: dw_32428_20160815-20:00:18 ============================== begin_time: 16-AUG-16 02.00.33.705 PM begin snap_id: 32446 end snap_id: 32447 Metric Value: .3 Baseline Name: dw_32446_20160816-14:00:33 ============================== begin_time: 18-AUG-16 08.00.17.095 PM begin snap_id: 32500 end snap_id: 32501 Metric Value: .3 Baseline Name: dw_32500_20160818-20:00:17 ============================== begin_time: 19-AUG-16 12.00.13.883 AM begin snap_id: 32504 end snap_id: 32505 Metric Value: .3 Baseline Name: dw_32504_20160819-00:00:13 ============================== begin_time: 09-AUG-16 08.00.15.699 PM begin snap_id: 32284 end snap_id: 32285 Metric Value: .3 Baseline Name: dw_32284_20160809-20:00:15 ============================== begin_time: 21-AUG-16 04.00.30.241 AM begin snap_id: 32556 end snap_id: 32557 Metric Value: .3 Baseline Name: dw_32556_20160821-04:00:30 ============================== begin_time: 24-JUL-16 01.00.11.481 PM begin snap_id: 31917 end snap_id: 31918 Metric Value: .3 Baseline Name: dw_31917_20160724-13:00:11 ============================== begin_time: 25-JUL-16 06.00.12.365 PM begin snap_id: 31946 end snap_id: 31947 Metric Value: .3 Baseline Name: dw_31946_20160725-18:00:12 ============================== PL/SQL procedure successfully completed. [/code] Another run of show-awr-baselines.sql to confirm: [code language="SQL"] SQL> @show-awr-baselines START BASELINE_NAME BASELINE TYPE SNAP ID START_SNAP_TIME CREATION_TIME EXPIRATION --------------------------- --------------- --------- -------------------- ------------------- -------------------- SYSTEM_MOVING_WINDOW MOVING_WINDOW 32588 2016-08-22 13:00:15 2011-12-19 16:45:11 dw_31888_20160723-08:00:10 STATIC 31888 2016-07-23 09:00:24 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_31946_20160725-18:00:12 STATIC 31946 2016-07-25 19:00:26 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_32446_20160816-14:00:33 STATIC 32446 2016-08-16 15:00:47 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_32504_20160819-00:00:13 STATIC 32504 2016-08-19 01:00:28 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_31917_20160724-13:00:11 STATIC 31917 2016-07-24 14:00:26 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_31972_20160727-20:00:56 STATIC 31972 2016-07-27 21:00:10 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_32428_20160815-20:00:18 STATIC 32428 2016-08-15 21:00:32 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_32500_20160818-20:00:17 STATIC 32500 2016-08-18 21:00:31 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_32284_20160809-20:00:15 STATIC 32284 2016-08-09 21:00:29 2016-08-30 12:45:03 2016-08-31 12:45:03 dw_32556_20160821-04:00:30 STATIC 32556 2016-08-21 05:00:44 2016-08-30 12:45:03 2016-08-31 12:45:03 11 rows selected. [/code] Please note the default retention time in the script is 1 day; you most likely will change that to something more reasonable, such as 30 days. The AWR snapshots preserved by the new baselines will not be aged out as per standard AWR retention policies, allowing ample time to run AWR/ADDM reports on periods of interest, or query AWR directly for these periods so you can drill down on the root cause of performance issues and find the bottlenecks. Code from this article may also be found at https://github.com/jkstill/oracle-baselines

Insight and analysis of technology and business strategy

What's eating my database? Where to focus performance efforts using Oracle AWR

Top Categories

Tell us how we can help!