For the past few months I’ve been using a query that I refer to as “ash report – recent spike”. It’s the second thing I do when I get a call of the “the system is slow” type. The first thing I do is run “top” (or its equivalent for the OS) and check the overall CPU usage.
The script is fully RAC-aware, and although it’s not 100% perfect, I use that imperfection to see if any particular node is doing something stupid. It is primarily targeted at OLTP systems, but it can be useful for data warehouses as well, especially ones that use the parallel option.
The query is to the database what “load” (from uptime) is to a Linux/Unix machine, except with much more detail. It is basically a summarization query over the v$active_session_history view. NOTE: you need the Diagnostics Pack license to query it. The query text itself is not formatted to be read; paste it as-is and concentrate on the results.
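Before running it, you can check whether ASH access is even enabled on the instance. This is a small sketch in SQL*Plus; the parameter is standard, but having it set does not by itself mean you own the license:

```sql
-- ASH/AWR queries require control_management_pack_access to be
-- DIAGNOSTIC or DIAGNOSTIC+TUNING (and the matching license).
show parameter control_management_pack_access
```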
It has two “variables” that you can adjust: how far back to look (I use two hours), and how aggressively to look for problems (having count(*) >= 2).
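For reference, those are the only two lines of the query you need to touch. For example, to look back four hours and only flag heavier pile-ups, they become:

```sql
-- inside the inner query: widen the look-back window
and sample_time > sysdate - interval '4' hour
-- and raise the threshold for what counts as a pile-up
having count(*) >= 5
```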
An explanation of how to make sense of the results follows the query.
select round(avg(max(cnt_tot)) over
             (order by sample_time
              range between interval '5' minute preceding and current row)) as avg,
       max(cnt_tot) as tot,
       sample_time,
       max(cnt) as cnt,
       event,
       substr(sq.sql_text, 1),
       ash.sql_id,
       ash.sql_child_number chd,
       /* plsql_entry_object_id, plsql_entry_subprogram_id, plsql_object_id,
          plsql_subprogram_id, session_state, qc_session_id, qc_instance_id,
          blocking_session, blocking_session_status, blocking_session_serial#,
          event_id, event#, seq#, p1text, p1, p2text, p2, p3text, p3,
          wait_class, wait_class_id, wait_time, time_waited, xid,
          current_obj#, current_file#, current_block#,
          program, module, action, client_id */
       to_char(round(sum(elapsed_time) / nullif(sum(executions), 0) / 1000000, 6),
               '9,999,999,990.999999') as "sec p",
       round(sum(disk_reads)     / nullif(sum(executions), 0), 0) as "disk p",
       round(sum(buffer_gets)    / nullif(sum(executions), 0), 0) as "gets p",
       round(sum(rows_processed) / nullif(sum(executions), 0), 0) as "rows p",
       round(sum(cpu_time) / 1000000 / nullif(sum(executions), 0), 3) as "cpu p",
       sum(executions)      as exec,
       sum(users_opening)   as open,
       sum(users_executing) as e
  from (select sum(count(*)) over (partition by sample_time) as cnt_tot,
               count(*) as cnt,
               sample_time,
               event,
               sql_id,
               sql_child_number
          from gv$active_session_history
         where 1=1
           and sample_time > sysdate - interval '2' hour
         group by event, sql_id, sql_child_number, sample_time
        having count(*) >= 2) ash,
       gv$sql sq
 where ash.sql_id = sq.sql_id (+)
   and ash.sql_child_number = sq.child_number (+)
 group by event, sql_text, ash.sql_id, ash.sql_child_number, ash.sample_time
 order by sample_time desc
The query is built around one idea: you shouldn’t have too many sessions doing the same thing at the same moment. Any time multiple sessions are running the same query and waiting on the same event, they are grouped together.
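The core of that idea is the inner aggregation. Stripped down to its essentials (a simplified sketch, not the full report), it looks like this:

```sql
-- count active sessions per (event, sql) per one-second ASH sample
select sample_time, event, sql_id, count(*) as cnt
  from gv$active_session_history
 where sample_time > sysdate - interval '2' hour
 group by sample_time, event, sql_id
having count(*) >= 2
 order by sample_time desc;
```

The full query then joins these groups to gv$sql for the per-execution statistics and layers the 5-minute moving average on top.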
Here are the explanations of each column:
AVG — the average “load” (active sessions) over a 5-minute interval. This should help you spot a problem when you scroll through the results.
TOT — total “load” (active sessions) for that sample time. RAC users: each node records its own sample time, within a second of the others but not exactly aligned, so sessions waiting on the same event on different nodes will not be grouped together. I kind of like it this way, for now.
SAMPLE_TIME — self-explanatory
CNT — the number of active sessions waiting on the same event and query
EVENT — the event being waited on
SQL_TEXT — self-explanatory; when empty, the statement was either not found in the shared pool or is not available in ASH
SQL_ID — if you need to find the SQL
CHD — the child number being executed
NOTE: the following columns are current values from gv$sql, not values as of the sample time.
SEC P, DISK P, GETS P, ROWS P, CPU P — average per-execution statistics for the query. They give a quick feel for whether it is a big query or a small one, CPU-intensive or disk-I/O-intensive. Be careful: these are averages over the query’s entire life in the shared pool, and can therefore be misleading. They are all per-execution figures, so SEC P is the average number of seconds one execution takes.
OPEN — comes from gv$sql: the number of sessions that have the query open
E — the number of sessions actively executing the query
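If you ever do want the RAC nodes’ samples grouped together, one option (an untested sketch; it costs you sub-second resolution, and samples can still straddle a second boundary) is to bucket sample_time to the second in the inner query:

```sql
-- bucket ASH samples to whole seconds so the RAC nodes'
-- near-simultaneous samples land in the same group
select sum(count(*)) over (partition by cast(sample_time as date)) as cnt_tot,
       count(*) as cnt,
       cast(sample_time as date) as sample_time,
       event, sql_id, sql_child_number
  from gv$active_session_history
 where sample_time > sysdate - interval '2' hour
 group by event, sql_id, sql_child_number, cast(sample_time as date)
having count(*) >= 2
```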
I generally use having count(*) >= 2 — the lowest reasonable setting — to get an overview of what’s happening on the server. On my servers this usually doesn’t return many rows; it really depends on how busy yours are. Play with this “filter” until you see only what you are interested in.
If you are hunting for a problem, you can raise that to, say, 10 and remove the time restriction in order to find the periods when the database was particularly busy.
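In that “spike hunting” mode, the inner query might look like this (a sketch; tune the threshold to your server’s baseline):

```sql
-- no time filter, higher bar: show only the worst pile-ups
-- still in the in-memory ASH buffer
select sample_time, event, sql_id, count(*) as cnt
  from gv$active_session_history
 group by sample_time, event, sql_id
having count(*) >= 10
 order by sample_time desc;
```

Keep in mind that gv$active_session_history is a circular in-memory buffer, so even without the time filter it only goes back so far; for older history the same idea can be applied to dba_hist_active_sess_history (sampled every 10 seconds, kept for the AWR retention period).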
Feel free to ask any questions if you have any.