The time to respond
A question came up at Pythian a while ago about how to monitor Oracle database instance performance and alert when there is significant degradation. That got me thinking: while different DBAs would take different approaches to interactively measure current instance performance, here we need something simple. It has to give a decisive answer, either "current performance is not acceptable" or "current performance is within normal (expected) limits".
Going back to the basics of how database performance can be described, we can simply say that database performance is either the response time of the operations end-users perform, and/or the amount of work the database instance does in a certain time period, i.e. the throughput.
We can easily find these metrics in the v$sysmetric dynamic view:
SQL> select to_char(begin_time,'hh24:mi') time,
            round(value * 10, 2) "Response Time (ms)"
     from v$sysmetric
     where metric_name = 'SQL Service Response Time'

TIME   Response Time (ms)
------ ------------------
07:20                 .32
So this is the last-minute response time for user calls (here in ms). We can check the throughput by measuring the number of logical block reads (which include the physical block reads) plus the direct path reads, which bypass the buffer cache (output for the last minute and the last several seconds shown here, for a database with an 8 KB block size):
SQL> select a.begin_time, a.end_time,
            round(((a.value + b.value)/131072), 2) "GB per sec"
     from v$sysmetric a, v$sysmetric b
     where a.metric_name = 'Logical Reads Per Sec'
       and b.metric_name = 'Physical Reads Direct Per Sec'
       and a.begin_time = b.begin_time
     /

BEGIN_TIME           END_TIME             GB per sec
-------------------- -------------------- ----------
16-jun-2013 08:51:36 16-jun-2013 08:52:37        .01
16-jun-2013 08:52:22 16-jun-2013 08:52:37        .01
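As a sanity check on the 131072 divisor, here is the same unit conversion sketched in Python (the input reads-per-second figures are hypothetical; only the 8 KB block size assumption comes from the article):

```python
# Converting block reads per second to GB/sec for an 8 KB block database.
BLOCK_SIZE = 8192                      # bytes per block (8 KB)
BLOCKS_PER_GB = 2**30 // BLOCK_SIZE    # 1 GB / 8 KB = 131072 blocks

def gb_per_sec(logical_reads_per_sec, direct_reads_per_sec):
    """Total read throughput in GB/sec from the two per-second metrics."""
    return (logical_reads_per_sec + direct_reads_per_sec) / BLOCKS_PER_GB

# Hypothetical figures: ~1,310 blocks/sec is roughly 0.01 GB/sec,
# the same order of magnitude as the query output above.
print(round(gb_per_sec(1200, 110), 2))
```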
We can check more historical values through v$sysmetric_summary, v$sysmetric_history and dba_hist_sysmetric_summary.
So did these queries answer the basic question, "Do we have bad performance?" 100 MB/sec throughput and 0.32 ms per user call? We have seen better performance, but is it bad enough that we should alert the on-call DBA to investigate in more detail and look for the reason behind these values? We cannot say. We need something to compare these values against before we can tell whether they are too low or too high. It is somewhat like sitting in a train that passes another train moving in the same direction at a different speed: we don't know the speed of our train, and we don't know the speed of the other train, so we cannot answer the question "Are we going very fast?". If we turn to the other side and see a tree passing by the window, we can estimate the speed of our train (also drawing on our experience of what counts as very fast for a train…). So we need something that has an absolute value. In the case of the tree, we know the tree has a speed of 0 (OK, it is not completely absolute, but we had to simplify for now :) ).
So we understand that we need an absolute value, or baseline, which we know represents "bad", "normal", or "good" performance. How do we find these values?
Bad, normal and good
One way to establish these absolutes is simply to experiment: establish when the database instance provides acceptable performance by going to the applications that use the database and checking their response times, or by running the queries the application runs directly and checking that they complete in acceptable time (as defined by the business requirements). Once you reach that point, measure the database instance response time and current throughput, and carve them in stone as the absolutes against which future measurements are compared.
The approach above may sometimes work, but when you start measuring response time, you will notice that it can go up and down wildly. You will need to define some bounds around the value you consider a "normal" response time; a response time above the upper bound can then be called "bad", and we can alert that we have performance degradation.
To define this point more accurately, I would suggest using another strategy. We can make an “educated” guess on these values by analyzing them historically from the DBA_HIST_SYSMETRIC_SUMMARY view. We just need to have enough history in there.
We can find the average response time and more importantly the standard deviation of the values – this would tell us what is a “normal” response time and everything above that, a “bad” one:
The graph represents an example of a response time value distribution, while points A and B represent the standard deviation bounds, i.e. the bounds within which we can say the response time is normal. Here is an example of how we can determine points A and B, i.e. the "normal" boundaries:
SQL> with epsilon as
       (select avg(average - standard_deviation) m1,
               avg(average + standard_deviation) m2
        from dba_hist_sysmetric_summary
        where metric_name = 'User Calls Per Sec')
     select avg(a.average - a.standard_deviation) "A - Good",
            avg(a.average)                        "Average",
            avg(a.average + a.standard_deviation) "B - Bad"
     from dba_hist_sysmetric_summary a,
          dba_hist_sysmetric_summary b,
          epsilon e
     where a.metric_name = 'SQL Service Response Time'
       and b.metric_name = 'User Calls Per Sec'
       and a.snap_id = b.snap_id
       and b.average between e.m1 and e.m2
     /

  A - Good    Average    B - Bad
---------- ---------- ----------
.026797584  .04644541 .066093237
Please note the subquery called epsilon. I have used it to limit the history we are learning from to the subset of AWR snapshots where more meaningful work was done on the database. It leaves out times of very low activity and times of abnormally high activity, which don't necessarily show a representative load from which we can extract our "normal" response time behavior.
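The logic of this baseline query can be sketched in plain Python. Here the rows are hypothetical stand-ins for DBA_HIST_SYSMETRIC_SUMMARY (one tuple per AWR snapshot: response-time average and stddev, user-calls average and stddev); the epsilon filter and the A/B calculation follow the query above:

```python
# Hypothetical per-snapshot metric rows:
# (response time avg, response time stddev, user calls/sec avg, user calls/sec stddev)
rows = [
    (0.040, 0.015, 200.0, 30.0),
    (0.050, 0.020, 250.0, 40.0),
    (0.030, 0.010, 180.0, 25.0),   # quieter snapshot
    (0.090, 0.040, 400.0, 60.0),   # abnormal activity spike
    (0.045, 0.018, 240.0, 35.0),
]

# "epsilon": averages of (calls - sd) and (calls + sd) over all snapshots
m1 = sum(c - s for _, _, c, s in rows) / len(rows)
m2 = sum(c + s for _, _, c, s in rows) / len(rows)

# keep only the snapshots whose load falls inside the representative band
kept = [(rt, sd) for rt, sd, c, _ in rows if m1 <= c <= m2]

good = sum(rt - sd for rt, sd in kept) / len(kept)   # point A - "good"
avg  = sum(rt for rt, _ in kept) / len(kept)         # average
bad  = sum(rt + sd for rt, sd in kept) / len(kept)   # point B - "bad"
print(good, avg, bad)
```

With these made-up numbers, the quiet and spiky snapshots fall outside the epsilon band and only the representative ones contribute to points A and B.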
So now when we check the current response time:
SQL> select to_char(begin_time,'hh24:mi') time,
            value "Response Time"
     from v$sysmetric
     where metric_name = 'SQL Service Response Time'
     /

TIME   Response Time
------ -------------
02:23     .036560192
If it goes above point B (over .066093237), we might have a reason for concern.
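A minimal sketch of the resulting alert check, hard-coding the A and B points computed above (the function name and the status strings are my own illustration, not part of any monitoring tool):

```python
# Baseline boundaries learned from DBA_HIST_SYSMETRIC_SUMMARY (values from
# the query output above).
POINT_A = 0.026797584   # "good" boundary
POINT_B = 0.066093237   # "bad" boundary

def classify_response_time(current):
    """Classify the current 'SQL Service Response Time' value against the baseline."""
    if current > POINT_B:
        return "ALERT: response time above normal bounds"
    if current < POINT_A:
        return "better than usual"
    return "normal"

print(classify_response_time(0.036560192))  # the value measured above
```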
But what about determining whether we have normal or bad throughput? For some applications this might be a more useful metric for judging current performance. We can use the same method as above, just changing the monitored metrics to Logical Reads Per Sec and Physical Reads Direct Per Sec.
Specific Response Time
When looking into response time and throughput, we see that they actually depend on each other. Increased response time will lead to decreased throughput, and increased throughput might eventually lead to increased response time as system resources (CPUs, I/O subsystems…) become saturated and ultimately overloaded.
So I was thinking that we cannot just compare response time at one point to another without taking both of these metrics into account at the same time. We could use a new, so-called "specific response time": the response time per 1 GB/sec of throughput. I calculated it like this:
sRT = Response Time (in ms) / Throughput (in GB/sec)
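This formula, with the unit conversions used throughout the article (the ×10 to go from the metric's native unit to ms, and the 8 KB block-to-GB conversion), can be sketched as a small helper; the example input values are hypothetical:

```python
# Specific response time: sRT = response time (ms) / throughput (GB/sec),
# assuming an 8 KB database block size as in the article.
def specific_response_time(resp_time_metric, logical_rps, direct_rps,
                           block_size=8192):
    resp_ms = resp_time_metric * 10                      # metric value -> ms
    gb_sec = (logical_rps + direct_rps) * block_size / 2**30
    return resp_ms / gb_sec                              # ms per (GB/sec)

# Hypothetical example: metric value 0.04 and ~13,107 block reads/sec
# (i.e. 0.1 GB/sec) give an sRT of 4 ms per GB/sec.
print(specific_response_time(0.04, 13000, 107.2))
```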
So we can calculate baseline points A and B (for an 8 KB block database):
SQL> with epsilon as
       (select avg(average - standard_deviation) m1,
               avg(average + standard_deviation) m2
        from dba_hist_sysmetric_summary
        where metric_name = 'User Calls Per Sec')
     select avg( ((a.average - a.standard_deviation) * 10) /
                 (((c.average - c.standard_deviation) +
                   (d.average - d.standard_deviation)) / 131072) ) A,
            avg( (a.average * 10) /
                 ((c.average + d.average) / 131072) ) "Average",
            avg( ((a.average + a.standard_deviation) * 10) /
                 (((c.average + c.standard_deviation) +
                   (d.average + d.standard_deviation)) / 131072) ) B
     from dba_hist_sysmetric_summary a,
          dba_hist_sysmetric_summary b,
          dba_hist_sysmetric_summary c,
          dba_hist_sysmetric_summary d,
          epsilon e
     where a.metric_name = 'SQL Service Response Time'
       and b.metric_name = 'User Calls Per Sec'
       and c.metric_name = 'Logical Reads Per Sec'
       and d.metric_name = 'Physical Reads Direct Per Sec'
       and a.snap_id = b.snap_id
       and a.snap_id = c.snap_id
       and a.snap_id = d.snap_id
       and b.average between e.m1 and e.m2
     order by 1
     /

         A    Average          B
---------- ---------- ----------
.066348184 .095471353 .116012419
Since these are moving-window baselines (meaning they will change as time goes by), it is a good idea to compare them to each other periodically. This shows the trend in database usage and performance. As I said before, to account for a possible increase in the demand put on the database, we can use the specific response time to monitor the trend. The graph below shows the trend line in a specific response time vs. time graph (I used Excel to draw the graph and the trendline):
There is one more thing: Database Efficiency
There is one more thing we need to ask when monitoring performance:
"Can we make the database run faster on the same hardware?" Or, translated: "What percentage of the hardware are we using directly towards executing user calls?" If we say that the database server machine is essentially just the CPU(s) and the RAM, and that we want to use these components as much as possible for end-user calls, minimizing time spent on disk, network, SSD and, most importantly, wasted end-user time (such as sleeping while waiting for a latch, lock, or mutex to become free), then we can translate the question once more into DBA language as "the percentage of DB time spent on CPU". DB time, as we know, is the sum of all the time end-user sessions (foreground processes) spend in the active state. If a process is "on CPU", it should mean it is actively using what I would call the "primary hardware", the CPU and RAM; in other words, it is getting the most out of the hardware the database instance runs on. (Latch spinning is certainly a CPU operation, but it is reported as wait time in the database and is not counted in the CPU time metric.) So we can say that the more of the DB time is spent on CPU, the more efficient the DB instance is. Of course, we also need to keep an eye on CPU load: if it goes too high, it means we have reached the hardware's limits.
There is a metric for this that we can monitor, but as with the (specific) response time, we need to establish what will be called good, normal and bad efficiency. Inspired by the colored energy-efficiency ranking graphic we see on various electrical appliances (and on cars as well), we can rank database instance efficiency in a similar way:
Again, we can establish some fixed values (for example, adopting the standard efficiency rankings shown in the image above), or we can create moving baselines as before from the usage history of the particular DB instance, using the following query (though with fewer ranks):
with epsilon as
  (select avg(average - standard_deviation) m1,
          avg(average + standard_deviation) m2
   from dba_hist_sysmetric_summary
   where metric_name = 'User Calls Per Sec')
select avg(round(a.average + a.standard_deviation)) +
         stddev(round(a.average + a.standard_deviation)) A,
       avg(round(a.average + (a.standard_deviation/2))) +
         stddev(round(a.average + (a.standard_deviation/2))) B,
       avg(round(a.average)) C,
       avg(round(a.average - (a.standard_deviation/2))) -
         stddev(round(a.average - (a.standard_deviation/2))) D,
       avg(round(a.average - a.standard_deviation)) -
         stddev(round(a.average - a.standard_deviation)) E
from dba_hist_sysmetric_summary a,
     dba_hist_sysmetric_summary b,
     epsilon e
where a.metric_name = 'Database CPU Time Ratio'
  and b.metric_name = 'User Calls Per Sec'
  and a.snap_id = b.snap_id
  and b.average between e.m1 and e.m2
/

         A          B          C          D          E
---------- ---------- ---------- ---------- ----------
73.2758612  68.301602 55.1180124 40.8703584 36.8510341
So the C value is simply the average "CPU % of DB time" we have managed to achieve until now. We can consider efficiency "normal" if the current value lies between points B and D. Here are these points as represented on a distribution graph:
You may notice that I have widened the region of normal DB efficiency (from B to D) by taking the outer bounds of the subset: the standard deviation is first added to or subtracted from each per-AWR-snapshot average (30-minute snapshots here), then those values are averaged over all snapshots, and finally the standard deviation of that set is added or subtracted as well.
avg( avg_per_AWR_snapshot +/- standard_deviation ) +/- stddev( avg_per_AWR_snapshot +/- standard_deviation )
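The same widening can be sketched in Python with hypothetical per-snapshot values of the Database CPU Time Ratio metric (the numbers below are made up; only the formula comes from the article):

```python
from statistics import mean, pstdev

# Hypothetical (average, standard deviation) pairs for the
# 'Database CPU Time Ratio' metric, one per 30-minute AWR snapshot.
snap = [(55.0, 10.0), (60.0, 12.0), (50.0, 8.0), (58.0, 11.0)]

upper_vals = [a + s for a, s in snap]   # avg + sd, per snapshot
lower_vals = [a - s for a, s in snap]   # avg - sd, per snapshot

# Push each band edge outward by the stddev of the per-snapshot values.
B = mean(upper_vals) + pstdev(upper_vals)   # upper edge of "normal"
D = mean(lower_vals) - pstdev(lower_vals)   # lower edge of "normal"
print(D, B)   # a current efficiency between D and B counts as "normal"
```

The outer `stddev` terms are what enlarge the band compared to a plain average of (avg ± sd), which is exactly the point of the construction.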
I am looking for a bigger range of values to treat as OK (normal), so that we don't alert too often on short, transient efficiency degradation.