
Oracle Performance Issue: High Kernel-Mode CPU Usage

Recently, I resolved a performance issue with one of our esteemed clients. I found the problem interesting and worth blogging about.

The problem

An application makes 300 static connections to database DB1 on a database server, say CLNTDB. The application relies on database links: over a period of time, each of these sessions executes SQL through a database link, creating a corresponding connection in the central database PROD1. So there are 300 connections in the PROD1 database coming from the DB1 database through database links. Performance is fine during normal operation.

The problem starts when the application is shut down. Shutting down the application in DB1 creates massive CPU consumption on the PROD1DB server. This spike in CPU usage lasts for five to ten seconds and causes the ASM heartbeat to fail. Since PROD1 is a central database, this affects the functionality of applications globally. See the presentation linked at the end of this post for a graphical representation of this problem.

The symptoms

Looking at the symptoms in detail, we can see that the CPU usage is in sys (kernel) mode. It is not uncommon to see high CPU usage during a storm of disconnects, but this CPU usage is much higher than usual and saturates all CPUs in kernel mode. If this were simply the cost of process destruction, it would show up on the CLNTDB server too, since it also handles 300 disconnects. But the problem manifests only on the PROD1DB server.

mpstat output from the PROD1DB server shows the CPU usage in %sys mode. Notice the numbers under the sys column: almost every CPU is spending nearly all of its time in sys mode. We need to drill down further to understand why so many CPUs are busy in sys mode. Note also that no other column shows abnormally high values:

CPU ..intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    554  237  651  219   87  491    0  4349    9  91   0   0
  1    911    0 2412  591  353  630    0 15210   30  63   0   7
  2    313    0  613  106  190  105    0  3562    8  90   0   2
  3    255    0  492   92  161  530    0  2914    6  92   0   2
  4     86    1    2    3    1   63    0     8    0 100   0   0
  5    662    0 1269  153  326  211    0  6753   13  77   0  10
  6    349    0  589   54  170 1534    0  3002    7  88   0   5
  7    317   81  411  105   51   61    0  2484    4  93   0   3
  8   6423 6321  697   36  148  546    0  3663    6  86   0   9
  9    363    0  639   41  154 1893    0  3214    9  85   0   6
 10    456    0  964  194  193   77    0  5797   10  81   0   8
 11    104    0   42    3    9  183    0   207    0  95   0   5
 12    195    0  279  110   31   80    0  1590    3  97   0   0
 13    449    0  844  117  158  127    0  4486    7  85   0   8

Test case

To simplify the problem, let's analyze this issue for a single connection. I created a test user in the DB1 database, created a private database link from DB1 to PROD1, and then executed a select statement over that database link.

In DB1:

-- as a privileged user in DB1, create the test user:
create user test1 identified by test1;
grant connect, resource to test1;

-- connect as the test user and create a private database link to PROD1
-- ('proddb' is the TNS alias pointing to the PROD1 database):
conn test1/test1
create database link proddb
  connect to test1 identified by test1
  using 'proddb';

-- this select opens a database link connection in PROD1:
select * from dual@proddb;

At this point, a database link connection has been created in the PROD1 database. Let's query v$session in the PROD1 database to find the session created for the database link. SID 4306 is the session created for our test connection from the DB1 database.

select sid, serial#, LOGON_TIME,LAST_CALL_ET from v$session where
  logon_time > sysdate-(1/24)*(1/60) and machine='machine_name_here'
/
       SID    SERIAL# LOGON_TIME           LAST_CALL_ET
---------- ---------- -------------------- ------------
      4306      51273 12-SEP-2008 20:47:29            1

Associating this session to v$process, we find the UNIX process ID for this session.
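A query along these lines does the job; this is a minimal sketch that joins v$session to v$process on the paddr/addr columns, using the SID 4306 found above:

-- find the OS process ID (SPID) of the server process for SID 4306
select p.spid
  from v$session s, v$process p
 where s.paddr = p.addr
   and s.sid   = 4306;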

Now we will start truss on this UNIX process (Solaris platform), substituting the SPID found above for <pid>:

truss -p <pid> -d -o /tmp/truss.log

In DB1, we execute exit. This will disconnect the session in the PROD1 database.

Truss output

The truss output shows the activity generated in the PROD1 server process for the disconnect. Reading through it, we can see that the shmdt system calls consumed the bulk of the time. The -d flag used above prints a timestamp on each line, so the difference between consecutive timestamps gives the approximate elapsed time of each call. Calculating from the output below (for example, 18.5295 - 18.5053 = 0.0242), we see that each shmdt call takes approximately 24 ms. Of course, these are system calls, so this CPU usage shows up in kernel mode.

18.4630 close(10)                                       = 0
18.4807 shmdt(0x380000000)                              = 0
18.5053 shmdt(0x440000000)                              = 0
18.5295 shmdt(0x640000000)                              = 0
18.5541 shmdt(0x840000000)                              = 0
18.5784 shmdt(0xA40000000)                              = 0
18.6026 shmdt(0xC40000000)                              = 0
18.6273 shmdt(0xE40000000)                              = 0
18.6512 shmdt(0x1040000000)                             = 0
18.6752 shmdt(0x1240000000)                             = 0
18.6753 shmdt(0x1440000000)                             = 0

So, one disconnect executes 10 shmdt system calls and consumes approximately 0.24 CPU seconds in kernel mode. shmdt calls detach a process from a shared memory segment. Since this instance has 10 shared memory segments (as visible in the output of ipcs -ma), 10 shmdt calls are executed per session disconnect.
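To confirm the segment count on the server, something along these lines works (a quick sketch, assuming the instance owner is the oracle OS user):

# list all shared memory segments and count those owned by oracle
ipcs -ma | grep -c oracle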

Projecting this calculation to 300 connections, disconnecting all of them will consume approximately 72 CPU seconds in total. With 12 CPUs, that is at least 6 seconds of fully saturated kernel-mode CPU, assuming a linear projection. (In practice, due to mutex calls and similar contention, the scaling is not linear and the total will exceed 72 CPU seconds.) This matches our observation of 5 to 10 seconds of kernel-mode CPU consumption. The first thing we need to do is reduce the number of shmdt calls, and one way to do that is to reduce the number of shared memory segments.
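Spelled out as a back-of-the-envelope calculation (the ~0.024 s per call comes from the truss timestamps above):

# 10 shmdt calls/disconnect x ~0.024 s/call x 300 disconnects, spread over 12 CPUs
echo 'scale=1; 10 * 0.024 * 300 / 12' | bc    # => 6.0 seconds of saturated CPUs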

Shared memory segments

My first thought was that the SHMMAX kernel parameter (see also Paul Moen's The mysterious world of shmmax and shmall) was the limiting factor, since the SGA was bigger than SHMMAX and would therefore be split into multiple segments. But after increasing SHMMAX, restarting the server, and so on, many shared memory segments were still being created.
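As an aside, on Solaris 10 and later the shmmax limit is enforced through the project.max-shm-memory resource control rather than /etc/system, so checking it looks something like this (a sketch; the project name default is an assumption, so substitute the project the oracle user actually runs under):

# display the current shared memory limit for the project
prctl -n project.max-shm-memory -i project default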

That's interesting. I tried truss-ing the instance startup to understand why multiple shared memory segments are created. A few lines from the truss output show calls to the _lgrpsys and pset_bind system calls:

4.5957 munmap(0xFFFFFD7FFDAE0000, 32768)               = 0
4.5958 lgrp_version(1, )                               = 1
4.5958 _lgrpsys(1, 0, )                                = 42
4.5958 _lgrpsys(3, 0x00000000, 0x00000000)             = 19108
4.5959 _lgrpsys(3, 0x00004AA4, 0x06399D60)             = 19108
4.5959 _lgrpsys(1, 0, )                                = 42
4.5960 pset_bind(PS_QUERY, P_LWPID, 4294967295, 0xFFFFFD7FFFDFB11C) = 0
4.5960 pset_info(PS_MYID, 0x00000000, 0xFFFFFD7FFFDFB0D4, 0x00000000) = 0
4.5961 pset_info(PS_MYID, 0x00000000, 0xFFFFFD7FFFDFB0D4, 0x061AA2B0) = 0

The _lgrpsys calls indicate that there is some form of NUMA activity going on here. pset_bind is used to bind a thread or process to a specific processor set.

Enter NUMA.

NUMA or Locality groups

NUMA stands for Non-Uniform Memory Access. In high-end SMP systems, the backplane bus can become a bottleneck: in non-NUMA platforms, every memory access must go through the system bus, and that can limit scalability.

NUMA technology (specifically, cache-coherent NUMA or ccNUMA) was introduced to relieve this bottleneck. Each processor has its own local memory, connected to the rest of the system by some form of hardware interconnect. NUMA-based servers therefore differentiate between local and remote memory: access to local memory from a processor has lower latency than access to remote memory. There is an enormous amount of literature available about NUMA; refer to the Wikipedia entry for NUMA, to Kevin Closson's blog posts on Oracle on Opteron, K8L, NUMA, etc., and to my presentation linked at the end of this post.

On Solaris, NUMA is implemented as locality groups. To oversimplify, a process running in a locality group will have much less latency accessing memory local to that group. As the “remoteness” of memory increases, latency also increases. In future releases of Solaris, this locality might be applied to other resources such as I/O.

To make optimal use of NUMA, Oracle spreads the SGA across all locality groups. A DBWR process is then assigned to each locality group, and the pset_bind calls bind that DBWR to the corresponding locality group or processor set. (I believe an LGWR is also assigned and bound this way, but I cannot confirm that.)
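With the NUMA observability tools discussed below installed, the home locality group of such a process can be checked with plgrp, which ships alongside lgrpinfo. A sketch, where 12345 stands in for the PID of an ora_dbw0_PROD1 process:

# show the home lgroup of each thread of the given process
plgrp 12345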

Now we understand why there are 10 shared memory segments for this SGA.

I was curious—with 12 CPUs in the server, do we really have this many locality groups defined? Or are we hitting some kind of bug?

Locality group

lgrpinfo can be used to get locality group details. We found a machine with identical architecture and installed the observability tools from OpenSolaris.org's Observability tools for NUMA.

The output of lgrpinfo shows locality groups defined in this server:

#/usr/local/bin/lgrpinfo
lgroup 0 (root):
     Children: 10 12 14 15 17 19 21 23
     CPUs: 0-15
     Memory: installed 65024 Mb, allocated 2548 Mb, free 62476 Mb
     Lgroup resources: 1-8 (CPU); 1-8 (memory)?
     Latency: 146
lgroup 1 (leaf):
     Children: none, Parent: 9 <-- leaf node
     CPUs: 0 1
     Memory: installed 7680 Mb, allocated 1964 Mb, free 5716 Mb
     Lgroup resources: 1 (CPU); 1 (memory)?
     Load: 0.105
     Latency: 51
...
lgroup 9 (intermediate):
     Children: 1, Parent: 10
     CPUs: 0-5
     Memory: installed 24064 Mb, allocated 2187 Mb, free 21877 Mb
     Lgroup resources: 1-3 (CPU); 1-3 (memory)
     Latency: 81
...
lgroup 10 (intermediate):
     Children: 9, Parent: 0  <-- intermediate group
     CPUs: 0-13
     Memory: installed 56832 Mb, allocated 2491 Mb, free 54341 Mb
     Lgroup resources: 1-7 (CPU); 1-7 (memory)
     Latency: 113
...

There are many locality groups defined in this server. Locality groups are defined as a hierarchy, and seven of them are leaf-node locality groups. In the output above, lgroup 0 is the root group, and the intermediate groups are 10, 12, 14, 15, 17, 19, and 23.

For example, for lgroup 1, the hierarchy is: 1 -> 9 -> 10 -> 0.

Locality group latencies can also be seen with lgrpinfo -l. Access to local memory has lower latency: for example, the latency for locality group 1 accessing its own memory is 51, whereas accessing remote memory in locality group 24 costs 146. (These numbers are not actual latencies, merely a relative representation of latency.)

Recap

The high CPU usage in kernel mode was caused by an excessive number of calls to shmdt. Since there are 10 shared memory segments, there are 10 shmdt calls per disconnect; the SGA is spread across NUMA nodes, which is what creates this many segments.

The solution

Ten shared memory segments were created in order to exploit NUMA technology. NUMA is an excellent technology, and it is a pity that we are suffering from a side effect of it. We need to resolve this issue, and we have a handful of options at this point. We can:

  1. disable NUMA completely by setting _enable_numa_optimization to false (see the sketch after this list).
  2. reduce the number of NUMA nodes using _db_block_numa. (Interestingly, this throws an ORA-600 error during startup.)
  3. eliminate the database link-based architecture completely, using Streams or materialized views instead.
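For completeness, here is a sketch of option 1. Underscore parameters should only be set under guidance from Oracle Support, and the exact parameter name can vary by Oracle version, so treat this as an illustration rather than a recommendation:

-- disable Oracle's NUMA optimization; an instance restart is required
alter system set "_enable_NUMA_optimization" = false scope=spfile;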

In case you were wondering, I presented part of this to the Slovenian Oracle User Group (SIOUG), and it can be found in presentation format here: High CPU usage – SYS mode (PDF).
