Alter session kill on steroids

Posted in: Technical Track

Perhaps you have encountered something like this: A session that is consuming too many resources needs to be killed. You locate the session and use ALTER SYSTEM KILL SESSION ‘SID,SERIAL#’ to kill the session. As you continue to monitor the database you find that the status of the session in v$session is ‘KILLED’, but the session does not go away. You also notice that the SERIAL# is continually changing.

Now you find there is no OS process associated with the session, but the session continues as PMON is unable to finish cleanup for the session. Usually when this happens, the session will be holding a lock. When that happens, the only method to release the lock is to bounce the database. There are some bugs that may be responsible for this problem, such as this one described by Oracle Support:

Pmon Spins While Cleaning Dead Process (Doc ID 1130713.1)

This particular bug affects Oracle 10.2.0.1 – 11.1.0.7. I have personally seen this same behavior happen on many versions of the database from 7.0 on. To avoid these hanging sessions many DBA’s have adopted the habit of first killing the OS process with an OS utility, and if the session is still visible in v$session, issue the ALTER SYSTEM KILL command.

The OS command used on linux/unix is usually ‘kill -9’. On windows it is OraKill. This method usually avoids the problems encountered when killing a session that is holding a lock and processing DML.

I don’t know just what circumstances trigger this behavior, as I have never been able to reproduce it at will. When it does happen though, it is more than annoying as the only way to clear locks held by the recalcitrant session is to bounce the database.

Quite some time ago (at least as far back as Oracle 8i) Oracle introduced the new IMMEDIATE keyword to use with ALTER SYSTEM KILL SESSION. Using this keyword removes the need to use an OS command to kill a session – Oracle will do it for you! To test this I am using Oracle 10.2.0.4 on Oracle Linux 5.5. I have previously run these same tests in 11.2.0.3 with the same results. Had I access to an 8i or 9i database I would have run the tests there. To start with let’s see what happens when a session is killed without the immediate keyword.

Login to the session to be killed:

$ sqlplus scott/[email protected]

Login as SYSDBA from another terminal and check for scott’s session:

[sql]SQL> l
1 select
2 s.username,
3 s.sid,
4 s.serial#,
5 p.spid spid
6 from v$session s, v$process p
7 where s.username = ‘SCOTT’
8* and p.addr = s.paddr
SQL> /

USERNAME SID SERIAL# SPID
—————————— ———- ———- ————
SCOTT 133 35 22870

1 row selected.
[/sql]

All that has happened at this point is that Oracle has made an internal call that has disconnected Scott’s session. (tracing that operation is a different topic.) The process on the server has not been terminated. This can be seen by the following experiment:

Logon again as Scott.

In a SYSDBA session check for Scott’s:

[sql] SQL> @scott

USERNAME SID SERIAL# SPID
—————————— ———- ———- ————
SCOTT 146 81 23678
[/sql]

Now check for the shadow process associated with scott’s session on the server:


[[email protected] tmp]# ps -fp 23678
UID PID PPID C STIME TTY TIME CMD
oracle 23678 1 0 16:56 ? 00:00:00 oraclejs01 (LOCAL=NO)

Kill the session and check the status:

[sql] SQL> alter system kill session ‘146,81’;

SQL> l
1 select
2 s.username,
3 s.sid,
4 s.serial#,
5 p.spid spid
6 from v$session s, v$process p
7 where s.username = ‘SCOTT’
8* and p.addr = s.paddr
SQL>/

no rows selected
[/sql]

Check again on the server for the process:

[[email protected] tmp]# ps -fp 23678
UID PID PPID C STIME TTY TIME CMD
oracle 23678 1 0 16:56 ? 00:00:00 oraclejs01 (LOCAL=NO)

Interesting, isn’t it? We know the process is still alive on the server, but the session information is no longer associated with the process. This happens because Oracle has disconnected the session, which allows the process to continue until the sqlplus session is terminated. The session information is still available in v$session, but is no longer associated with a server process:

[sql] select
2 s.username,
3 s.status,
4 s.sid,
5 s.serial#
6 from v$session s
7* where s.username = ‘SCOTT’
SQL>/

USERNAME STATUS SID SERIAL#
—————————— ——– ———- ———-
SCOTT KILLED 146 81

1 row selected.

1* select pid,spid from v$process where pid = 146
SQL>/

no rows selected
[/sql]

When exiting the Scott session, I can see that the session was killed:

[sql] SQL> exit
ERROR:
ORA-00028: your session has been killed
[/sql]

Let’s perform the experiment again, but this time use the IMMEDIATE keyword.

Logon as scott:

> sqlplus scott/[email protected]

SQL*Plus: Release 11.2.0.3.0 Production on Tue Aug 5 17:18:53 2014

Logon as SYSDBA and check for the scott session;

[sql] SQL> @scott

USERNAME SID SERIAL# SPID
—————————— ———- ———- ————
SCOTT 146 83 23939

1 row selected.
[/sql]

Before killing scott’s session:

  • get my OS PID
  • enable 10046 trace

The OS PID will be used for strace on the SYSDBA session shadow process on the server.
The 10046 trace is so we can see what is happening in the strace output.

[sql] SQL> l
1 select
2 s.username,
3 s.sid,
4 s.serial#,
5 p.spid spid
6 from v$session s, v$process p
7 where s.username is not null
8 and p.addr = s.paddr
9 and userenv(‘SESSIONID’) = s.audsid
10* order by username, sid
SQL>/

USERNAME SID SERIAL# SPID
—————————— ———- ———- ————
SYS 145 65 23947

1 row selected.

SQL> alter session set events ‘10046 trace name context forever, level 12’;

Session altered.
[/sql]

Now ssh to the db server , check for Scott session shadow process and start strace:

[[email protected] tmp]# strace -o 23947.strace -p 23947
^Z
[1]+ Stopped strace -o 23947.strace -p 23947
[[email protected] tmp]# bg
[1]+ strace -o 23947.strace -p 23947 &

[[email protected] tmp]# ps -p 23939
PID TTY TIME CMD
23939 ? 00:00:00 oracle

Now kill Scott’s session and exit the SYSDBA session:

[sql] SQL> alter system kill session ‘146,83’ immediate;

System altered.
[/sql]

The strace command will now have exited on the server.

First check again for Scott’s session:

[[email protected] tmp]# ps -p 23939
PID TTY TIME CMD
[[email protected] tmp]#

So the Scott shadow process has terminated.

As the 10046 trace was enabled, the output to the oracle trace file will appear in the strace file, which allows searching for ‘alter system kill’ in the strace file.

From the strace file:

write(5, "alter system kill session '146,8"..., 44) = 44

Now searching for the PID of scott’s session 23939:

[sql] read(10, "23939 (oracle) S 1 23939 23939 0"…, 999) = 228
close(10) = 0
open("/proc/23939/stat", O_RDONLY) = 10
read(10, "23939 (oracle) S 1 23939 23939 0"…, 999) = 228
close(10) = 0
kill(23939, SIGKILL) = 0
kill(23939, SIGCONT) = 0
open("/proc/23939/stat", O_RDONLY) = 10
read(10, "23939 (oracle) Z 1 23939 23939 0"…, 999) = 178
close(10) = 0
[/sql]

From the previous text I can see that Oracle opened the status file for PID 23939.
Why it did so twice I am not sure.

What happens after that is the interesting part.

kill(23939, SIGKILL) = 0

That line means that the SIGKILL signal was successfully sent to Scott’s shadow process.

What does that mean? Run kill -l to get a list of signals:

kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

Notice that signal 9 (kill -9) is SIGKILL.

So when killing a session with ALTER SYSTEM KILL SESSION ‘PID,SERIAL#’ IMMEDIATE Oracle is actually doing the kill -9 for you, and has been for many years now.

Though not shown here, this same test was run when the session was killed without using the IMMEDIATE keyword, and there were no attempts to kill the session. This was inferred as well by the fact the the process was still running on the server up until the time the Scott sqlplus session was exited.

email

Author

Want to talk with an expert? Schedule a call with our team to get the conversation started.

About the Author

Oracle experience: started with Oracle 7.0.13 Programming Experience: Perl, PL/SQL, Shell, SQL Also some other odds and ends that are no longer useful Systems: Networking, Storage, OS to varying degrees. Have fond memories of DG/UX

13 Comments. Leave new

Maris Elsins
August 8, 2014 2:33 am

I had always used “alter system disconnect session …” when I needed to terminate it immediately without letting it wait for the ongoing call to complete.
Now when you showed me the “alter system kill session .. immediate”, it looked exactly the same as disconnect session, so I checked the latest documentation here https://docs.oracle.com/database/121/SQLRF/statements_2017.htm#SQLRF53113 and confirmed:
A) DISCONNECT SESSION = KILL SESSION IMMEDIATE
B) KILL SESSION = DISCONNECT SESSION POST_TRANSACTION
basically “KILL SESSION”, and “DISCONNECT SESSION” does the same thing, but the default action is different.

Reply

Thanks for the comments Maris. I’ve never really made use of DISCONNECT SESSION, so I tried it.
Unlike KILL, DISCONNECT SESSION requires either the POST_TRANSACTION or IMMEDIATE keyword:


SQL> alter system disconnect session '142,1628';
alter system disconnect session '142,1628'
*
ERROR at line 1:
ORA-02000: missing POST_TRANSACTION or IMMEDIATE keyword

Here’s the session info:

CLIENT SRVR
USERNAME SID SERIAL# SQL ID PID STATUS MACHINE OSUSER CLIENT PROGRAM PID SERVER PROGRAM PID LOGON TIME IDLE TIME
---------- ------ ------- -------------- ------ ---------- ---------- ------- -------------------- ------------ -------------------- ----- ----------------- -----------
DBSNMP 128 783 20 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 [email protected] 11610 08/08/14 08:57:19 00:00:03:24
ks.com
132 1103 22 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 [email protected] 11616 08/08/14 08:57:19 00:00:03:24
ks.com
140 1314 15 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 [email protected] 11164 08/08/14 08:42:19 00:00:03:21
ks.com
146 1843 23 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 [email protected] 11622 08/08/14 08:57:19 00:00:03:24
ks.com
149 24 16 INACTIVE ora10gR2.j oracle JDBC Thin Client 1234 [email protected] 21564 08/05/14 15:38:43 00:00:00:06
ks.com

SCOTT 142 1628 7h35uxf5uhmm1 24 INACTIVE poirot jkstill [email protected] (TNS 4325 [email protected] 11844 08/08/14 09:00:26 00:00:00:15

SYS 138 1608 chmy5m24rfhvp 26 ACTIVE poirot jkstill [email protected] (TNS 4327 [email protected] 11852 08/08/14 09:00:41 00:00:00:00

Now to disconnect the session:


SQL> alter system disconnect session '142,1628' IMMEDIATE;

System altered.

Checking the strace file for the SYSDBA session I find that indeed SIGKILL is used

open("/proc/11844/stat", O_RDONLY) = 10
read(10, "11844 (oracle) S 1 11844 11844 0"..., 999) = 228
close(10) = 0
kill(11844, SIGKILL) = 0
kill(11844, SIGCONT) = 0
open("/proc/11844/stat", O_RDONLY) = 10
read(10, "11844 (oracle) Z 1 11844 11844 0"..., 999) = 178
close(10) = 0
semtimedop(524288, 0x7fff711b2b70, 1, {0, 50000000}) = -1 EAGAIN (Resource temporarily unavailable)
open("/proc/11844/stat", O_RDONLY) = -1 ENOENT (No such file or directory)

What cannot be seen from strace though is what Oracle library calls are being made.
That could be traced with gdb, maybe something to check out another time.

The reason I mention this is due to the differences in the documentation for KILL and DISCONNECT SESSION.

KILL

IMMEDIATE Specify IMMEDIATE to instruct Oracle Database to roll back ongoing transactions, release all session locks, recover the entire session state, and return control to you immediately.

DISCONNECT SESSION

The IMMEDIATE setting disconnects the session and recovers the entire session state immediately, without waiting for ongoing transactions to complete.

Whether or not there are actual differences or this is a documentation bug, I cannot say at this time.
The documenation however is not consistent – further testing would be required to determine if these are really equivilent.

The documentation also implies that “DISCONNECT SESSION ‘sid,serial#’ IMMEDIATE POST_TRANSACTION” can be used:


"The IMMEDIATE setting disconnects the session and recovers the entire session state immediately, without waiting for ongoing transactions to complete.

If you also specify POST_TRANSACTION and the session has ongoing transactions, then the IMMEDIATE keyword is ignored."

That however does not work (tested on 11.2.0.4):


SQL> alter system disconnect session '60,14681' immediate post_transaction;
alter system disconnect session '60,14681' immediate post_transaction

Reply

Using Frits Hoogland’s macros I ran gdb against two SYSDBA sessions, one of which I used to terminate a session with KILL IMMEDIATE and the other to terminate a session with DISCONNECT SESSION IMMEDIATE.

Both appear to do the same thing, at least as far as I can tell at this time.

So Maris, your assertions that these are identical seem to correct.

The documentation is not quite so clear on this.

KILL IMMEDIATE

Continuing.
opiclo — close cursor
opiosq0 — parse
kkshini — hard parse
opiexe — execute

opiclo — close cursor
qerxtIsActive — external table fetch

Program exited normally.

DISCONNECT SESSION IMMEDIATE

Continuing.
opiclo — close cursor
opiosq0 — parse
kkshini — hard parse
opiexe — execute

opiclo — close cursor
qerxtIsActive — external table fetch

Program exited normally.

Reply

Jared, thanks for the research!

Reply
Vyacheslav Rasskazov
August 12, 2014 3:44 pm

kill/disconnect immediate does not send SIGKILL at all cases. For example, SIGKILL will not be sent to the process that executes the next query:
with D as
(
select 5 a, 7 b from dual
union all select 7 a, 5 b from dual
union all select 6 a, 6 b from dual
union all select -5 a, -7 b from dual
union all select -7 a, -5 b from dual
union all select 7 a, -5 b from dual
union all select -5 a, 7 b from dual
union all select -7 a, 5 b from dual
union all select 5 a, -7 b from dual
union all select 1 a, 0 b from dual
union all select 0 a, 1 b from dual
union all select 0 a, -1 b from dual
union all select -1 a, 0 b from dual
union all select 0 a, 0 b from dual
union all select 0 a, 1/3 b from dual
union all select 1/3 a, 0 b from dual
),
I0 as (select a, b, ((a – b) / 40) + 0.61 x from D),
I1 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I0),
I2 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I1),
I3 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I2),
I4 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I3),
I5 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I4),
I6 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I5),
I7 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I6),
I8 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I7),
I9 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I8),
I10 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I9),
I11 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I10),
I12 as (select /*+ materialize */ a, b, 1-(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x)))*(1-(1-(1-x*x)*(1-x*x))*(1-(1-x*x)*(1-x*x))) x from I11)
select a, b, a*x + b*(1-x) from I12

Reply

Interesting. Explanation please?

Reply
Vyacheslav Rasskazov
August 12, 2014 4:34 pm

I have not explanation at this time. Maybe gdb would help. I just remembered that worked on this topic some years ago and kept this query as example where explicit kill -9 needed.

Reply

Thanks! It would definitely be interesting.
Is this the simplest form of the query this applies to?
Just asking as this will create quite a bit of profiling output.
I guess I could just try it and see :)

Reply
Vyacheslav Rasskazov
August 12, 2014 4:51 pm

No, it’s definitely not a simplest form :) Just an example that works at least in 11.2.0.4.

I ran this query, and as you said kill() is not used on the session running this query.

gdb shows the same output for sessions that are killed with kill() as well as those are disconnected internally:

opiclo — close cursor
opiosq0 — parse
kkshini — hard parse
opiexe — execute

strace on the SYSDBA session confirms that kill() is used when the session is static, and not used when the query is running.

Next I tried running this query in the scott session:
select * from all_tab_columns

When killing the session the results were the same – kill() was not used.

Before deciding what to do in each case oracle is reading the /proc/PID/stat file.
Comparing the values for running and not running a query didn’t show any obvious differences that might trigger this behavior.

It seems that if the session is doing any work then kill() is not used. The results were the same for trying to kill while disk was being read or if data was being fetched and displayed.

So using kill -9 or OraKill may still be appropriate.

Looks like this article will require some editing.

Reply
Patrick Watza
January 16, 2015 4:43 pm

Glad to see this last post – I remember (un)fondly in my last job, a sonic message queue would die, which would leave a transaction (a “select blah_pk_id for update…” command) hanging in the DB.

We ALWAYS had to kill -9 at the OS level, as even with the IMMEDIATE clause the kill session command wouldn’t get the job done. Became very tedious to find and kill these at 2am when there were hundreds of sessions stacked up waiting for commit/rollback.

Thanks for exposing the mechanism behind the IMMEDIATE, Jared!

Reply
naresh bhandare
August 30, 2015 5:33 pm

I am not sure below is a valid SQL to get the process for v$session.sid=146.

1* select pid,spid from v$process where pid = 146

I don’t see v$process.pid to be the same as v$session.sid.

Reply

Hi Naresh,

Probably that is just a typo in the article. Sometimes is it necessary to copy and paste output and edit and format to better fit in an article.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *