Cleaning up PID files when Oracle GRID/RAC upgrades and patches fail

Unfortunately, it is not all that unusual for a patch or upgrade to Oracle Grid or RAC to fail, requiring a rollback.

Frequently, when a rollback is necessary, the rollback actions also fail.

The reason for the rollback failure is often ‘dirty’ PID files that Oracle does not clean up.

Oracle keeps a number of PID files around to keep track of the Process ID for various processes.

Following is an example from an Oracle 11gR2 RAC node:

[root@oravm01 ~]# find /u01/ -name \*.pid
/u01/app/11.2.0/grid/crf/admin/run/crfmond/soravm01.pid
/u01/app/11.2.0/grid/crf/admin/run/crflogd/loravm01.pid
/u01/app/11.2.0/grid/gipc/init/oravm01.pid
/u01/app/11.2.0/grid/mdns/init/oravm01.pid
/u01/app/11.2.0/grid/gpnp/init/oravm01.pid
/u01/app/11.2.0/grid/ctss/init/oravm01.pid
/u01/app/11.2.0/grid/ologgerd/init/oravm01.pid
/u01/app/11.2.0/grid/ohasd/init/oravm01.pid
/u01/app/11.2.0/grid/evm/init/oravm01.pid
/u01/app/11.2.0/grid/osysmond/init/oravm01.pid
/u01/app/11.2.0/grid/log/oravm01/agent/crsd/oraagent_oracle/oraagent_oracle.pid
/u01/app/11.2.0/grid/log/oravm01/agent/crsd/orarootagent_root/orarootagent_root.pid
/u01/app/11.2.0/grid/log/oravm01/agent/ohasd/oraagent_oracle/oraagent_oracle.pid
/u01/app/11.2.0/grid/log/oravm01/agent/ohasd/orarootagent_root/orarootagent_root.pid
/u01/app/11.2.0/grid/log/oravm01/gpnpd/oravm01.pid

What is in those files? Just a PID.

Here’s an example:

[root@oravm01 ~]# cat /u01/app/11.2.0/grid/crs/init/oravm01.pid
4999
[root@oravm01 ~]#
[root@oravm01 ~]# ps -p 4999 -o cmd
CMD
/u01/app/11.2.0/grid/bin/crsd.bin reboot

When a patch process fails, a number of these PID files may remain behind, even though the processes they represent are no longer running.

When an attempt is then made to roll back the patch, Oracle will not restart these processes, because it reads the PID file and believes the process is already running.
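Before changing anything, it can help to see which PID files are actually stale. The following is a minimal sketch, not part of any Oracle tooling: `pid_status` and `report_pidfiles` are hypothetical helper names, and the check uses the shell's built-in `kill -0` (signal 0 only tests for existence), which assumes the functions are run as root. For an unprivileged user, `kill -0` also fails for live processes owned by someone else.

```shell
#!/bin/bash

# pid_status (hypothetical helper): print "running" or "stale" for one PID file.
pid_status() {
   local pidfile=$1 pid
   # The file should contain nothing but the PID; strip any whitespace.
   pid=$(tr -d '[:space:]' < "$pidfile")
   # Only a positive integer can name a process. kill -0 sends no signal,
   # it just reports whether the process exists (assumes we run as root).
   if [[ $pid =~ ^[1-9][0-9]*$ ]] && kill -0 "$pid" 2> /dev/null; then
      echo "running"
   else
      echo "stale"
   fi
}

# report_pidfiles (hypothetical helper): list every *.pid file under a
# directory together with its status. Nothing is renamed or deleted.
report_pidfiles() {
   local dir=$1 pidfile
   find "$dir" -name '*.pid' -print0 |
   while IFS= read -r -d '' pidfile; do
      printf '%-8s %s\n' "$(pid_status "$pidfile")" "$pidfile"
   done
}
```

Calling `report_pidfiles /u01` would print one line per PID file, so the stale ones can be reviewed before anything is touched.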

Why Oracle does not do a cleanup of dead PID files is something of a mystery to me.

In any case, here is a small script to rename all PID files that do not have a corresponding process.

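#!/bin/bash

# chkpid.sh
# Rename every PID file under /u01 whose process is no longer running.

# -print0 / read -d '' keeps paths with spaces intact
find /u01/ -name '*.pid' -print0 | while IFS= read -r -d '' pidfile
do
   # The file holds nothing but the process ID
   pid=$(cat "$pidfile")

   # ps exits non-zero when no process has this PID
   ps -p "$pid" > /dev/null 2>&1
   ret=$?

   if [[ $ret -ne 0 ]]; then
      echo "#######################"
      echo " PID: $pid"
      echo " Pid not found for file:"
      ls -ld "$pidfile"
      # Rename rather than delete so the PID can still be checked later
      mv "$pidfile" "${pidfile}.old"
   fi
done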

This script has been used a number of times now when it became impossible to roll back a failed patch.

This doesn’t always work, but frequently it does.

Why do I not just delete the files? Because I may want to check some process IDs against log and trace files.
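As a rough sketch of that follow-up check, the renamed files can be listed with the PID each one recorded, and a log directory can be searched for that number. Both function names are hypothetical, and the search is deliberately crude: it assumes nothing about the Oracle log format and simply lists files in which the PID appears as a whole word, so unrelated numbers can match as well.

```shell
#!/bin/bash

# list_renamed_pidfiles (hypothetical helper): show each *.pid.old file
# under a directory together with the PID it recorded.
list_renamed_pidfiles() {
   local dir=$1 oldfile
   find "$dir" -name '*.pid.old' -print0 |
   while IFS= read -r -d '' oldfile; do
      printf 'PID %s was recorded in %s\n' \
         "$(tr -d '[:space:]' < "$oldfile")" "$oldfile"
   done
}

# find_pid_mentions (hypothetical helper): list files under a log
# directory that contain the PID as a whole word. This is a plain text
# search, not a parser for any particular log format.
find_pid_mentions() {
   local pid=$1 logdir=$2
   grep -rlw -- "$pid" "$logdir"
}
```

For example, `list_renamed_pidfiles /u01` followed by `find_pid_mentions 4999 /u01/app/11.2.0/grid/log/oravm01` would point at candidate log files for the PID shown earlier; the log path here is taken from the `find` listing above.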

How often does this actually happen?

I have personally seen this occur a number of times. The most recent was a two-node RAC. The script made it possible to restart one of the nodes. The other node, however, required RAC reconfiguration. Even so, it was a relief to get one node up immediately and confirm that all was OK with the database.

About the Author

Oracle experience: started with Oracle 7.0.13. Programming experience: Perl, PL/SQL, Shell, SQL, plus some other odds and ends that are no longer useful. Systems: networking, storage, OS to varying degrees. Have fond memories of DG/UX.