Recreating the Voting disk may not be as easy as written in Metalink. If you work with RAC, you know about Metalink Note 399482.1: “How to recreate OCR/Voting disk accidentally deleted”. Of course, you back up the voting disk every time you change your RAC configuration, or on a regular basis. You probably played with the procedure and it worked just fine. Like you, I did all of that.
Yesterday, I had to recreate this precious file when it was lost a couple of hours after the whole software stack had been installed. It was, I guess, just before we would have set up the monitoring on the server that would have backed up the voting disk. When recreating the voting disk, I was amazed that the
root.sh on the second node failed with the message below:
Failed to upgrade Oracle Cluster Registry configuration
But what amazed me even more is that, although several issues with the same error showed up in my Metalink search, I wasn’t able to find the fix. At least nobody wanted to share one with me. However, clusters whose nodes were not all installed together are probably common, so it’s very likely that some of you will run into this if you are asked to recreate the Voting disk file. It happens when you haven’t installed all the nodes with the first Clusterware
runInstaller command, but instead installed only some of them and then added some nodes with the
How You Can Tell This is Your Problem
If you follow the Metalink note, you’ll get an error message as soon as you run the
root.sh script on the second node of your cluster. If you investigate, you’ll find that the error happens during the
ocrconfig -upgrade call and the associated log
ocrconfig_$$.log file located in
Oracle Database 10g CRS Release 10.2.0.1.0 Production Copyright 1996, 2005 Oracle. All rights reserved.
2008-02-17 00:16:06.373: [ OCRCONF]ocrconfig starts...
2008-02-17 00:16:06.374: [ OCRCONF]Upgrading OCR data
2008-02-17 00:16:06.382: [ OCRCONF]OCR already in current version.
2008-02-17 00:16:08.073: [ OCRCONF]Failed to call clsssinit (21)
2008-02-17 00:16:08.073: [ OCRCONF]Failed to make a backup copy of OCR
2008-02-17 00:16:08.073: [ OCRCONF]Exiting [status=failed]...
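If you want to check a log for this signature without reading it by eye, a minimal shell sketch could look like the one below. The function name `has_clsssinit_failure` is hypothetical, and the log path is deliberately left to the caller; the only thing taken from the log above is the `Failed to call clsssinit` message.

```shell
# Return 0 if the given ocrconfig log contains the clsssinit failure
# shown in the log excerpt above, non-zero otherwise.
# The log file path is supplied by the caller; nothing is assumed about it.
has_clsssinit_failure() {
    log_file=$1
    grep -q 'Failed to call clsssinit' "$log_file"
}
```

Called as `has_clsssinit_failure <your ocrconfig log file>`, it only tells you that the symptom matches; the rootconfig check described next is still what confirms the cause.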
I encountered this with a 10.2.0.3 RAC on Linux, but it should happen on any operating system and with, at least, any version of 10g RAC. The reason is that the node you are running the
root.sh script on doesn’t contain information about itself. You can double-check the cause by looking at the
$ORA_CRS_HOME/install/rootconfig script on the first node. It should have the information about all the nodes in its variable settings section:
CRS_HOST_NAME_LIST=node1,1,node2,2,node3,3
CRS_NODE_NAME_LIST=node1,1,node2,2,node3,3
CRS_PRIVATE_NAME_LIST=node1-priv,1,node2-priv,2,node3-priv,3
CRS_NODELIST=node1,node2,node3
CRS_NODEVIPS='node1/node1-vip/255.255.255.0/eth0,node2/node2-vip/255.255.255.0/eth0,node3/node3-vip/255.255.255.0/eth0'
If you have run the
addNode.sh script, rootconfig contains only the information about the first node, as below:
CRS_HOST_NAME_LIST=node1,1
CRS_NODE_NAME_LIST=node1,1
CRS_PRIVATE_NAME_LIST=node1-priv
CRS_NODELIST=node1
CRS_NODEVIPS='node1/node1-vip/255.255.255.0/eth0'
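The check described above can be sketched in shell as follows. The two function names are hypothetical, and the sketch assumes that `CRS_NODELIST` sits on its own line in rootconfig as an unquoted, comma-separated list, as in the excerpts above.

```shell
# Print the value of CRS_NODELIST from a rootconfig-style file ($1).
# Assumes the variable is on its own line, unquoted, as in the excerpts above.
rootconfig_nodelist() {
    sed -n 's/^CRS_NODELIST=//p' "$1" | head -n 1
}

# Return 0 if node name $2 appears in the CRS_NODELIST of file $1.
# Commas are added around the list so that "node1" does not match "node10".
node_in_rootconfig() {
    case ",$(rootconfig_nodelist "$1")," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `node_in_rootconfig $ORA_CRS_HOME/install/rootconfig node2` run on the first node would fail in the situation described here, because only node1 is listed.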
How to Fix It
If you are extremely lucky (as I was), your cluster is a two-node cluster with one node installed/configured and the other node added. In that case, the information you need is in the
$ORA_CRS_HOME/install/rootaddnode.sh script, which can be run on the first node just after the
root.sh script has failed on the second node.
If you are just lucky, and have time, you haven’t deleted any node from the cluster and you have always used the same node to add the other nodes. In that case, you’ll have to rebuild the history of the node addition. Unfortunately, Oracle keeps updates of the
root.sh scripts but not the ones from
rootaddnode.sh. Anyway, because Oracle adds the software by copying the whole directory on the new nodes, the different versions of those scripts actually stay on the node that you’ve configured and added. You should be able to find the first
rootconfig and the different
rootaddnode.sh on the other nodes. Once you’ve rebuilt the history, run the scripts in the correct order from the first node to rebuild the Voting Disk.
If you’re not lucky, you will have lost some files.
It could also be that you don’t have time for a complete investigation. You still have several options:
- Manually rebuild a
rootconfig script that fits your needs on the first node. I didn’t actually test this, but from what I’ve seen, updating the following variables with the correct values in this script should work:
CRS_NODEVIPS. If you plan to restore the OCR from the automatic backup once the voting disk has been recreated, you’ll have to make sure you have reset the node numbers as they were. You can dump the backup of the OCR to find this information with the following command:
ocrdump -backupfile $ORA_CRS_HOME/cdata/cluster-name/backup00.ocr
- If you are reading this article in the middle of an emergency, finishing the reconfiguration of the first node and re-adding the other nodes afterwards may be the best way to end the downtime. In that case, dump the backup of the OCR with
ocrdump -backupfile $ORA_CRS_HOME/cdata/cluster-name/backup00.ocr to help you find the various resource names and configurations; or just rebuild the listener and database resources with the
srvctl commands. The first node should be up in a couple of minutes.
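For the first option in the list above, the node variable values can be generated rather than typed by hand. The following is only a sketch: the function name `print_node_variables` is hypothetical, the output format is copied from the three-node rootconfig excerpt earlier in this post, and the `-priv` and `-vip` suffixes, the 255.255.255.0 netmask and the eth0 interface are assumptions taken from that example. Like the manual approach itself, it has not been tested against a real rootconfig.

```shell
# Print rootconfig-style node variables for the node names given as arguments.
# Node numbers are assigned 1, 2, 3, ... in argument order, so pass the nodes
# in the order of their original node numbers.
# Assumptions (from the example in this post, adjust to your cluster):
# "-priv" and "-vip" name suffixes, netmask 255.255.255.0, interface eth0.
print_node_variables() {
    names="" privs="" nodes="" vips=""
    n=0
    for node in "$@"; do
        n=$((n + 1))
        names="${names:+$names,}$node,$n"
        privs="${privs:+$privs,}$node-priv,$n"
        nodes="${nodes:+$nodes,}$node"
        vips="${vips:+$vips,}$node/$node-vip/255.255.255.0/eth0"
    done
    echo "CRS_HOST_NAME_LIST=$names"
    echo "CRS_NODE_NAME_LIST=$names"
    echo "CRS_PRIVATE_NAME_LIST=$privs"
    echo "CRS_NODELIST=$nodes"
    echo "CRS_NODEVIPS='$vips'"
}
```

Running `print_node_variables node1 node2 node3` prints the five lines in the same form as the complete rootconfig shown earlier; compare them with the node numbers found in the OCR dump before pasting anything into the script.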
You probably don’t care about this post at all if you’ve backed up the Voting Disk. What I tend to do to avoid having to recover the database, and to provide a “minimum” of service, is to keep non-RAC database software with the same patch level on one or two nodes of the cluster configuration. After all, RAC is just one component of your highly-available environment, and it’s nothing without skilled people around.