It started out innocently enough: a two-node RAC cluster on two RHEL 5 Linux servers, with NetApp NFS as the shared filesystem for all shared files. My favorite OS and storage, so I felt confident that the clusterware installation would be as smooth as it usually is. I told the customer this could be done in 3 hours.
What I didn’t take into account was that this was my first 11gR2 installation, and that much had changed since 11gR1. As things turned out, it took over 20 hours of my time, and a lot of help from colleagues and even former colleagues, before we had a successful installation.
The time it takes you to read this blog post (and any other on this subject) is likely to be time well spent.
Lesson #1: Silent install error messages require a secret decoder ring
If the grid silent install says:
[FATAL] [INS-40902] Missing entries in the node Information table.
It means that the oracle.install.crs.config.clusterNodes parameter has an incorrect value. Note that the correct format is node1:node1-vip,node2:node2-vip.
The correct syntax is documented in the sample response file, but figuring out which parameter is wrong from the error message is not obvious.
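For reference, here is a minimal sketch of the relevant response-file line, assuming a two-node cluster; the host names are placeholders, not the ones from this installation:

```
# Hypothetical fragment of a grid infrastructure response file
# (host and VIP names are placeholders)
oracle.install.crs.config.clusterNodes=node1:node1-vip,node2:node2-vip
```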
Lesson #2: You can install grid infrastructure first, and configure it later.
I used it as an extra troubleshooting attempt (“Maybe this way it will work!”).
Whether you install with GUI or do silent installs, you have the option to pick “software only” for your Grid Infrastructure install. Later, when you are ready to configure your clusterware, you go to $CRS_HOME/crs/config and run config.sh – this can also run as GUI or in silent mode from a response file. You can even use the same response file you use for regular installation.
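As a rough sketch, the two-step flow looks like this; the response-file path is illustrative, and the exact installer options should be checked against your 11gR2 media:

```shell
# Step 1: software-only install - lays down the binaries
# without configuring the clusterware
./runInstaller -silent -responseFile /home/oracle/grid.rsp

# Step 2: later, configure the clusterware from the installed home,
# reusing the same response file
cd $CRS_HOME/crs/config
./config.sh -silent -responseFile /home/oracle/grid.rsp
```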
Lesson #3: Grid is very picky and somewhat uninformative about its NFS support
Like an annoying girlfriend, the installer seems to say “Why should I tell you what the problem is? If you really loved me, you’d know what you did wrong!”
You need to trace the installer to find out what exactly it doesn’t like about your configuration.
Running the installer normally, the error message is:
[FATAL] [INS-41321] Invalid Oracle Cluster Registry (OCR) location.
CAUSE: The installer detects that the storage type of the location (/cmsstgdb/crs/ocr/ocr1) is not supported for Oracle Cluster Registry.
ACTION: Provide a supported storage location for the Oracle Cluster Registry.
OK, so Oracle says the storage is not supported, but I know that NetApp NFS is supported just fine. This means I used the wrong parameters for the NFS mounts. But when I check /etc/fstab and the output of mount, everything looks A-OK. Can Oracle tell me what exactly bothers it?
It can. If you run the silent install by adding the following flags to the command line:
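The flags in question are OUI’s Java tracing properties; treat the exact names below as an assumption to verify against your installer version, and the response-file path as a placeholder:

```shell
# Enable installer tracing so shared-storage checks log their reasoning
./runInstaller -silent -responseFile /home/oracle/grid.rsp \
  -J-DTRACING.ENABLED=true -J-DTRACING.LEVEL=2
```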
Then you will see the following lines that explain why Oracle does not like your storage:
[main] [ 2011-01-04 23:43:55.184 GMT+00:00 ] [TaskSharedStorageAccess.reportStorageExceptions:754] Adding exception for node [node01]:
[main] [ 2011-01-04 23:43:55.184 GMT+00:00 ] [TaskSharedStorageAccess.reportStorageExceptions:755] Exception message: Mount options did not meet the requirements [Expected = "rw,hard,rsize>=32768,wsize>=32768,proto=tcp|tcp,vers=3|nfsvers=3|nfsv3|v3,timeo>=600,acregmin=0&acregmax=0&acdirmin=0&acdirmax=0|actimeo=0" ; Found = "rw,vers=3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=300,retrans=2,sec=sys,addr=fas01b"]
This way it was much easier to see that I had timeo=300 while Oracle wanted timeo>=600.
Lesson #4: Your NFS configuration is not what you think it is.
If /etc/fstab says “timeo=600” and running “mount” shows that the volume is mounted with “timeo=600”, why does Oracle think that the volume is mounted with “timeo=300”?
Turns out that if you want to know what your real NFS configuration is, the right place to look is “/proc/mounts”. The man page for “mount” says:
It is possible that files /etc/mtab and /proc/mounts don’t match. The first file is based only on the mount command options, but the content of the second file also depends on the kernel and others settings (e.g. remote NFS server. In particular case the mount command may reports unreliable information about a NFS mount point and the /proc/mounts file usually contains more reliable information.)
Aha! So /proc/mounts shows that timeo=300, which causes installation to fail, and the man page says that this could be caused by remote NFS server settings. Perfect. The problem was packaged and sent to the customer’s sysadmin, and was solved by the next morning.
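A quick way to do this comparison is to pull the effective options straight from /proc/mounts. On a real cluster node the check would be `grep nfs /proc/mounts`; the mount line below is a made-up sample in typical /proc/mounts format, used here only for illustration:

```shell
# On a real node: grep nfs /proc/mounts
# Made-up sample line in /proc/mounts format:
line='fas01b:/vol/crs /cmsstgdb/crs nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=300,retrans=2 0 0'

# Extract the effective timeo value - this is what the installer actually sees,
# regardless of what /etc/fstab claims
echo "$line" | grep -o 'timeo=[0-9]*'
# prints: timeo=300
```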
Lesson #5: Oracle can’t always detect NTP correctly.
The installation failed because:
[performChecks.flowWorker] [ 2011-01-05 19:35:29.019 GMT+00:00 ] [TaskDaemonLiveliness.displayDaemonLivelinessOutput:283] Daemon 'ntpd' is not running on node: 'node01'
This is a legitimate error. But when I check, “ps -ef” shows that the ntpd process is running. The grid installer doesn’t use ps to check whether ntpd is running; instead it looks for /var/run/ntpd.pid.
This is bad because there is no guarantee that your pid file will have that name and location – the name and location of the pid file are configurable ntpd parameters.
In our case, it looked like someone used Windows to edit the ntpd configuration file, and the name of the file was:
/var/run/ntpd.pid? (or /var/run/ntpd.pid\r – depending on the tool you use to check).
Small enough change, but it means that the install was unable to detect ntpd and failed.
I used Perl to fix the issue:
perl -i -pe 's|\r\n|\n|' /etc/sysconfig/ntpd
but maybe a smarter solution would be:
“touch /var/run/ntpd.pid” – after all, the installation just checks that the file exists, it doesn’t really check if ntp is running!
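To see (and fix) the stray carriage return without touching the real config file, here is the same perl fix demonstrated on a throwaway copy; the scratch path and file contents are illustrative:

```shell
# Recreate the problem on a scratch file: a Windows-edited line ends in \r\n
printf 'OPTIONS="-u ntp:ntp -p /var/run/ntpd.pid"\r\n' > /tmp/ntpd.test

# cat -A makes the stray carriage return visible as ^M before the line-end $
cat -A /tmp/ntpd.test

# Strip the carriage returns in place (note -i, which edits the file itself)
perl -i -pe 's|\r\n|\n|' /tmp/ntpd.test

cat -A /tmp/ntpd.test   # the ^M is gone
```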
Lesson #6 (bonus): You need multicast!
You also need a patch if you want Oracle to use a standard multicast address.
I’ll point you to this blog which points out the issues and solutions with 11gR2 multicast requirements and explains it all much better than I can.