Issues and workarounds for Exadata Patching (JAN-2019)

Posted in: Technical Track

Patching an Exadata machine has evolved from a painful task into a near-routine activity thanks to tools such as “patchmgr”, but that doesn’t mean it can’t go sideways. This post describes some of the most recent issues I have faced while patching a database machine.

Scenario: applying PSU 01/04/2019 (version 18.1.12.0.0.190111) on both X5-2 and X6-2 stacks, which were running 12.1.2.3.5.170418 before patching, and upgrading Grid Infrastructure (GI) from 12.1 to 18c (18.5).

1) Dependency issues due to custom RPM packages

When OS patches are installed, “dbnodeupdate.sh” may block the upgrade during the pre-requisites phase because of custom RPM package dependencies. As Fred describes in this post (https://unknowndba.blogspot.com/2018/12/exadata-patching-patchmgr-modifyatprereq.html), the ‘-modify_at_prereq’ flag is very dangerous because it changes your system during the pre-requisites step, which is usually executed days before the actual maintenance window. Be aware of this and avoid that flag at all costs.

So, how should we deal with such packages?

The same post also recommends letting “patchmgr” deal with them automatically, but that doesn’t always work, as you can see below, where “patchmgr” failed because of custom RPM packages installed on the system:

  • Pre-requisites were failing due to some dependency issues with custom RPM packages:

 

[root@exa1cel01 dbserver_patch_19.190104]# ./patchmgr -dbnodes ~/dbs_group -precheck -iso_repo /tmp/SAVE/p29181093_*_Linux-x86-64.zip -target_version 18.1.12.0.0.190111 -allow_active_network_mounts
************************************************************************************************************
NOTE patchmgr release: 19.190104 (always check MOS 1553103.1 for the latest release of dbserver.patch.zip)
NOTE
WARNING Do not interrupt the patchmgr session.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot database nodes during update or rollback.
WARNING Do not open logfiles in write mode and do not try to alter them.
************************************************************************************************************
2019-08-24 00:47:06 -0400 :Working: DO: Initiate precheck on 1 node(s)
2019-08-24 00:54:04 -0400 :Working: DO: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 00:56:31 -0400 :SUCCESS: DONE: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 00:57:32 -0400 :Working: DO: dbnodeupdate.sh running a precheck on node(s).
2019-08-24 01:00:31 -0400 :ERROR : dbnodeupdate.sh precheck failed on one or more nodes
SUMMARY OF WARNINGS AND ERRORS FOR exa1db01:
exa1db01: # The following file lists the commands that would have been executed for removing rpms when specifying -M flag. #
exa1db01: # File: /var/log/cellos/nomodify_results.240819004658.sh. #
exa1db01: ERROR: Found dependency issues during pre-check. Packages failing:
exa1db01: ERROR: Package: 1:dbus-1.2.24-8.0.1.el6_6.x86_64
exa1db01: ERROR: Package: exadata-sun-computenode-exact-18.1.12.0.0.190111-1.noarch (Fails because of required removal of Exadata rpms)
exa1db01: ERROR: Package: glib2-devel-2.28.8-9.el6.x86_64 (Custom rpm fails)
exa1db01: ERROR: Package: gnutls-devel-2.12.23-21.el6.x86_64 (Custom rpm fails)
exa1db01: ERROR: Package: oracle-ofed-release-1.0.0-31.el6.x86_64 (Fails because of required removal of Exadata rpms)
exa1db01: ERROR: Consult file exa1db01:/var/log/cellos/minimum_conflict_report.240819004658.txt for more information on the dependencies failing and for next steps.
exa1db01: The following known issues will be checked for but require manual follow-up:
exa1db01: (*) - Yum rolling update requires fix for 11768055 when Grid Infrastructure is below 11.2.0.2 BP12

2019-08-24 01:00:40 -0400 :ERROR : DONE: dbnodeupdate.sh precheck on exa1db01
2019-08-24 01:00:48 -0400 :INFO : SUMMARY FOR ALL NODES:
2019-08-24 01:00:41 -0400 :ERROR : exa1db01 has state: FAILED
2019-08-24 01:00:48 -0400 :FAILED : For details, check the following files in the /tmp/SAVE/dbserver_patch_19.190104:
2019-08-24 01:00:48 -0400 :FAILED : - <dbnode_name>_dbnodeupdate.log
2019-08-24 01:00:49 -0400 :FAILED : - patchmgr.log
2019-08-24 01:00:49 -0400 :FAILED : - patchmgr.trc
2019-08-24 01:00:49 -0400 :FAILED : DONE: Initiate precheck on node(s).
[INFO ] Collected dbnodeupdate diag in file: Diag_patchmgr_dbnode_precheck_240819004658.tbz
-rw-r--r-- 1 root root 973378 Aug 24 01:00 Diag_patchmgr_dbnode_precheck_240819004658.tbz

  • It’s common for RPM packages to have intricate dependencies amongst one another, so trying to remove the packages manually may get you something like this:
[root@exa1db01 ~]# rpm -e dbus-1.2.24-8.0.1.el6_6.x86_64
error: Failed dependencies:
	dbus = 1:1.2.24-8.0.1.el6_6 is needed by (installed) dbus-devel-1:1.2.24-8.0.1.el6_6.x86_64
	dbus >= 0.90 is needed by (installed) hal-libs-0.5.14-14.el6.x86_64
	dbus >= 0.90 is needed by (installed) ConsoleKit-libs-0.4.1-6.el6.x86_64
	dbus >= 0.90 is needed by (installed) ConsoleKit-0.4.1-6.el6.x86_64
	dbus is needed by (installed) polkit-0.96-11.el6.x86_64
	dbus is needed by (installed) GConf2-2.28.0-7.el6.x86_64
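When a removal attempt fails like this, the error text itself tells you which installed packages need attention first. The helper below is a hypothetical sketch (the function name and the parsing are mine, not part of “patchmgr” or rpm): it extracts the dependent package names from the “Failed dependencies” output of “rpm -e”, demonstrated on a stubbed copy of the error above:

```shell
# Hypothetical helper: list the installed packages that still depend on
# the rpm you are trying to remove, taken from `rpm -e` error output.
parse_failed_deps() {
  # Matching lines look like: "<capability> is needed by (installed) <package>"
  awk '/is needed by \(installed\)/ { print $NF }' | sort -u
}

# Demo on a stubbed copy of the error shown above:
deps=$(parse_failed_deps <<'EOF'
error: Failed dependencies:
  dbus = 1:1.2.24-8.0.1.el6_6 is needed by (installed) dbus-devel-1:1.2.24-8.0.1.el6_6.x86_64
  dbus >= 0.90 is needed by (installed) hal-libs-0.5.14-14.el6.x86_64
  dbus is needed by (installed) polkit-0.96-11.el6.x86_64
EOF
)
printf '%s\n' "$deps"
```

On a real node you would feed it the live error, e.g. `rpm -e dbus-1.2.24-8.0.1.el6_6.x86_64 2>&1 | parse_failed_deps`, and evaluate each listed package before removing anything.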

 

  • Executing “patchmgr” without “-force_remove_custom_rpms” always threw the same list of conflicting packages as above, so I tried the “-force_remove_custom_rpms” flag, which forced “dbnodeupdate.sh” to remove some of the conflicting packages:
[root@exa1cel01 dbserver_patch_19.190104]# nohup ./patchmgr -dbnodes ~/dbs_group -upgrade -iso_repo /tmp/SAVE/p29181093_*_Linux-x86-64.zip -target_version 18.1.12.0.0.190111 -allow_active_network_mounts -force_remove_custom_rpms -rolling &
[1] 28825
[root@exa1cel01 dbserver_patch_19.190104]# nohup: ignoring input and appending output to `nohup.out'
[root@exa1cel01 dbserver_patch_19.190104]# tail -f nohup.out
NOTE patchmgr release: 19.190104 (always check MOS 1553103.1 for the latest release of dbserver.patch.zip)
NOTE
NOTE Database nodes will reboot during the update process.
NOTE
WARNING Do not interrupt the patchmgr session.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot database nodes during update or rollback.
WARNING Do not open logfiles in write mode and do not try to alter them.
************************************************************************************************************
2019-08-24 01:01:26 -0400 :Working: DO: Initiate prepare steps on node(s).
2019-08-24 01:01:39 -0400 :Working: DO: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 01:02:26 -0400 :SUCCESS: DONE: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 01:03:55 -0400 :SUCCESS: DONE: Initiate prepare steps on node(s).
2019-08-24 01:03:55 -0400 :Working: DO: Initiate update on 1 node(s).
2019-08-24 01:03:55 -0400 :Working: DO: dbnodeupdate.sh running a backup on 1 node(s).
2019-08-24 01:10:00 -0400 :SUCCESS: DONE: dbnodeupdate.sh running a backup on 1 node(s).
2019-08-24 01:10:00 -0400 :Working: DO: Initiate update on exa1db01
2019-08-24 01:10:00 -0400 :Working: DO: Get information about any required OS upgrades from exa1db01.
2019-08-24 01:10:11 -0400 :SUCCESS: DONE: Get information about any required OS upgrades from exa1db01.
2019-08-24 01:10:16 -0400 :Working: DO: dbnodeupdate.sh running an update step on exa1db01.
SUMMARY OF ERRORS FOR exa1db01:
exa1db01: ERROR: Preventive abort of update due to dependency issues. Packages failing:
2019-08-24 01:13:16 -0400 :ERROR : DONE: dbnodeupdate.sh running an update step on exa1db01
2019-08-24 01:13:28 -0400 :FAILED : For details, check the following files in the /tmp/SAVE/dbserver_patch_19.190104:
2019-08-24 01:13:28 -0400 :FAILED : - <dbnode_name>_dbnodeupdate.log
2019-08-24 01:13:28 -0400 :FAILED : - patchmgr.log
2019-08-24 01:13:28 -0400 :FAILED : - patchmgr.trc
2019-08-24 01:13:28 -0400 :FAILED : DONE: Initiate update on exa1db01.
[INFO ] Collected dbnodeupdate diag in file: Diag_patchmgr_dbnode_upgrade_240819010124.tbz
-rw-r--r-- 1 root root 2595992 Aug 24 01:13 Diag_patchmgr_dbnode_upgrade_240819010124.tbz


[root@exa1cel01 dbserver_patch_19.190104]# ./patchmgr -dbnodes ~/dbs_group -precheck -iso_repo /tmp/SAVE/p29181093_*_Linux-x86-64.zip -target_version 18.1.12.0.0.190111 -allow_active_network_mounts
************************************************************************************************************
NOTE patchmgr release: 19.190104 (always check MOS 1553103.1 for the latest release of dbserver.patch.zip)
NOTE
WARNING Do not interrupt the patchmgr session.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot database nodes during update or rollback.
WARNING Do not open logfiles in write mode and do not try to alter them.
************************************************************************************************************
2019-08-24 01:15:54 -0400 :Working: DO: Initiate precheck on 1 node(s)
2019-08-24 01:22:52 -0400 :Working: DO: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 01:25:19 -0400 :SUCCESS: DONE: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 01:26:20 -0400 :Working: DO: dbnodeupdate.sh running a precheck on node(s).
2019-08-24 01:28:52 -0400 :ERROR : dbnodeupdate.sh precheck failed on one or more nodes
SUMMARY OF WARNINGS AND ERRORS FOR exa1db01:
exa1db01: # The following file lists the commands that would have been executed for removing rpms when specifying -M flag. #
exa1db01: # File: /var/log/cellos/nomodify_results.240819011546.sh. #
exa1db01: ERROR: Found dependency issues during pre-check. Packages failing:
exa1db01: ERROR: Package: 1:dbus-1.2.24-8.0.1.el6_6.x86_64
exa1db01: ERROR: Consult file exa1db01:/var/log/cellos/minimum_conflict_report.240819011546.txt for more information on the dependencies failing and for next steps.
exa1db01: The following known issues will be checked for but require manual follow-up:
exa1db01: (*) - Yum rolling update requires fix for 11768055 when Grid Infrastructure is below 11.2.0.2 BP12

2019-08-24 01:28:59 -0400 :ERROR : DONE: dbnodeupdate.sh precheck on exa1db01
2019-08-24 01:29:11 -0400 :INFO : SUMMARY FOR ALL NODES:
2019-08-24 01:28:59 -0400 :ERROR : exa1db01 has state: FAILED
2019-08-24 01:29:11 -0400 :FAILED : For details, check the following files in the /tmp/SAVE/dbserver_patch_19.190104:
2019-08-24 01:29:11 -0400 :FAILED : - <dbnode_name>_dbnodeupdate.log
2019-08-24 01:29:11 -0400 :FAILED : - patchmgr.log
2019-08-24 01:29:11 -0400 :FAILED : - patchmgr.trc
2019-08-24 01:29:11 -0400 :FAILED : DONE: Initiate precheck on node(s).
[INFO ] Collected dbnodeupdate diag in file: Diag_patchmgr_dbnode_precheck_240819011546.tbz
-rw-r--r-- 1 root root 2850123 Aug 24 01:29 Diag_patchmgr_dbnode_precheck_240819011546.tbz

 

  • Using the “-force_remove_custom_rpms” flag didn’t fix the issue right away, but this time only three custom RPM packages remained, “dbus”, “pm-utils”, and “hal”, which could easily be removed as follows:
[root@exa1db01 ~]# rpm -e dbus-1.2.24-8.0.1.el6_6.x86_64 pm-utils-1.2.5-11.el6.x86_64 hal-0.5.14-14.el6.x86_64
Stopping system message bus: [ OK ]

Note: I tried executing “patchmgr” with the “-force_remove_custom_rpms” flag multiple times, but these three packages were never removed by “patchmgr”, so I had to do it manually, as described above.

 

  • Ran pre-requisites check again and kicked off the patch, which completed successfully this time:
[root@exa1cel01 dbserver_patch_19.190104]# ./patchmgr -dbnodes ~/dbs_group -precheck -iso_repo /tmp/SAVE/p29181093_*_Linux-x86-64.zip -target_version 18.1.12.0.0.190111 -allow_active_network_mounts
************************************************************************************************************
NOTE patchmgr release: 19.190104 (always check MOS 1553103.1 for the latest release of dbserver.patch.zip)
NOTE
WARNING Do not interrupt the patchmgr session.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot database nodes during update or rollback.
WARNING Do not open logfiles in write mode and do not try to alter them.
************************************************************************************************************
2019-08-24 01:53:48 -0400 :Working: DO: Initiate precheck on 1 node(s)
2019-08-24 02:00:50 -0400 :Working: DO: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 02:03:17 -0400 :SUCCESS: DONE: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 02:04:18 -0400 :Working: DO: dbnodeupdate.sh running a precheck on node(s).
2019-08-24 02:06:57 -0400 :SUCCESS: DONE: Initiate precheck on node(s).

[root@exa1cel01 dbserver_patch_19.190104]# nohup ./patchmgr -dbnodes ~/dbs_group -upgrade -iso_repo /tmp/SAVE/p29181093_*_Linux-x86-64.zip -target_version 18.1.12.0.0.190111 -allow_active_network_mounts -rolling &

[root@exa1cel01 dbserver_patch_19.190104]# tail -f nohup.out
NOTE patchmgr release: 19.190104 (always check MOS 1553103.1 for the latest release of dbserver.patch.zip)
NOTE
NOTE Database nodes will reboot during the update process.
NOTE
WARNING Do not interrupt the patchmgr session.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot database nodes during update or rollback.
WARNING Do not open logfiles in write mode and do not try to alter them.
************************************************************************************************************
2019-08-24 02:07:39 -0400 :Working: DO: Initiate prepare steps on node(s).
2019-08-24 02:07:52 -0400 :Working: DO: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 02:08:39 -0400 :SUCCESS: DONE: Check free space and verify SSH equivalence for the root user to exa1db01
2019-08-24 02:10:08 -0400 :SUCCESS: DONE: Initiate prepare steps on node(s).
2019-08-24 02:10:08 -0400 :Working: DO: Initiate update on 1 node(s).
2019-08-24 02:10:08 -0400 :Working: DO: dbnodeupdate.sh running a backup on 1 node(s).
2019-08-24 02:16:04 -0400 :SUCCESS: DONE: dbnodeupdate.sh running a backup on 1 node(s).
2019-08-24 02:16:04 -0400 :Working: DO: Initiate update on exa1db01
2019-08-24 02:16:04 -0400 :Working: DO: Get information about any required OS upgrades from exa1db01.
2019-08-24 02:16:14 -0400 :SUCCESS: DONE: Get information about any required OS upgrades from exa1db01.
2019-08-24 02:16:19 -0400 :Working: DO: dbnodeupdate.sh running an update step on exa1db01.
2019-08-24 02:28:13 -0400 :INFO : exa1db01 is ready to reboot.
2019-08-24 02:28:13 -0400 :SUCCESS: DONE: dbnodeupdate.sh running an update step on exa1db01.
2019-08-24 02:28:29 -0400 :Working: DO: Initiate reboot on exa1db01.
2019-08-24 02:28:56 -0400 :SUCCESS: DONE: Initiate reboot on exa1db01.
2019-08-24 02:28:56 -0400 :Working: DO: Waiting to ensure exa1db01 is down before reboot.
2019-08-24 02:30:08 -0400 :SUCCESS: DONE: Waiting to ensure exa1db01 is down before reboot.
2019-08-24 02:30:08 -0400 :Working: DO: Waiting to ensure exa1db01 is up after reboot.
2019-08-24 02:36:44 -0400 :SUCCESS: DONE: Waiting to ensure exa1db01 is up after reboot.
2019-08-24 02:36:44 -0400 :Working: DO: Waiting to connect to exa1db01 with SSH. During Linux upgrades this can take some time.
2019-08-24 02:57:50 -0400 :SUCCESS: DONE: Waiting to connect to exa1db01 with SSH. During Linux upgrades this can take some time.
2019-08-24 02:57:50 -0400 :Working: DO: Wait for exa1db01 is ready for the completion step of update.
2019-08-24 02:59:02 -0400 :SUCCESS: DONE: Wait for exa1db01 is ready for the completion step of update.
2019-08-24 02:59:08 -0400 :Working: DO: Initiate completion step from dbnodeupdate.sh on exa1db01
2019-08-24 03:10:05 -0400 :SUCCESS: DONE: Initiate completion step from dbnodeupdate.sh on exa1db01.
2019-08-24 03:10:49 -0400 :SUCCESS: DONE: Initiate update on exa1db01.
2019-08-24 03:10:55 -0400 :SUCCESS: DONE: Initiate update on 0 node(s)

 

Note that removing these packages may cause some features to stop working. In my case, the only problem was with LDAP authentication. To fix it, those packages had to be reinstalled and reconfigured after the patch, so it is highly advisable to review the whole list of custom RPM packages during the pre-requisites phase and identify any features that might break during the Exadata patching.
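One way to do that review systematically is to snapshot the installed package list before the maintenance window and diff it afterwards, so you know exactly which rpms were removed and therefore what needs to be reinstalled and reconfigured. A minimal sketch with stubbed data follows; on a real node the two files would come from “rpm -qa | sort” before and after patching, and the file names are mine:

```shell
# Sketch of the inventory step, using stubbed data instead of `rpm -qa`.
workdir=$(mktemp -d)

# What `rpm -qa | sort` might have shown before patching (sample data):
sort > "$workdir/rpms.before" <<'EOF'
glib2-devel-2.28.8-9.el6.x86_64
hal-0.5.14-14.el6.x86_64
nss-pam-ldapd-0.7.5-32.el6.x86_64
EOF

# ... and after patching (sample data):
sort > "$workdir/rpms.after" <<'EOF'
glib2-devel-2.28.8-9.el6.x86_64
EOF

# Packages present before but not after: candidates for reinstallation.
removed=$(comm -23 "$workdir/rpms.before" "$workdir/rpms.after")
printf '%s\n' "$removed"
```

With the real lists, anything LDAP-, monitoring-, or backup-related in the “removed” output is a candidate for the kind of post-patch breakage described above.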

Another method to deal with these packages is to manually remove each one and its dependencies.

2) ASM not starting automatically during clusterware startup:

While applying the same patch described above, the clusterware failed to start automatically, which caused the patching operation to fail. After troubleshooting, I identified that the cluster could not start because ASM wasn’t coming up when it was supposed to; bringing ASM up manually made the whole cluster start successfully. I decided to use this as a workaround to complete the OS patching, especially since we had spent hours troubleshooting the startup issue without being able to fix it, and since GI was going to be upgraded during the same maintenance window anyway. As expected, the upgrade fixed the ASM startup issue.

Despite all of that troubleshooting, I have not been able to pinpoint the root cause of this issue to this day.

  • Started the OS patching from a cell node, so the “patchmgr” session would not be lost when the database node rebooted:
[root@exa1cel01 dbserver_patch_19.190104]# cat ~/dbs_group
exa1db03
[root@exa1cel01 dbserver_patch_19.190104]# nohup ./patchmgr -dbnodes ~/dbs_group -upgrade -iso_repo /tmp/SAVE/p29181093_*_Linux-x86-64.zip -target_version 18.1.12.0.0.190111 -allow_active_network_mounts -rolling &
[root@exa1cel01 dbserver_patch_19.190104]# tail -f nohup.out
NOTE patchmgr release: 19.190104 (always check MOS 1553103.1 for the latest release of dbserver.patch.zip)
NOTE
NOTE Database nodes will reboot during the update process.
NOTE
WARNING Do not interrupt the patchmgr session.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot database nodes during update or rollback.
WARNING Do not open logfiles in write mode and do not try to alter them.
************************************************************************************************************
2019-08-25 05:42:15 -0400 :Working: DO: Initiate prepare steps on node(s).
2019-08-25 05:42:29 -0400 :Working: DO: Check free space and verify SSH equivalence for the root user to exa1db03
2019-08-25 05:43:16 -0400 :SUCCESS: DONE: Check free space and verify SSH equivalence for the root user to exa1db03
2019-08-25 05:44:44 -0400 :SUCCESS: DONE: Initiate prepare steps on node(s).
2019-08-25 05:44:44 -0400 :Working: DO: Initiate update on 1 node(s).
2019-08-25 05:44:44 -0400 :Working: DO: dbnodeupdate.sh running a backup on 1 node(s).
2019-08-25 05:49:38 -0400 :SUCCESS: DONE: dbnodeupdate.sh running a backup on 1 node(s).
2019-08-25 05:49:38 -0400 :Working: DO: Initiate update on exa1db03
2019-08-25 05:49:39 -0400 :Working: DO: Get information about any required OS upgrades from exa1db03.
2019-08-25 05:49:49 -0400 :SUCCESS: DONE: Get information about any required OS upgrades from exa1db03.
2019-08-25 05:49:54 -0400 :Working: DO: dbnodeupdate.sh running an update step on exa1db03.
2019-08-25 06:02:10 -0400 :INFO : exa1db03 is ready to reboot.
2019-08-25 06:02:10 -0400 :SUCCESS: DONE: dbnodeupdate.sh running an update step on exa1db03.
2019-08-25 06:02:22 -0400 :Working: DO: Initiate reboot on exa1db03.
2019-08-25 06:02:48 -0400 :SUCCESS: DONE: Initiate reboot on exa1db03.
2019-08-25 06:02:48 -0400 :Working: DO: Waiting to ensure exa1db03 is down before reboot.
2019-08-25 06:04:10 -0400 :SUCCESS: DONE: Waiting to ensure exa1db03 is down before reboot.
2019-08-25 06:04:10 -0400 :Working: DO: Waiting to ensure exa1db03 is up after reboot.
2019-08-25 06:10:44 -0400 :SUCCESS: DONE: Waiting to ensure exa1db03 is up after reboot.
2019-08-25 06:10:44 -0400 :Working: DO: Waiting to connect to exa1db03 with SSH. During Linux upgrades this can take some time.
2019-08-25 06:31:06 -0400 :SUCCESS: DONE: Waiting to connect to exa1db03 with SSH. During Linux upgrades this can take some time.
2019-08-25 06:31:06 -0400 :Working: DO: Wait for exa1db03 is ready for the completion step of update.
2019-08-25 06:32:10 -0400 :SUCCESS: DONE: Wait for exa1db03 is ready for the completion step of update.
2019-08-25 06:32:16 -0400 :Working: DO: Initiate completion step from dbnodeupdate.sh on exa1db03
SUMMARY OF ERRORS FOR exa1db03:
2019-08-25 06:58:43 -0400 :ERROR : There was an error during the completion step on exa1db03.
2019-08-25 06:58:43 -0400 :ERROR : Please correct the error and run "/u01/dbnodeupdate.patchmgr/dbnodeupdate.sh -c" on exa1db03 to complete the update.
2019-08-25 06:58:43 -0400 :ERROR : The dbnodeupdate.log and diag files can help to find the root cause.
2019-08-25 06:58:43 -0400 :ERROR : DONE: Initiate completion step from dbnodeupdate.sh on exa1db03
2019-08-25 06:58:57 -0400 :FAILED : For details, check the following files in the /tmp/SAVE/dbserver_patch_19.190104:
2019-08-25 06:58:57 -0400 :FAILED : - <dbnode_name>_dbnodeupdate.log
2019-08-25 06:58:57 -0400 :FAILED : - patchmgr.log
2019-08-25 06:58:57 -0400 :FAILED : - patchmgr.trc
2019-08-25 06:58:57 -0400 :FAILED : DONE: Initiate update on exa1db03.
[INFO ] Collected dbnodeupdate diag in file: Diag_patchmgr_dbnode_upgrade_250819054214.tbz
-rw-r--r-- 1 root root 4027007 Aug 25 06:58 Diag_patchmgr_dbnode_upgrade_250819054214.tbz

 

  • The OS patching failed as seen above. After troubleshooting, I identified that the completion step failed because the cluster could not be started. While attempting to start it manually, I noticed that ASM wasn’t started automatically, and bringing it up manually made the whole cluster start successfully.
  • Started the cluster
[root@exa1db03 pythian]# $ORACLE_HOME/bin/crsctl start crs

 

  • Waited a few minutes and started ASM manually
[grid@exa1db03 ~]$ sqlplus / as sysasm
SQL*Plus: Release 12.1.0.2.0 Production on Sun Aug 25 07:21:39 2019
Copyright (c) 1982, 2014, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ASM instance started
Total System Global Area 3221225472 bytes
Fixed Size 2929552 bytes
Variable Size 3184741488 bytes
ASM Cache 33554432 bytes
ASM diskgroups mounted
SQL>exit

 

  • Checked if the cluster was started on the local node
[root@exa1db03 pythian]# $ORACLE_HOME/bin/crsctl stat res -t

 

  • Executed “dbnodeupdate.sh” with the “-c” flag to run the completion step and the “-s” flag to shut down the stack automatically first:
[root@exa1db03 pythian]# /u01/dbnodeupdate.patchmgr/dbnodeupdate.sh -c -s
(*) 2019-08-25 07:12:50: Initializing logfile /var/log/cellos/dbnodeupdate.log
##########################################################################################################################
# #
# Guidelines for using dbnodeupdate.sh (rel. 19.190104): #
# #
# - Prerequisites for usage: #
# 1. Refer to dbnodeupdate.sh options. See MOS 1553103.1 #
# 2. Always use the latest release of dbnodeupdate.sh. See patch 21634633 #
# 3. Run the prereq check using the '-v' flag. #
# 4. Run the prereq check with the '-M' to allow rpms being removed and preupdated to make precheck work. #
# #
# I.e.: ./dbnodeupdate.sh -u -l /u01/my-iso-repo.zip -v (may see rpm conflicts) #
# ./dbnodeupdate.sh -u -l http://my-yum-repo -v -M (resolved known rpm comflicts) #
# #
# - Prerequisite rpm dependency check failures can happen due to customization: #
# - The prereq check detects dependency issues that need to be addressed prior to running a successful update. #
# - Customized rpm packages may fail the built-in dependency check and system updates cannot proceed until resolved. #
# - Prereq check may fail because -M flag was not used and known conflicting rpms were not removed. #
# #
# When upgrading to releases 11.2.3.3.0 or later: #
# - When 'exact' package dependency check fails 'minimum' package dependency check will be tried. #
# - When 'minimum' package dependency check fails, conflicting packages should be removed before proceeding. #
# #
# - As part of the prereq checks and as part of the update, a number of rpms will be removed. #
# This removal is required to preserve Exadata functioning. This should not be confused with obsolete packages. #
# Running without -M at prereq time may result in a Yum dependency prereq checks fail #
# #
# - In case of any problem when filing an SR, upload the following: #
# - /var/log/cellos/dbnodeupdate.log #
# - /var/log/cellos/dbnodeupdate.<runid>.diag #
# - where <runid> is the unique number of the failing run. #
# #
# #
##########################################################################################################################
Continue ? [y/n] y
(*) 2019-08-25 07:13:09: Unzipping helpers (/u01/dbnodeupdate.patchmgr/dbupdate-helpers.zip) to /opt/oracle.SupportTools/dbnodeupdate_helpers
(*) 2019-08-25 07:13:12: Collecting system configuration settings. This may take a while...
Active Image version : 18.1.12.0.0.190111
Active Kernel version : 4.1.12-94.8.10.el6uek
Active LVM Name : /dev/mapper/VGExaDb-LVDbSys1
Inactive Image version : 12.1.2.3.5.170418
Inactive LVM Name : /dev/mapper/VGExaDb-LVDbSys2
Current user id : root
Action : finish-post (validate image status, fix known issues, cleanup, relink and enable crs to auto-start)
Shutdown stack : Yes (Currently stack is up)
Logfile : /var/log/cellos/dbnodeupdate.log (runid: 250819071250)
Diagfile : /var/log/cellos/dbnodeupdate.250819071250.diag
Server model : ORACLE SERVER X6-2
dbnodeupdate.sh rel. : 19.190104 (always check MOS 1553103.1 for the latest release of dbnodeupdate.sh)

The following known issues will be checked for but require manual follow-up:
(*) - Yum rolling update requires fix for 11768055 when Grid Infrastructure is below 11.2.0.2 BP12

Continue ? [y/n] y
(*) 2019-08-25 07:15:43: Verifying GI and DB's are shutdown
(*) 2019-08-25 07:15:44: Shutting down GI and db
(*) 2019-08-25 07:16:20: No rpms to remove
(*) 2019-08-25 07:16:25: EM agent in /oracle/ocagent/agent_13.2.0.0.0 stopped
(*) 2019-08-25 07:16:25: Relinking all homes
(*) 2019-08-25 07:16:25: Unlocking /u01/app/12.1.0.2/grid
(*) 2019-08-25 07:16:38: Relinking /oracle/product/12.1.0.2 as orabpmp (with rds option)
(*) 2019-08-25 07:19:24: Relinking /u01/app/12.1.0.2/grid as grid (with rds option)
(*) 2019-08-25 07:19:33: Relinking /u01/app/oracle/product/12.1.0.2/dbhome_1 as oracle (with rds option)
(*) 2019-08-25 07:19:47: Locking and starting Grid Infrastructure (/u01/app/12.1.0.2/grid)
(*) 2019-08-25 07:22:17: Sleeping another 60 seconds while stack is starting (1/15)
(*) 2019-08-25 07:23:17: Sleeping another 60 seconds while stack is starting (2/15)
(*) 2019-08-25 07:23:17: Stack started
(*) 2019-08-25 07:23:48: TFA Started
(*) 2019-08-25 07:23:48: Enabling stack to start at reboot. Disable this when the stack should not be starting on a next boot
(*) 2019-08-25 07:24:00: EM agent in /oracle/ocagent/agent_13.2.0.0.0 started
(*) 2019-08-25 07:24:01: Purging any extra jdk packages.
(*) 2019-08-25 07:24:01: No jdk package cleanup needed. Retained jdk package installed: jdk1.8-1.8.0_191.x86_64
(*) 2019-08-25 07:24:01: Retained the required kernel-transition package: kernel-transition-2.6.32-0.0.0.3.el6
(*) 2019-08-25 07:24:01: Removed obsolete package: kernel-uek-firmware-2.6.39-400.294.4.el6uek.noarch
(*) 2019-08-25 07:24:16: Capturing service status and file attributes. This may take a while...
(*) 2019-08-25 07:24:16: Service status and file attribute report in: /etc/exadata/reports
(*) 2019-08-25 07:24:16: All post steps are finished.
[root@exa1db03 pythian]#

 

  • During the completion step, the “dbnodeupdate.sh” script stops the local clusterware and attempts to restart it. Since ASM was still not starting automatically, that restart would have failed as well. To work around it, as soon as “dbnodeupdate.sh” tried to start CRS, I logged in to a different console and started ASM manually, after seeing the messages below:
(*) 2019-08-25 07:19:33: Relinking /u01/app/oracle/product/12.1.0.2/dbhome_1 as oracle (with rds option)
(*) 2019-08-25 07:19:47: Locking and starting Grid Infrastructure (/u01/app/12.1.0.2/grid)
(*) 2019-08-25 07:22:17: Sleeping another 60 seconds while stack is starting (1/15)
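Since the window between “Locking and starting Grid Infrastructure” and the startup timeout is what matters here, the second console can poll the logfile for that message instead of watching it by eye. This is a hypothetical sketch (the function and variable names are mine), demonstrated against a stub logfile; on the real node the file would be /var/log/cellos/dbnodeupdate.log, and the next step would be starting ASM manually as shown below:

```shell
# Hypothetical helper: poll a logfile until a marker line appears, so the
# manual ASM startup can be triggered at the right moment.
wait_for_marker() {
  file=$1; pattern=$2; tries=$3
  i=0
  while [ "$i" -lt "$tries" ]; do
    if grep -q "$pattern" "$file" 2>/dev/null; then
      echo "marker found: $pattern"
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "marker not found after $tries tries"
  return 1
}

# Stubbed demo: the marker is already in the file, so it is found at once.
log=$(mktemp)
echo '(*) 2019-08-25 07:19:47: Locking and starting Grid Infrastructure (/u01/app/12.1.0.2/grid)' > "$log"
msg=$(wait_for_marker "$log" 'Locking and starting Grid Infrastructure' 5)
echo "$msg"
```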

  • Logged into a different console for the same node, checked CRS processes, and started ASM manually:
[root@exa1db03 ~]# ps -ef | grep grid
root 163575 81361 0 23:19 ? 00:00:00 /bin/sh /u01/app/12.1.0.2/grid/crs/install/rootcrs.sh -patch
root 173706 163577 0 23:19 ? 00:00:00 /u01/app/12.1.0.2/grid/bin/crsctl.bin start crs -wait
root 173710 1 2 23:19 ? 00:00:07 /u01/app/12.1.0.2/grid/bin/ohasd.bin reboot _ORA_BLOCKING_STACK_LOCALE=AMERICAN_AMERICA.AL32UTF8
root 174046 1 5 23:20 ? 00:00:14 /u01/app/12.1.0.2/grid/bin/orarootagent.bin
grid 174110 1 0 23:20 ? 00:00:01 /u01/app/12.1.0.2/grid/bin/oraagent.bin
grid 174123 1 0 23:20 ? 00:00:00 /u01/app/12.1.0.2/grid/bin/mdnsd.bin
grid 174125 1 0 23:20 ? 00:00:01 /u01/app/12.1.0.2/grid/bin/evmd.bin
grid 174144 1 0 23:20 ? 00:00:00 /u01/app/12.1.0.2/grid/bin/gpnpd.bin
grid 174170 1 1 23:20 ? 00:00:02 /u01/app/12.1.0.2/grid/bin/gipcd.bin
root 174229 1 0 23:20 ? 00:00:00 /u01/app/12.1.0.2/grid/bin/cssdmonitor
root 174245 1 0 23:20 ? 00:00:00 /u01/app/12.1.0.2/grid/bin/cssdagent
grid 174247 1 1 23:20 ? 00:00:05 /u01/app/12.1.0.2/grid/bin/diskmon -d -f
grid 174269 1 5 23:20 ? 00:00:15 /u01/app/12.1.0.2/grid/bin/ocssd.bin
root 174928 1 0 23:20 ? 00:00:01 /u01/app/12.1.0.2/grid/bin/octssd.bin reboot
root 190692 111542 0 23:24 pts/0 00:00:00 grep grid

[root@exa1db03 ~]# su - grid


[grid@exa1db03 ~]$ . oraenv
ORACLE_SID = [grid] ? +ASM3
The Oracle base has been set to /u01/app/grid


[grid@exa1db03 ~]$ sqlplus / as sysasm
SQL*Plus: Release 12.1.0.2.0 Production on Sun Aug 25 23:25:49 2019
Copyright (c) 1982, 2014, Oracle. All rights reserved.
Connected to an idle instance.


SQL> startup
ASM instance started
Total System Global Area 3221225472 bytes
Fixed Size 2929552 bytes
Variable Size 3184741488 bytes
ASM Cache 33554432 bytes
ASM diskgroups mounted
SQL> exit
Disconnected from Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

 

  • After ASM was started manually, the whole clusterware stack came up; the “dbnodeupdate.sh” script detected it and executed the completion step successfully:
(*) 2019-08-25 07:23:17: Sleeping another 60 seconds while stack is starting (2/15)
(*) 2019-08-25 07:23:17: Stack started
(*) 2019-08-25 07:23:48: TFA Started
(*) 2019-08-25 07:23:48: Enabling stack to start at reboot. Disable this when the stack should not be starting on a next boot
(*) 2019-08-25 07:24:00: EM agent in /oracle/ocagent/agent_13.2.0.0.0 started
(*) 2019-08-25 07:24:01: Purging any extra jdk packages.
(*) 2019-08-25 07:24:01: No jdk package cleanup needed. Retained jdk package installed: jdk1.8-1.8.0_191.x86_64
(*) 2019-08-25 07:24:01: Retained the required kernel-transition package: kernel-transition-2.6.32-0.0.0.3.el6
(*) 2019-08-25 07:24:01: Removed obsolete package: kernel-uek-firmware-2.6.39-400.294.4.el6uek.noarch
(*) 2019-08-25 07:24:16: Capturing service status and file attributes. This may take a while...
(*) 2019-08-25 07:24:16: Service status and file attribute report in: /etc/exadata/reports
(*) 2019-08-25 07:24:16: All post steps are finished

3) Failure of the “rootupgrade.sh” script during the GI upgrade to 18c (error: “kfod op=cellconfig Died at crsutils.pm line 15183”):

 

  • During the GI upgrade from 12.1 to 18c (18.5), the “rootupgrade.sh” script failed with the following error:
2019/08/25 23:40:25 CLSRSC-595: Executing upgrade step 4 of 19: 'GenSiteGUIDs'.
2019/08/25 23:40:25 CLSRSC-180: An error occurred while executing the command '/u01/app/18.1.0.0/grid/bin/kfod op=cellconfig'
Died at /u01/app/18.1.0.0/grid/crs/install/crsutils.pm line 15183.

 

  • The log files indicated the following:
 >  CLSRSC-595: Executing upgrade step 4 of 19: 'GenSiteGUIDs'.
>End Command output
2019-08-25 23:40:25: CLSRSC-595: Executing upgrade step 4 of 19: 'GenSiteGUIDs'.
2019-08-25 23:40:25: Site name for Cluster: exa1-cluster
2019-08-25 23:40:25: It is non-extended cluster. Get node list from NODE_NAME_LIST, and site from cluster name.
2019-08-25 23:40:25: NODE_NAME_LIST: exa1db01,exa1db02,exa1db03,exa1db04
2019-08-25 23:40:25: The site for node exa1db01 is: exa1-cluster
2019-08-25 23:40:25: The site for node exa1db02 is: exa1-cluster
2019-08-25 23:40:25: The site for node exa1db03 is: exa1-cluster
2019-08-25 23:40:25: The site for node exa1db04 is: exa1-cluster
2019-08-25 23:40:25: leftVersion=12.1.0.2.0; rightVersion=12.2.0.0.0
2019-08-25 23:40:25: [12.1.0.2.0] is lower than [12.2.0.0.0]
2019-08-25 23:40:25: ORACLE_HOME = /u01/app/18.1.0.0/grid
2019-08-25 23:40:25: Running as user grid: /u01/app/18.1.0.0/grid/bin/kfod op=cellconfig
2019-08-25 23:40:25: Removing file /tmp/XD5jvGA_2v
2019-08-25 23:40:25: Successfully removed file: /tmp/XD5jvGA_2v
2019-08-25 23:40:25: pipe exit code: 256
2019-08-25 23:40:25: /bin/su exited with rc=1

2019-08-25 23:40:25: kfod op=cellconfig rc: 1
2019-08-25 23:40:25: execute 'kfod op=cellconfig' failed with error: Error 49802 initializing ADR
 ERROR!!! could not initialize the diag context

2019-08-25 23:40:25: Executing cmd: /u01/app/18.1.0.0/grid/bin/clsecho -p has -f clsrsc -m 180 '/u01/app/18.1.0.0/grid/bin/kfod op=cellconfig'
2019-08-25 23:40:25: Executing cmd: /u01/app/18.1.0.0/grid/bin/clsecho -p has -f clsrsc -m 180 '/u01/app/18.1.0.0/grid/bin/kfod op=cellconfig'
2019-08-25 23:40:25: Command output:
>  CLSRSC-180: An error occurred while executing the command '/u01/app/18.1.0.0/grid/bin/kfod op=cellconfig'

 

  • I reviewed some documents at MOS and found “Clusterware upgrade failed with CLSRSC-180 (An error occurred while executing the command ‘/bin/kfod op=cellconfig’) (Doc ID 2418540.1)”, which provided some steps to troubleshoot the issue:
[grid@exa1db04 pythian]$ /u01/app/18.1.0.0/grid/bin/kfod op=cellconfig
Error 49802 initializing ADR
ERROR!!! could not initialize the diag context


[grid@exa1db04 pythian]$ strace /u01/app/18.1.0.0/grid/bin/kfod
execve("/u01/app/18.1.0.0/grid/bin/kfod", ["/u01/app/18.1.0.0/grid/bin/kfod"], [/* 21 vars */]) = 0
brk(0)                                  = 0x15b9000

....

access("/u01/app/grid/crsdata/debug", W_OK) = 0
stat("/u01/app/grid/crsdata/debug/kfod_trace_needed.txt", 0x7ffecd968190) = -1 ENOENT (No such file or directory)

 

  • This indicated possible permission issues in the diagnostic directories:

Note: Fixing only the directories indicated by the “strace” output above did not resolve the issue, so we had to dig through all of the ADR directories to find which ones had incorrect permissions. These directories are listed below:

/u01/app/grid/crsdata/debug/kfod*
/u01/app/grid/diag/kfod/exa1db04/kfod/log/*
/u01/app/18.1.0.0/grid/log/diag/
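Hunting for the bad entries by hand is tedious; a small scan like the following can help. This is a generic sketch — the “grid” owner and the directory list are assumptions taken from this environment, and the writability check is a simplification of what the GI tools actually require:

```shell
#!/bin/sh
# Hedged sketch: report files that are not owned by the expected user,
# or not writable by their owner, under the given directories.
scan_perms() {
  owner=$1; shift
  for d in "$@"; do
    [ -d "$d" ] || continue
    # Print anything failing either the ownership or the owner-write test
    find "$d" \( ! -user "$owner" -o ! -perm -u+w \) -print
  done
}

# The directories that held the bad permissions in this case:
scan_perms grid /u01/app/grid/crsdata/debug \
                /u01/app/grid/diag \
                /u01/app/18.1.0.0/grid/log/diag
```

Anything the scan prints is a candidate for a `chown`/`chmod` fix before retrying “kfod”.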

 

  • After reviewing the entire ADR destination and fixing the incorrect permissions, I was able to execute the “kfod” command described by the Oracle note, and the “rootupgrade.sh” script also completed successfully, thus upgrading GI to 18c:
[grid@exa1db04 grid]$ /u01/app/18.1.0.0/grid/bin/kfod op=cellconfig
<cell_data_ip>192.168.10.54;192.168.10.55</cell_data_ip><cell_name>exa1cel07</cell_name><cell_management_ip>10.108.106.12</cell_management_ip><cell_id>1710NM781V</cell_id><cell_version>OSS_18.1.12.0.0_LINUX.X64_190111</cell_version><cell_make_model>Oracle Corporation ORACLE SERVER X6-2L_EXTREME_FLASH</cell_make_model><discovery_status>reachable</discovery_status><cell_site_id>00000000-0000-0000-0000-000000000000</cell_site_id><cell_site_name></cell_site_name><cell_rack_id>00000000-0000-0000-0000-000000000000</cell_rack_id><cell_rack_name></cell_rack_name>
<cell_data_ip>192.168.10.52;192.168.10.53</cell_data_ip><cell_name>exa1cel06</cell_name><cell_management_ip>10.108.106.11</cell_management_ip><cell_id>1710NM781G</cell_id><cell_version>OSS_18.1.12.0.0_LINUX.X64_190111</cell_version><cell_make_model>Oracle Corporation ORACLE SERVER X6-2L_EXTREME_FLASH</cell_make_model><discovery_status>reachable</discovery_status><cell_site_id>00000000-0000-0000-0000-000000000000</cell_site_id><cell_site_name></cell_site_name><cell_rack_id>00000000-0000-0000-0000-000000000000</cell_rack_id><cell_rack_name></cell_rack_name>
<cell_data_ip>192.168.10.50;192.168.10.51</cell_data_ip><cell_name>exa1cel05</cell_name><cell_management_ip>10.108.106.10</cell_management_ip><cell_id>1710NM781J</cell_id><cell_version>OSS_18.1.12.0.0_LINUX.X64_190111</cell_version><cell_make_model>Oracle Corporation ORACLE SERVER X6-2L_EXTREME_FLASH</cell_make_model><discovery_status>reachable</discovery_status><cell_site_id>00000000-0000-0000-0000-000000000000</cell_site_id><cell_site_name></cell_site_name><cell_rack_id>00000000-0000-0000-0000-000000000000</cell_rack_id><cell_rack_name></cell_rack_name>
<cell_data_ip>192.168.10.48;192.168.10.49</cell_data_ip><cell_name>exa1cel04</cell_name><cell_management_ip>10.108.106.9</cell_management_ip><cell_id>1710NM7826</cell_id><cell_version>OSS_18.1.12.0.0_LINUX.X64_190111</cell_version><cell_make_model>Oracle Corporation ORACLE SERVER X6-2L_EXTREME_FLASH</cell_make_model><discovery_status>reachable</discovery_status><cell_site_id>00000000-0000-0000-0000-000000000000</cell_site_id><cell_site_name></cell_site_name><cell_rack_id>00000000-0000-0000-0000-000000000000</cell_rack_id><cell_rack_name></cell_rack_name>
<cell_data_ip>192.168.10.46;192.168.10.47</cell_data_ip><cell_name>exa1cel03</cell_name><cell_management_ip>10.108.106.8</cell_management_ip><cell_id>1710NM780M</cell_id><cell_version>OSS_18.1.12.0.0_LINUX.X64_190111</cell_version><cell_make_model>Oracle Corporation ORACLE SERVER X6-2L_EXTREME_FLASH</cell_make_model><discovery_status>reachable</discovery_status><cell_site_id>00000000-0000-0000-0000-000000000000</cell_site_id><cell_site_name></cell_site_name><cell_rack_id>00000000-0000-0000-0000-000000000000</cell_rack_id><cell_rack_name></cell_rack_name>
<cell_data_ip>192.168.10.44;192.168.10.45</cell_data_ip><cell_name>exa1cel02</cell_name><cell_management_ip>10.108.106.7</cell_management_ip><cell_id>1710NM781D</cell_id><cell_version>OSS_18.1.12.0.0_LINUX.X64_190111</cell_version><cell_make_model>Oracle Corporation ORACLE SERVER X6-2L_EXTREME_FLASH</cell_make_model><discovery_status>reachable</discovery_status><cell_site_id>00000000-0000-0000-0000-000000000000</cell_site_id><cell_site_name></cell_site_name><cell_rack_id>00000000-0000-0000-0000-000000000000</cell_rack_id><cell_rack_name></cell_rack_name>
<cell_data_ip>192.168.10.42;192.168.10.43</cell_data_ip><cell_name>exa1cel01</cell_name><cell_management_ip>10.108.106.6</cell_management_ip><cell_id>1710NM7815</cell_id><cell_version>OSS_18.1.12.0.0_LINUX.X64_190111</cell_version><cell_make_model>Oracle Corporation ORACLE SERVER X6-2L_EXTREME_FLASH</cell_make_model><discovery_status>reachable</discovery_status><cell_site_id>00000000-0000-0000-0000-000000000000</cell_site_id><cell_site_name></cell_site_name><cell_rack_id>00000000-0000-0000-0000-000000000000</cell_rack_id><cell_rack_name></cell_rack_name>
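The “kfod op=cellconfig” output is one record per cell, with the fields emitted as XML-like tags (they are easily swallowed when the output is pasted into HTML). Assuming that tagged format, individual fields can be pulled out with a one-line `sed`; the sample record below is an illustrative assumption modeled on the output above:

```shell
#!/bin/sh
# Hedged sketch: extract the cell name from one kfod op=cellconfig record.
# The sample record is an assumption modeled on the output shown earlier.
record='<cell_data_ip>192.168.10.42;192.168.10.43</cell_data_ip><cell_name>exa1cel01</cell_name><discovery_status>reachable</discovery_status>'
printf '%s\n' "$record" | sed -n 's|.*<cell_name>\([^<]*\)</cell_name>.*|\1|p'
```

Piping the full kfod output through the same `sed` expression would list every cell name, one per line.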



[root@exa1db04 ]#  /u01/app/18.1.0.0/grid/rootupgrade.sh
 Check /u01/app/18.1.0.0/grid/install/root_exa1db04.example.com_2019-08-26_00-50-48-539460928.log for the output of root script


[root@exa1db04 ~]# su - grid
[grid@exa1db04 ~]$ . oraenv <<< +ASM4


[grid@exa1db04 ~]$ crsctl query crs softwareversion
Oracle Clusterware version on node [exa1db04] is [18.0.0.0.0]


[grid@exa1db04 ~]$ crsctl query crs activeversion
Oracle Clusterware active version on the cluster is [18.0.0.0.0]

 

As mentioned earlier, patching an Exadata is hardly the painful task it used to be. Most of the errors described here were resolved by analyzing the log files generated by “patchmgr” and “dbnodeupdate.sh”, so even when something goes wrong, Oracle’s set of tools will still provide you with enough information to troubleshoot the issue and proceed with the patching.
