How to patch an Exadata (part 5) - troubleshooting

Fred Denis

March 28, 2017


5: Troubleshooting

In this post I'll share a few issues that we faced and resolved. From a 'lessons learned' perspective, they are worth sharing to help others. Please note that all of these fixes have been applied in real life on X4 and/or X5 Exadatas.

5.1 - Cell patching issue

It happened when the patch failed on a cell:

myclustercel05 2016-05-31 03:46:42 -0500 Patch failed during wait for patch finalization and reboot.
 2016-05-31 03:46:43 -0500 4 Done myclustercel05 :FAILED: Details in files .log /patches/April_bundle_patch/22738457/Infrastructure/12.1.2.3.1/ExadataStorageServer_InfiniBandSwitch/patch_12.1.2.3.1.160411/patchmgr.stdout, /patches/April_bundle_patch/22738457/Infrastructure/12.1.2.3.1/ExadataStorageServer_InfiniBandSwitch/patch_12.1.2.3.1.160411/patchmgr.stderr
 2016-05-31 03:46:43 -0500 4 Done myclustercel05 :FAILED: Wait for cell to reboot and come online.

Checking the following logfile on the cell, we can see that the patch failed due to reduced redundancy:

/opt/oracle/cell12.1.2.1.2_LINUX.X64_150617.1/.install_log.txt
 CELL-02862: Deactivation of grid disks failed due to reduced redundancy of the following grid disks: DATA_CD_00_myclustercel05, DATA_CD_01_myclustercel05, DATA_CD_02_myclustercel05, DATA_CD_03_myclustercel05, DATA_CD_04_myclustercel05, DATA_CD_05_myclustercel05, DATA_CD_06_myclustercel05, DATA_CD_07_myclustercel05, DATA_CD_08_myclustercel05, DATA_CD_09_myclustercel05, DATA_CD_10_myclustercel05, DATA_CD_11_myclustercel05....

This happened because the previous cell's disks were not brought back online after its reboot. In this case, we have to bring the disks online manually and then resume the patch on the remaining cells:
- Bring disks online manually on the failed cell:

# ssh root@myclustercel05
 # cellcli -e alter griddisk all active
 # cellcli -e list griddisk attributes name, asmmodestatus # to check the status of the disks

... wait until all disks are "ONLINE" ...

- Restart the patch on the remaining cells (cel06 and cel07)

# cd 
 # grep 'cel0[67]' ~/cell_group > cells_6_and_7
 # ./patchmgr -cells cells_6_and_7 -cleanup
 # ./patchmgr -cells cells_6_and_7 -patch_check_prereq -rolling
 # ./patchmgr -cells cells_6_and_7 -patch -rolling
 # ./patchmgr -cells ~/cell_group -cleanup
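
The "wait until all disks are ONLINE" step above can be scripted rather than watched by hand. Here is a minimal sketch, assuming the listing command prints one "name status" pair per line, as the cellcli command shown earlier does. The function names, the polling interval and the retry count are mine and purely illustrative. Also note that a disk in a state that never becomes ONLINE would keep the loop waiting until it gives up.

```shell
#!/usr/bin/env bash
# Sketch only: wait until every grid disk reports asmmodestatus ONLINE.

# Reads "name status" lines on stdin, prints how many are not ONLINE.
count_not_online() {
    awk 'NF >= 2 && $2 != "ONLINE" { n++ } END { print n + 0 }'
}

# wait_for_online LIST_CMD [INTERVAL_SECONDS] [MAX_TRIES]
# All three parameter names are hypothetical, chosen for this sketch.
wait_for_online() {
    local list_cmd interval max_tries out pending i
    list_cmd="$1"
    interval="${2:-60}"
    max_tries="${3:-120}"
    i=1
    while [ "$i" -le "$max_tries" ]; do
        # Refuse to conclude anything from a failed or empty listing.
        out=$(eval "$list_cmd") || { echo "list command failed" >&2; return 2; }
        [ -n "$out" ] || { echo "no grid disk listed" >&2; return 2; }
        pending=$(printf '%s\n' "$out" | count_not_online)
        if [ "$pending" -eq 0 ]; then
            echo "all grid disks are ONLINE"
            return 0
        fi
        echo "try $i: $pending grid disk(s) not ONLINE yet"
        sleep "$interval"
        i=$((i + 1))
    done
    echo "gave up after $max_tries tries" >&2
    return 1
}
```

On the cell, as root, it could be called like this: wait_for_online "cellcli -e list griddisk attributes name, asmmodestatus" 60 120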

5.2 - CRS does not restart issue

After a failed Grid patch, CRS was unable to restart. We opened an SR and Oracle came back with an action plan to restart the GI. Let's say the issue happened on server myclusterdb03 here:

Stop the clusterware

[root@myclusterdb03]# crsctl stop crs -f

Remove the network sockets

[root@myclusterdb03]# cd /var/tmp/.oracle
 [root@myclusterdb03]# rm -f *

Rename the maps file

[root@myclusterdb03]# cd /etc/oracle/maps/
 [root@myclusterdb03]# mv myclusterdb03_gipcd1318_cc0d4e3b8eedcf02bf179a98a71ce468-0000000000 X-myclusterdb03_gipcd1318_cc0d4e3b8eedcf02bf179a98a71ce468-0000000000

Start the clusterware

[root@myclusterdb03]# crsctl start crs

The Clusterware, upon starting, will recreate the network sockets and the maps file.
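
The maps file name contains a generated suffix, so it differs from one server to the next. Here is a minimal sketch of the rename step that does not hardcode that name. It assumes that every "<host>_gipcd*" file in the maps directory should be set aside, whereas the action plan above renamed exactly one file, so check the directory content before using anything like this. The function name set_aside_maps is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: set aside the gipcd maps file(s) of one host.

# set_aside_maps MAPS_DIR HOST
# Renames every "<HOST>_gipcd*" file in MAPS_DIR to "X-<original name>".
set_aside_maps() {
    local maps_dir host f name
    maps_dir="$1"
    host="$2"
    for f in "$maps_dir"/"$host"_gipcd*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        name=$(basename "$f")
        mv -- "$f" "$maps_dir/X-$name"
        echo "renamed $name to X-$name"
    done
}
```

With CRS stopped, it could be called as: set_aside_maps /etc/oracle/maps myclusterdb03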

5.3 - A Procedure to Add Instances to A Database

The following is a procedure that I performed after a CRS patch failed on node 3. In this case, some databases were only running on nodes 3 and 4. As the CRS patching had failed on node 3, we opted to move these databases to nodes 1 and 2 before the end of the maintenance window so we could then work on the failed node 3 quietly with no downtime. The patch on node 4 was next and was also completed with no downtime. The goal was to add two instances on nodes 1 and 2 to the database mydb:

select tablespace_name, file_name from dba_data_files where tablespace_name like 'UNDO%' ;
 create undo tablespace UNDOTBS1 datafile '+DATA' ;
 create undo tablespace UNDOTBS2 datafile '+DATA' ;
 alter system set undo_tablespace='UNDOTBS1' sid='mydb1' ;
 alter system set undo_tablespace='UNDOTBS2' sid='mydb2' ;
 
 show spparameter instance
 alter system set instance_number=3 sid='mydb1' scope=spfile ;
 alter system set instance_number=4 sid='mydb2' scope=spfile ;
 alter system set instance_name='mydb1' sid='mydb1' scope=spfile ;
 alter system set instance_name='mydb2' sid='mydb2' scope=spfile ;
 
 show spparameter thread ;
 alter system set thread=1 sid='mydb1' scope=spfile ;
 alter system set thread=2 sid='mydb2' scope=spfile ;
 
 set lines 200
 set pages 999
 select * from gv$log ;
 alter database add logfile thread 1 group 11 ('+DATA', '+RECO') size 100M, group 12 ('+DATA', '+RECO') size 100M, group 13 ('+DATA', '+RECO') size 100M, group 14 ('+DATA', '+RECO') size 100M ;
 alter database add logfile thread 2 group 21 ('+DATA', '+RECO') size 100M, group 22 ('+DATA', '+RECO') size 100M, group 23 ('+DATA', '+RECO') size 100M, group 24 ('+DATA', '+RECO') size 100M ;
 select * from gv$log ;
 
 alter database enable public thread 1 ;
 alter database enable public thread 2 ;
 
 srvctl add instance -db mydb -i mydb1 -n myclusterdb01
 srvctl add instance -db mydb -i mydb2 -n myclusterdb02
 srvctl status database -d mydb
 
 sqlplus / as sysdba
 select host_name, status from gv$instance ;
 
 srvctl modify service -d mydb -s myservice -modifyconfig -preferred 'mydb1,mydb2,mydb3,mydb4'
 srvctl start service -d mydb -s myservice -i mydb1
 srvctl start service -d mydb -s myservice -i mydb2
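
The two long "add logfile" statements in the procedure above follow a simple pattern: one thread, consecutive group numbers, the same two disk groups and the same size. Here is a minimal sketch that generates such a statement so the group numbers can be adapted to what gv$log shows. The function name gen_add_logfile and its parameters are hypothetical, and the +DATA and +RECO disk groups are simply the ones used above.

```shell
#!/usr/bin/env bash
# Sketch only: print an "alter database add logfile" statement.

# gen_add_logfile THREAD FIRST_GROUP COUNT SIZE
gen_add_logfile() {
    local thread first count size sep i
    thread="$1"
    first="$2"
    count="$3"
    size="$4"
    printf 'alter database add logfile thread %s' "$thread"
    sep=' '
    i=0
    while [ "$i" -lt "$count" ]; do
        # One redo log group, multiplexed across both disk groups.
        printf "%sgroup %s ('+DATA', '+RECO') size %s" "$sep" "$((first + i))" "$size"
        sep=', '
        i=$((i + 1))
    done
    printf ' ;\n'
}
```

For example, gen_add_logfile 1 11 4 100M prints the thread 1 statement used above, which can be reviewed and then pasted into sqlplus.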

5.4 - OPatch Resume

As general advice, if an opatch/opatchauto operation fails, try to resume it:

[root@myclusterdb03]# cd /patches/OCT2016_bundle_patch/24436624/Database/12.1.0.2.0/12.1.0.2.161018DBBP/24448103
 [root@myclusterdb03 24448103]# /u01/app/12.1.0.2/grid/OPatch/opatchauto resume -oh /u01/app/12.1.0.2/grid