20 mins vs 2 hours
Recently, I noticed that the re-imaging process on the second Oracle Database Appliance node took significantly less time than on the first node. The difference was so significant that I started to suspect something was wrong, either with that particular set of hardware or because some of the re-imaging steps had failed on the second node. On the first node the process completed in 120 minutes, but on the second it took only around 20 minutes.
I spent quite a bit of time trying to understand what exactly was happening. But before I tell you, can I ask what theoretical explanations you would have come up with given the behavior I just described? Please share them with me in the comment section below. :)
Any mystery can be solved
The question is: are we ready to pay for it? Sometimes it takes quite a bit of effort to get to the truth, and very often we don’t have the time, interest, or budget to find it. In this particular case, I was so curious that I spent a good part of my weekend looking for a clue. Along the way, I had to learn a bit about “Anaconda (installer)”, the SquashFS file system, how to rebuild an ISO image, and how the ODA re-imaging process works. The purpose of this paragraph is to encourage you to be curious and not leave mysteries unresolved. Invest some time, and you will learn a lot on the way :)
NOTE: I will try to share the way I troubleshot this problem in my future blog posts.
Bug in the “post-install” script
It appears that the problem is in the way the ISO:/Extras/setupodaovm.sh post-install script checks whether the software RAID has completed re-synchronization of the 4 internal HDD partitions (md devices) mirrored across the 2 physical disks. The following check is at the very end of the script:
mdadm --wait /dev/md1
mdadm --wait /dev/md2
mdadm --wait /dev/md3
Each of these lines is designed to wait until the software RAID has finished synchronizing an md device (partition). The following is from the man page for the mdadm utility:
-W, --wait For each md device given, wait for any resync, recovery, or reshape activity to finish before returning. mdadm will return with success if it actually waited for every device listed, otherwise it will return failure.
During the re-imaging process, all 4 volumes have to be rebuilt and re-synchronized by the software RAID. It is worth mentioning that the software RAID on ODA is configured to re-synchronize one device at a time. This leaves the other devices sitting and waiting their turn in the DELAYED state. The problem is that if a device is in the resync=DELAYED state, the “mdadm --wait” check does not stop and wait for it. Therefore, only one of the mdadm checks actually waits until a re-synchronization finishes; the others pass successfully even if a device isn’t synchronized yet (resync=DELAYED). Now let’s have a look at the devices’ sizes and the associated synchronization times:
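The DELAYED state is visible in /proc/mdstat. As an illustration only (the function name, the sample excerpt, and the block counts below are hypothetical, not taken from the ODA script), here is a minimal shell sketch that counts md arrays still pending synchronization in mdstat-style text:

```shell
# check_pending: count md arrays in mdstat-style text (read from stdin)
# that are either mid-resync (a progress percentage is shown) or queued
# behind another device as resync=DELAYED.
check_pending() {
  grep -E 'resync|recovery' | grep -cE 'DELAYED|[0-9.]+%'
}

# Hypothetical /proc/mdstat excerpt: md2 is queued, md1 is mid-resync.
sample='md2 : active raid1 sdb2[1] sda2[0]
      227612672 blocks [2/2] [UU]
        resync=DELAYED
md1 : active raid1 sdb1[1] sda1[0]
      17825792 blocks [2/2] [UU]
      [==>..................]  resync = 12.3% (2190000/17825792)'

printf '%s\n' "$sample" | check_pending   # prints 2 (devices still pending)
```

The point of the sketch: both the actively syncing device and the DELAYED one show up as “not yet in sync”, but only the former would make `mdadm --wait` block.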
Just to make life a bit more interesting, the software RAID picks the next device to be re-synchronized at random, so luck decides which device gets processed next. If it is the md1 device (17GB), the whole re-imaging process takes 20 minutes. However, if the software RAID is synchronizing the md2 device (217GB) during the execution of the mdadm check, the re-imaging process takes about 120 minutes.
A way to fix the problem
I am not a great expert in Linux system administration (I am an Oracle DBA, after all), and would rather let the Oracle folks make the final call. However, it seems to me that to make sure all 4 devices have been re-synchronized before the re-imaging process finishes, the check should look like the following:
mdadm --wait /dev/md0 /dev/md1 /dev/md2 /dev/md3
To conclude, until the issue is fixed, know that:
- You may see different re-imaging times on different ODA nodes.
- To be on the safe side, check whether the md devices’ re-synchronization has finished by running “cat /proc/mdstat” before running any business-critical processes on your ODA.
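If you want to script that safety check rather than eyeball the output, a minimal sketch looks like this (assumptions: the standard Linux md driver markers “resync” and “recovery” in /proc/mdstat; the helper name md_busy and the 30-second polling interval are my own choices):

```shell
# md_busy: succeed (exit 0) if mdstat-style text on stdin still shows
# any resync/recovery activity, including devices queued as DELAYED.
md_busy() { grep -qE 'resync|recovery'; }

# Usage on a live system (hypothetical polling interval):
#   while md_busy < /proc/mdstat; do sleep 30; done
#   echo "all md devices are in sync"

# Demonstration with inline text instead of /proc/mdstat:
echo 'resync=DELAYED' | md_busy && echo busy        # prints "busy"
echo 'md1 : active raid1' | md_busy || echo idle    # prints "idle"
```

Unlike the per-device `mdadm --wait` calls in the script, this loop only exits once no array in /proc/mdstat reports pending activity, so a DELAYED device cannot slip through.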
PS “Stay Hungry Stay Foolish” – Steve Jobs