Now that we know how to patch every component and the different options available to do so (rolling, non-rolling), which one is the best? How much time does it take?
The answer is obviously “it depends”, but I will try to offer a few insights so you have a ready answer when your manager inevitably asks you “How long will that patch take? I need to negotiate the maintenance window with the business… they aren’t happy…” ;)
Here is a summary of the length of the patch application in a Rolling fashion and in a Non-Rolling fashion (as well as the downtime for each method). Please note that I put in green what I recommend.
Cells
- Rolling : 1h30 x number of cells
- Rolling downtime : 0 minutes
- Non-rolling : 2h (1h30 to patch a cell + 30 minutes to stop and start everything before and after the patch)
- Non-rolling downtime : 2h
Note : Refer to my notes at the end of this page about this choice
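To make the cell figures above concrete, here is a minimal sketch of the arithmetic. It only uses the durations quoted in this post (1h30 per cell, 30 minutes to stop and start everything); the function name `cell_patch_estimate` and the example cell counts are hypothetical, so replace them with what you measure on your own Exadata.

```python
from datetime import timedelta

# Figures quoted in this post; they are estimates, not guarantees.
CELL_PATCH = timedelta(hours=1, minutes=30)  # time to patch one cell
STOP_START = timedelta(minutes=30)           # stop/start around a non-rolling patch


def cell_patch_estimate(cell_count: int) -> dict:
    """Return elapsed time and downtime for both cell patching methods."""
    if cell_count < 1:
        raise ValueError("cell_count must be at least 1")
    # Non-rolling: all cells are patched at once, plus the stop/start overhead.
    non_rolling = CELL_PATCH + STOP_START
    return {
        # Rolling: cells are patched one after the other, with no downtime.
        "rolling": {"elapsed": CELL_PATCH * cell_count, "downtime": timedelta(0)},
        "non_rolling": {"elapsed": non_rolling, "downtime": non_rolling},
    }


if __name__ == "__main__":
    for cells in (3, 7, 14):  # illustrative cell counts only
        est = cell_patch_estimate(cells)
        print(f"{cells} cells: rolling {est['rolling']['elapsed']}, "
              f"non-rolling {est['non_rolling']['elapsed']}")
```

With 7 cells, for example, the rolling patch takes 10h30 with no downtime, while the non-rolling patch takes 2h with 2h of downtime.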
IB Switches
- Rolling : 45 minutes per switch, so 1h30 in total for two switches
- Rolling downtime : 0 minutes
- Non-rolling : not available
- Non-rolling downtime : not available
Note: There is no non-rolling method for the IB Switches, so the choice here is an easy one!
Database servers
- Rolling : 1h per node
- Rolling downtime : It can be 0 minutes if you make good use of the Oracle services (as described here for the Grid patching; you can apply the same concept to the database server patching as well)
- Non-rolling : 1h
- Non-rolling downtime : 1h
Note: Refer to my notes at the end of this page about this choice
Grid
- Rolling : 30 – 45 minutes per node
- Rolling downtime: Can be 0 minutes if you make good use of the Oracle services as described in this paragraph
- Non-rolling : 30 – 45 minutes
- Non-rolling downtime : 30 – 45 minutes for all the instances running on the node you patch
Note: No green color here? To patch the grid, I recommend a mix such as:
- Rebalance the services away from node 1
- Patch the node 1
- Verify that everything has restarted properly on node 1
- Move all the services to node 1 (if a single node can handle the whole workload; we usually patch during a quiet period anyway)
- Apply the patch in a non-rolling manner (for the Grid it means launching the patch manually in parallel on the remaining nodes)
- Once the grid has been patched on all the nodes, restart all the services as they were before the patch
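The gain from this mix can be estimated with simple arithmetic. The sketch below assumes that the parallel step on the remaining nodes takes about as long as patching a single node (30 to 45 minutes); that is an assumption rather than a measured figure, and `grid_patch_minutes` is a hypothetical helper name.

```python
def grid_patch_minutes(node_count, per_node=(30, 45)):
    """Return (low, high) elapsed minutes for a fully rolling Grid patch
    and for the mixed approach (node 1 first, then the rest in parallel)."""
    if node_count < 2:
        raise ValueError("the mixed approach needs at least 2 nodes")
    low, high = per_node
    # Fully rolling: nodes are patched one after the other.
    rolling = (low * node_count, high * node_count)
    # Mixed: node 1 alone, then all remaining nodes at the same time.
    # ASSUMPTION: the parallel step lasts about as long as one node.
    mixed = (low * 2, high * 2)
    return {"rolling": rolling, "mixed": mixed}


if __name__ == "__main__":
    for nodes in (2, 4, 8):  # illustrative node counts only
        print(nodes, "nodes:", grid_patch_minutes(nodes))
```

Under that assumption the mixed approach stays at 60 to 90 minutes whatever the number of nodes, whereas a fully rolling patch on 8 nodes would take 240 to 360 minutes.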
Databases
- Rolling: 20 – 30 minutes per node + ~ 20 minutes per database for the post-installation steps
- Rolling downtime: Can be 0 minutes if you rebalance the services before patching a node (as described here for the Grid patching; you can apply the same concept to the database patching as well) + ~ 20 minutes per database for the post-installation steps.
Please note that if you have 30 databases sharing the same ORACLE_HOME, you won’t be able to easily apply 30 post-install steps at the same time, so the 30th database will suffer a longer outage than the first one you restart on the patched ORACLE_HOME. This is why I strongly recommend the use of this quicker method.
- A downtime of ~ 20 minutes per database, at a time you choose, when using the quicker way!
- Non-rolling: 20 – 30 minutes
- Non-rolling downtime: 20 – 30 minutes for all the databases running on the patched Oracle home + ~ 20 minutes per database for the post-installation steps. Note that if you have 30 databases sharing the same ORACLE_HOME, you won’t be able to apply 30 post-install steps at the same time, so the 30th database will suffer a longer outage than the first one you restart on the patched ORACLE_HOME.
Note: In this instance, I will definitely go for the quicker way: clone the Oracle home you want to patch to another one, apply the patch to the clone and move the databases one by one to the new patched Oracle home.
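The effect of many databases sharing one ORACLE_HOME can be worked out as follows. This is a minimal sketch using the figures above (20 to 30 minutes to patch the home, about 20 minutes of post-install per database); the `concurrency` parameter, meaning how many post-install steps you manage to run at the same time, and the function name `outage_minutes` are assumptions introduced for the illustration.

```python
import math


def outage_minutes(position, concurrency=1, home_patch=(20, 30), post_install=20):
    """Return the (low, high) outage in minutes for the database restarted in
    the given position (1 = first) after a non-rolling Oracle home patch."""
    if position < 1 or concurrency < 1:
        raise ValueError("position and concurrency must be at least 1")
    # Databases are handled in waves of `concurrency` post-install steps.
    wave = math.ceil(position / concurrency)
    low, high = home_patch
    # Outage = time to patch the home + every post-install wave up to this one.
    return (low + wave * post_install, high + wave * post_install)


if __name__ == "__main__":
    print("1st database :", outage_minutes(1))
    print("30th database:", outage_minutes(30))
    print("30th database, 5 at a time:", outage_minutes(30, concurrency=5))
```

With strictly sequential post-install steps, the first database is down for 40 to 50 minutes but the 30th for 620 to 630 minutes, compared with about 20 minutes per database with the clone-and-move approach.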
Notes on my recommendations
Yes, I always prefer the rolling method for the Infrastructure components (Grid and Database Servers). This is because I can mitigate the outage and I also avoid any outage caused by the patch itself or by anything else that prevents, for example, a reboot, as we do not reboot those servers frequently.
Imagine you go for a rolling cell upgrade and one cell does not reboot after the patch. You’ll have no issue here, as the patch will stop automatically; everything will work as before with one cell down, no one will notice anything, and you are still supported, as running different versions across different servers is supported. You can then quietly check the troubleshooting section of this blog or go to the pool while Oracle finds a solution for you.
It happened to us on production (it hadn’t happened on the DEV or QA Exadatas before…); we warned the client and it took Oracle a few days to provide an action plan. Everything ran perfectly for a week with a cell down; we then applied the Oracle action plan during the next weekend and could properly finish the patch. The result here is that we applied the patch successfully. We had an issue that caused neither an outage nor a performance degradation, and we still fit in the maintenance window: a very good job from a client and process point of view!
But if you go for non-rolling cell patching and all your cells (or a few of them) do not reboot after the patch, then you are in trouble, and you will lose ten times the time you thought you would save by going non-rolling. You will most likely have a failed patch outside of the maintenance window, a Root Cause Analysis to provide to the process guys, and you probably won’t patch this Exadata again for a while, as the client will be… hmmm… a bit chilly about that question in the future.
And this risk is the same for the database servers.
I am not saying that the Bundle won’t work and will create a big outage (I have applied many and it works pretty well); it is just all about risk mitigation. And remember: “highest level of patch = highest level of bug” :)
If you’ve reached this point, I hope that you enjoyed this Odyssey into the Exadata patching world as much as I enjoy working with it on a daily basis!