If you are using Oracle Database Appliance (ODA), this patch is very important to apply. Patch bundle 2.1.0.3.0 was published just this week, and the most important fix from my perspective is the new BIOS version 12010304.
Intel CPUs have a feature called Software Controlled Clock Modulation that allows programmatic control of the clock modulation duty cycle, which basically reduces the effective working clock frequency of the CPU. It’s intended to control CPU power consumption and often works alongside the thermal control mechanism. CPUs have Model-Specific Registers (MSRs), and the IA32_CLOCK_MODULATION MSR controls the clock modulation duty cycle. There are many other MSRs; for example, some give you access to on-die thermal sensors, like IA32_THERM_STATUS. For more information about the IA32_CLOCK_MODULATION MSR, see section 14.5.3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
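To make the register layout concrete, here is a minimal Python sketch that decodes a raw IA32_CLOCK_MODULATION value. It assumes the classic encoding described in the Intel manual (bit 4 enables on-demand modulation, bits 3:1 select the duty cycle in 12.5% steps) and ignores the extended encoding that some CPUs add in bit 0. The function name `decode_clock_modulation` is made up for this illustration.

```python
# IA32_CLOCK_MODULATION layout (classic, non-extended encoding):
#   bit 4     on-demand clock modulation enable
#   bits 3:1  duty cycle, in steps of 12.5%
ENABLE_BIT = 1 << 4
DUTY_SHIFT = 1
DUTY_MASK = 0b111


def decode_clock_modulation(value):
    """Return (enabled, duty_cycle_fraction) for a raw register value.

    duty_cycle_fraction is None when modulation is disabled or when the
    duty-cycle field holds the reserved encoding 000b.
    """
    enabled = bool(value & ENABLE_BIT)
    field = (value >> DUTY_SHIFT) & DUTY_MASK
    if not enabled or field == 0:
        return enabled, None
    return enabled, field * 0.125


print(decode_clock_modulation(0x14))  # enable bit set, field 010b -> (True, 0.25)
print(decode_clock_modulation(0x00))  # modulation off -> (False, None)
```

With this encoding, a value of 0x14 corresponds to modulation enabled at a 25% duty cycle.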
Back to the ODA… Apparently there was a bug in ODA BIOS 12010303 that set IA32_CLOCK_MODULATION to a duty cycle of 25%. Thus, customers of the database appliance could see slow CPU performance, especially when the system became CPU bound. Linux exposes the CPU MSRs via the /dev/cpu/{cpunum}/msr device, and administrators can read and write data at certain offsets of that device to read and set MSRs. This provided a simple workaround: disabling clock modulation.
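As a rough illustration of the /dev/cpu/{cpunum}/msr interface, the Python sketch below reads and writes an 8-byte MSR value at the offset equal to the MSR address (0x19A for IA32_CLOCK_MODULATION). It assumes an x86 Linux host with the `msr` kernel module loaded and root privileges. The helper names `read_msr` and `write_msr` and the `device` override are hypothetical, and this is not necessarily the exact procedure used for the workaround.

```python
import os
import struct

IA32_CLOCK_MODULATION = 0x19A  # MSR address, used as the file offset


def _msr_path(cpu, device):
    # 'device' lets a caller point at another file; by default use the
    # per-CPU MSR device that Linux exposes.
    return device if device is not None else f"/dev/cpu/{cpu}/msr"


def read_msr(address, cpu=0, device=None):
    """Read one 64-bit MSR value for the given logical CPU."""
    fd = os.open(_msr_path(cpu, device), os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]  # x86 is little-endian


def write_msr(address, value, cpu=0, device=None):
    """Write one 64-bit MSR value for the given logical CPU."""
    fd = os.open(_msr_path(cpu, device), os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)


# Example (needs root; each logical CPU has its own device file):
#   value = read_msr(IA32_CLOCK_MODULATION, cpu=0)
#   write_msr(IA32_CLOCK_MODULATION, value & ~(1 << 4), cpu=0)  # clear enable bit
```

Because every logical CPU has its own MSR device, a real workaround would need to loop over all CPUs, and the setting would not survive a reboot.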
Later there was a BIOS fix with manual instructions to apply it using ipmiflash, and I had a chance to test and verify it before the 2.1.0.3.0 patch was available. Many ODA customers applied that fix successfully as well. Thanks to the new patch bundle, there is no need to fiddle around with manual instructions anymore. Patch 13622348 is the 2.1.0.3.0 patch bundle.
The simplest way to demonstrate the CPU performance hit caused by a misconfigured clock modulation duty cycle is to use database system statistics (thanks to the OakTable Network mailing list for this idea). A non-patched ODA shows performance in the 600s of millions of operations per second.
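[sql]
-- Gather noworkload system statistics (this populates CPUSPEEDNW)
BEGIN
  DBMS_STATS.GATHER_SYSTEM_STATS();
END;
/

SELECT pname, pval1
FROM sys.aux_stats$
WHERE sname = 'SYSSTATS_MAIN' AND pname = 'CPUSPEEDNW';

PNAME                          PVAL1
------------------------------ ----------------------
CPUSPEEDNW                     673

1 row selected
[/sql]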
The patched ODA gives four times higher CPU performance, which is basically the same as you would observe on an Exadata X2-2 database server that uses the same CPUs:
[sql]
-- Same test on the patched node
BEGIN
  DBMS_STATS.GATHER_SYSTEM_STATS();
END;
/

SELECT pname, pval1
FROM sys.aux_stats$
WHERE sname = 'SYSSTATS_MAIN' AND pname = 'CPUSPEEDNW';

PNAME                          PVAL1
------------------------------ ----------------------
CPUSPEEDNW                     2713

1 row selected
[/sql]
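As a quick sanity check on those two CPUSPEEDNW values, the short Python calculation below compares them. The variable names are only for illustration; the numbers are the ones reported above.

```python
unpatched = 673   # CPUSPEEDNW on the non-patched ODA
patched = 2713    # CPUSPEEDNW on the patched ODA

speedup = patched / unpatched
fraction = unpatched / patched

print(f"speedup: {speedup:.2f}x")                # speedup: 4.03x
print(f"unpatched vs patched: {fraction:.1%}")   # unpatched vs patched: 24.8%
```

The non-patched node runs at roughly a quarter of the patched node's speed, which lines up with the 25% duty cycle set by the buggy BIOS.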
In the past, I did quite a bit of stress testing of ODA and focused a lot on the I/O subsystem, so I wanted to verify the impact of slower CPU performance on I/O. I believe you don’t really need fast CPUs to saturate 20 traditional HDDs (and my quick tests did indeed show about the same results), but I wanted to verify how it would impact I/O performance when it comes to LGWR writing to SSD.
To compare LGWR performance on patched vs non-patched ODA, I manually upgraded the BIOS on one ODA node only (node 1) while leaving the second node intact. I then set up a benchmark with parallel threads equally distributed across the two database instances, each thread inserting rows into its own table (so there is no contention between the threads) and committing after each row. What I wanted was to simulate a very high transaction rate. To monitor the transaction rate (user commits from V$SYSSTAT) and the performance of LGWR I/O (the log file parallel write event), I connected my real-time charting tool to plot the number of user commits as well as the number and average length of log file parallel write events, which are basically write IOPS and average write time. The charts plot data for each instance separately, so you can easily compare the impact of the reduced CPU clock modulation duty cycle on instance 2 of the database.
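The charted metrics can be derived from two snapshots of cumulative counters. The Python sketch below is only an illustration of that arithmetic, not the charting tool itself. It assumes the snapshot values come from V$SYSSTAT (VALUE for 'user commits') and V$SYSTEM_EVENT (TOTAL_WAITS and TIME_WAITED_MICRO for 'log file parallel write'); the `Snapshot` class, the `rates` function and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    taken_at: float    # sample time in seconds
    user_commits: int  # V$SYSSTAT value for 'user commits'
    lgwr_waits: int    # TOTAL_WAITS for 'log file parallel write'
    lgwr_time_us: int  # TIME_WAITED_MICRO for the same event


def rates(prev, curr):
    """Turn two cumulative snapshots into per-second rates."""
    elapsed = curr.taken_at - prev.taken_at
    if elapsed <= 0:
        raise ValueError("snapshots must be taken in increasing time order")
    writes = curr.lgwr_waits - prev.lgwr_waits
    time_us = curr.lgwr_time_us - prev.lgwr_time_us
    return {
        "commits_per_sec": (curr.user_commits - prev.user_commits) / elapsed,
        "lgwr_writes_per_sec": writes / elapsed,
        # Average write time in milliseconds; undefined if no writes occurred.
        "avg_lgwr_write_ms": (time_us / writes / 1000) if writes else None,
    }


# Made-up sample values, ten seconds apart, for one instance.
before = Snapshot(0.0, 1_000, 500, 1_000_000)
after = Snapshot(10.0, 51_000, 20_500, 11_000_000)
print(rates(before, after))
```

Sampling each instance separately and plotting these three values over time gives the kind of per-instance comparison described above.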
I’ve captured the benchmark on video, and it’s available for your viewing pleasure below. It’s best viewed at 1080p in full-screen mode.
Please remember that the reduced transaction rate you can see from the very beginning on node 2 is mainly due to the server processes themselves and not to slower LGWR I/O. That holds right up until the moment when the slower CPUs become saturated enough that CPU scheduling starts impacting LGWR, and it spends a significant amount of time in the queue waiting for CPU cycles. This test is synthetic and might not represent the real-life transaction rates you can achieve when processing what are usually more complex transactions, but it shows that the I/O subsystem of ODA can handle quite a high transaction rate just fine.
Thanks Alex, I’d love to see a followup on this on how you did the realtime graphs!
[…] up to the ODA customers — there is a critical patch 2.1.0.3.1 out that is applied on top of ODA patch bundle 2.1.0.3.0. This patch has an important fix for a bug causing ODA servers to shutdown in some situations when […]