How to Prevent Linux 7 Housekeeping from Breaking Oracle

Linux 7 introduced multiple changes and enhancements to the operating system. Among them was automated housekeeping for temporary files.

In version 6, tmpwatch was available to manage unused files in temp directories, and it could be scheduled via cron to keep temp files in check automatically. As of version 7, this functionality moved to systemd-tmpfiles and became a background service largely invisible to users and admins. Like tmpwatch, systemd-tmpfiles removes files and directories based on their access, modification, and change times (atime, mtime, ctime). By default, it deletes files and directories in /tmp that haven’t been accessed for 10 days and those in /var/tmp that haven’t been accessed for 30 days.
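
To see which age limits are actually configured on a given host, the rule files can be read directly. The following is a minimal sketch, assuming the stock rules live in /usr/lib/tmpfiles.d/tmp.conf and that paths contain no spaces; show_age_rules is a hypothetical helper name, not part of systemd.

```shell
#!/bin/sh
# Print the path and age of every tmpfiles.d rule that sets an age limit.
# Line format per tmpfiles.d(5): Type Path Mode User Group Age Argument
# Assumption: paths contain no whitespace (true for the stock /tmp rules).
show_age_rules() {
    conf="${1:-/usr/lib/tmpfiles.d/tmp.conf}"
    # Skip comment lines, and rules whose Age field is missing or "-".
    awk '!/^[[:space:]]*#/ && NF >= 6 && $6 != "-" { print $2, $6 }' "$conf"
}
```

Calling show_age_rules with no argument reads the stock file; passing another path checks a local override. Running systemctl list-timers systemd-tmpfiles-clean.timer alongside it shows when the next cleanup is due.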

That probably seems like a good idea. In many cases, it is. However, it turns out to be a very bad thing for an Oracle server.

Oracle uses the hidden /var/tmp/.oracle directory to store socket files. These are a special type of file that enable interprocess connections (IPC) on a system. They’re created by Oracle when a connection is established via the Listener and used by ASM in addition to CRS, CSS, and EVM daemons.

There are multiple Oracle Support documents that describe the consequences of deleting these files, including 370605.1, 391790.1, 1322234.1, 2099377.1, and 2492508.1. The symptoms are all… well, bad! These include the inability to connect to or start clusterware or a database, ORA-600/ORA-7445 errors, GI crashes, and SCAN failures. Resolution nearly always requires downtime for the database or cluster.

I just encountered a situation where temp file deletion by systemd-tmpfiles resulted in excessive virtual memory consumption and CPU load as Oracle ASM processes attempted to read deleted socket files. In this example, I was able to solve things without a restart, but the host did come close to crashing from resource starvation.

We were first alerted to growing virtual memory use on the system. By looking at the current and previous day’s swapping with sar -S, I saw the system normally used 10-12% of its swap space. Over the course of a few hours, that jumped to over 50%. I reviewed the processes using the most swap:

# List info for processes using at least 10,000 kB (about 10 MB) of swap:
for p in $(grep "VmSwap" /proc/*/status | grep -vi " 0 kb" | cut -d: -f1,3 | awk '$2 >= 10000' | sort -n -k 2 | cut -d/ -f3)
do
    grep -E "^Name:|^State:|^Pid:|^Vm|switches:" /proc/$p/status
done

This showed the four biggest consumers were ASMCMD daemon processes. Each was consuming over 20G of memory (the example below shows roughly 20 GB resident plus about 2 GB in swap):

Name:   asmcmd daemon
State:  S (sleeping)
Pid:    7533
VmPeak: 23526424 kB
VmSize: 23526424 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:  21691104 kB
VmRSS:  20972500 kB
VmData: 23254580 kB
VmStk:       140 kB
VmExe:      1912 kB
VmLib:     78352 kB
VmPTE:     45168 kB
VmSwap:  2014504 kB
voluntary_ctxt_switches:        6557964
nonvoluntary_ctxt_switches:     63522372
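
The values in /proc/PID/status are reported in kB (units of 1024 bytes), which makes them hard to compare at a glance. The following is a minimal sketch of a converter; kb_to_gib is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Convert a kB value from /proc/PID/status to GiB, rounded to two decimals.
# 1 GiB = 1024 * 1024 kB = 1048576 kB
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1048576 }'
}
```

For the process above, kb_to_gib 23526424 prints 22.44 (VmSize), kb_to_gib 20972500 prints 20.00 (VmRSS), and kb_to_gib 2014504 prints 1.92 (VmSwap).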

Running strace on these processes showed “Failed to connect to ASM instance” as well as a number of invalid calls to socket files that didn’t exist. The associated processes were also showing up in the ASMCMD log files, where messages for “Invalid file handle for pipe /tmp/asmcmd_fg_7533” were appearing several times a second. Further analysis showed these processes were steadily grabbing additional swap from the system, with accompanying CPU increases. The database eventually began reporting ORA-700 for “[kskvmstatact: excessive swapping observed]”.

In this situation, the processes were foreground sessions, so they could be killed without any impact to the stability or availability of any Oracle components. The memory held by the processes was immediately returned and load dropped to normal levels.

There was no indication any user had removed files and the directory itself was still there, but all atime/ctime/mtimes on the remaining files were under the 10-day threshold. While this is not direct proof against systemd-tmpfiles, the system was running Linux 7 and the service was active, which are good indicators that it “helpfully” removed the socket files. Here’s another example of what can happen when files in /var/tmp/.oracle are deleted.
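
A quick way to repeat this kind of check is to print the type and all three timestamps for each entry in the directory. The following is a minimal sketch, assuming GNU find and stat as shipped on Linux 7; show_times is a hypothetical helper name, and the default path is the one discussed in this post.

```shell
#!/bin/sh
# Print type, path, atime, mtime and ctime for each entry directly under
# a directory (default: /var/tmp/.oracle), one entry per line.
show_times() {
    dir="${1:-/var/tmp/.oracle}"
    find "$dir" -mindepth 1 -maxdepth 1 \
        -exec stat -c '%F|%n|atime=%x|mtime=%y|ctime=%z' {} +
}
```

Socket files show up with the type "socket". To see when the cleanup service last ran, journalctl -u systemd-tmpfiles-clean.service can be compared against the timestamps printed here.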

Fortunately, there’s an easy fix as described in Document 2498572.1. Simply tell systemd-tmpfiles to ignore the hidden .oracle directories by adding the following to the systemd-tmpfiles configuration file located at /usr/lib/tmpfiles.d/tmp.conf:

x /tmp/.oracle*
x /var/tmp/.oracle*
x /usr/tmp/.oracle*

Then, restart the service with systemctl restart systemd-tmpfiles-clean.timer.
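
Because package updates can replace files under /usr/lib, an alternative is to keep the same three rules in a separate file under /etc/tmpfiles.d, which tmpfiles.d(5) reserves for local administrator configuration. This is a variation on the documented fix, not what the Oracle note prescribes. The sketch below assumes a hypothetical file name, oracle-exclude.conf, and two hypothetical helper functions.

```shell
#!/bin/sh
# Write the Oracle exclusion rules to a tmpfiles.d drop-in file.
# The default file name is hypothetical; any *.conf name under
# /etc/tmpfiles.d that does not shadow an existing file should do.
write_oracle_exclusions() {
    target="${1:-/etc/tmpfiles.d/oracle-exclude.conf}"
    cat > "$target" <<'EOF'
# Keep systemd-tmpfiles away from the Oracle socket directories
x /tmp/.oracle*
x /var/tmp/.oracle*
x /usr/tmp/.oracle*
EOF
}

# Return 0 only if the given file contains all three exclusion rules.
has_oracle_exclusions() {
    conf="$1"
    for p in '/tmp/.oracle*' '/var/tmp/.oracle*' '/usr/tmp/.oracle*'; do
        # -F: fixed string, -x: whole line must match, -q: quiet
        grep -Fqx "x $p" "$conf" || return 1
    done
}
```

Run as root, write_oracle_exclusions creates the drop-in and has_oracle_exclusions /etc/tmpfiles.d/oracle-exclude.conf confirms the rules are present. A distinct file name is used so the stock tmp.conf rules stay in effect; a file named tmp.conf under /etc/tmpfiles.d would replace the stock file entirely.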

Note that on Exadata, /var/tmp is linked to /tmp, which means the 10-day access threshold on /tmp applies to files under /var/tmp as well.
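
Whether this applies to a particular host can be checked by resolving the directory. A minimal sketch, assuming GNU readlink; resolve_dir is a hypothetical helper name.

```shell
#!/bin/sh
# Print the real location of a directory after following symbolic links
# (default: /var/tmp).
resolve_dir() {
    readlink -f "${1:-/var/tmp}"
}
```

If resolve_dir prints /tmp rather than /var/tmp, the shorter /tmp age limit is the one that governs files under /var/tmp.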

This is a good thing to remember when installing or working with Oracle on RHEL/CentOS/OEL 7 systems.

4 Comments

Martin Berger
August 19, 2019 11:53 am

Thank you for sharing your insights and analysis, Sean.
If you run Connection Manager on Linux 7, also exclude /var/tmp/.oracle_500100. (I’m not sure if this directory is used in all versions; I observed it with 12.2.0.1.181016.)

Thanks Martin! The use of a wildcard (/var/tmp/.oracle*) in the configuration file should address this and other variations.

Instead of modifying /usr/lib/tmpfiles.d/tmp.conf, it may be better to put it in /etc/tmpfiles.d/tmp.conf, as /usr/lib/ can be overwritten again on update.

This is a great point, thanks for picking up on that subtlety! Not sure why Oracle’s MOS documentation doesn’t address it this way. :/
