Mind your rdbms/audit on 12c

Posted in: Oracle, Technical Track

Recently we’ve ran into an interesting question from one of our clients. They were seeing messages of the following form in syslog:


"Aug 11 11:56:02 ***** kernel: EXT3-fs warning (device ******): ext3_dx_add_entry: Directory index full!"

I haven’t encountered this before, and did a bit of research. My initial suspicion ended up being correct, and it was due to too many files being created, somewhere in that file system. I had a look around, and eventually checked out the ORACLE_HOME of the ASM / Grid Infrastructure software, which is running version 12.1.0.2 on that host.

I snooped around using du -sh to check which directories or sub-directories might be the culprit, and the disk usage utility came to a halt after the “racg” directory. Next in line would be “rdbms”. The bulb lit up somewhat brighter now. Entering the rdbms/audit directory, I issued the common command you would if you wanted to look at a directories contents: “ls”.

Five minutes later, there was still no output on my screen. Okay, we found the troublemaker. So we’re now being faced with a directory that has potentially millions of files in it. Certainly we all are aware that “rm” isn’t really able to cope with a situation like this. It would probably run for a couple minutes until it’s done parsing the directory index, and then yell “argument list too long” at us. Alternatively, we could use find, combined with -exec (bad idea), -delete, or even pipe into rm using xargs. Looking around a bit on the good ol’ friend google, I came across this very interesting blog post by Sarath Pillai.

I took his PERL one-liner, adjusted it a wee bit since I was curious how many files we actually got in there and ran it on a sandbox system with a directory with 88’000 files in it:


perl -e 'my $i=0;for(<*>){$i++;((stat)[9]<(unlink))} print "Files deleted: $i\n"'

It completed in 4.5 seconds. That’s pretty good. In Sarath’s tests he was able to delete half a million files in roughly a minute. Fair enough.

After getting the OK from the client, we ran it on the big beast. It took 10 minutes.


Files deleted: 9129797

9.1 million files. Now here comes the interesting bit. This system has been actively using 12.1.0.2 ASM since May 6th, 2015. That’s only 3 months. That translates to 3 million files per month. Is this really a desirable feature? Do we need to start running Hadoop just to be able to mine the data in there?

Looking at some of the files, it seems ASM is not only logging user interactions there, but also anything and everything done by any process that connects to ASM.

As I was writing this, I happened to take another peek at the directory.


[[email protected] audit]$ ls -1 | wc -l
9134657

Remember those numbers from before? Three million a month? Double that.

I suspect this was due to the index being full, and Linux has now re-populated the index with the next batch. Until it ran full again.

A new syslog entry just created at the same time seems to confirm that theory:

Aug 12 00:09:11 ***** kernel: EXT3-fs warning (device ******): ext3_dx_add_entry: Directory index full!

After running the PERL one-liner again, we deleted another vast amount of files:


Files deleted: 9135386

It seems that the root cause is the added time stamp to the file names of the audit files that Oracle writes in 12.1. The file names are much more unique, which gives Oracle the opportunity to generate so many more of them. Where in previous versions, with an adequately sized file system you’d probably be okay for a year or more; on 12.1.0.2, on an active database (and our big beast is very active) you have to schedule a job to remove them, and ensure it runs frequently (think 18+ million files in 3 months to put “frequently” into perspective).

email

Author

Want to talk with an expert? Schedule a call with our team to get the conversation started.

About the Author

Stefan is a passionate Oracle database researcher and has been focusing on understanding the Oracle database at its core for more than 10 years. His primary focus lies in the performance of the database, its security aspects, and using its various features to solve problems and architect efficient database designs. Stefan is also an avid technical writer and has written large documentation sets from scratch for massive Oracle projects.

6 Comments. Leave new

Nice hint, thank you for sharing.

Reply

there is a minor error in your perl one-liner/your blog-settings: for() must not be empty.
i’ll guess, your blog-software tries to interpret the content of that brackets as html

Reply
Stefan Knecht
April 6, 2016 7:23 am

Thanks Buddah for pointing that out! Good catch and you’re right, WordPress has stripped those.

Reply
Adrian J Shepherd
May 12, 2016 10:32 am

HI Stefan

We had this issue in Production a few days ago… Here’s an email i sent out…

Finally got to the bottom of this.

It doesn’t help that we have rolled out an 11G database on a 12C stack…untested. Not sure why we build. It’s on Linux too…Yuck.

The issue is a debate between a bug and “functionality”. In the set of circumstances above, when using Oracle Restart , there will be a trace file generated every seconds. This is when the Restart notifier connects to ASM as SYS to perform it’s check.

How often is this check configured ?

crsctl stat res ora.asm -p | grep CHECK

CHECK_INTERVAL=1
CHECK_TIMEOUT=30
OFFLINE_CHECK_INTERVAL=0

So, every second…. Which is the symptom seen by Mark and myself. Millions of trace files in a short period.

We do not see this in the 12C build, or on DIY.COM , the original 11G ASM build – housekeeping was implemented , and effectively masked this symptom.
The 11G build standard , if it exists , does not include 12C style Grid housekeeping.

Solution – modify the resource and reduce the check interval

$ crsctl modify resource ora.asm -attr START_DEPENDENCIES=”hard(ora.cssd)”
CRS-4995: The command ‘Modify resource’ is invalid in crsctl. Use srvctl for this command.

Huh ??? Yes it is…

$ crsctl modify resource –help

Usage:
crsctl modify resource -attr “[,…]” [-f] [-delete] [-i]
: {= | @=}
: {@SERVERNAME()[@DEGREEID()] |
@CARDINALITYID()[@DEGREEID()] }
where
resName Modify named resource
attrName Attribute name
value Attribute value
server Server name
cid Resource cardinality ID
did Resource degree ID
-f Force option
-delete Delete named attribute
-i Fail if request cannot be processed immediately

Several Googles later….Use the unsupported flag…

$ crsctl modify resource ora.ssm -attr “CHECK_INTERVAL=60” -unsupported
CRS-2613: Could not find resource ‘ora.ssm’.
CRS-4000: Command Modify failed, or completed with errors.

Annoying… Clearly the resource is there, Ive just checked its config

OK, fall back, disable restart. It’s not tested at KITS. We need to take this into the lab and configure / test the DBMS audit purge function to support the default restart check settings.

CAUSE
Oracle Restart, just like the Grid Infrastructure daemons, check the registered services availability with a frequency that is set by a parameter named CHECK_INTERVAL and, as documented :

“The lower the value for this parameter for a resource (the meaningful settings would be for the ASM/database instances, since the only ones capable of generating audit files), the more frequent the check is done and the higher the number of audit files.”

We cant entirely stop audit files. We can only manage them by reducing the frequency and implementing housekeeping. The documented fix for setting the interval does not work , so we disable restart.

crsctl modify resource ora.asm -attr START_DEPENDENCIES=”hard(ora.cssd)”
$ crsctl modify resource ora.asm -attr START_DEPENDENCIES=”hard(ora.cssd)”
CRS-4995: The command ‘Modify resource’ is invalid in crsctl. Use srvctl for this command.

What again ??? Is anything supported…another fix that doesn’t work.
The hint to use srvctl instead is just plain rubbish.

Let me try that unsupported thingy again…

$ crsctl modify resource ora.asm -attr START_DEPENDENCIES=”hard(ora.cssd)” -unsupported

Kerching !!

Reply
Kirill Richine
August 26, 2016 4:06 pm

Nice observation.

One workaround is to send the audit entries to syslog instead of letting Oracle create numerous OS files.

audit_syslog_level string LOCAL#.NOTICE

this works for ASM as well as rdbms instance.

The audit entries will then start showing up in the messages file generated by syslog, not in separate files.

These entries can then be separated by defining local#.notice line in syslog.conf, to go to [one] separate file instead of the central messages file, if desired.

Syslog (or logrotate in linux) also provides regular log rotation/retention facility.

Reply

From my perspective this is what failed the first time:

“$ crsctl modify resource ora.ssm -attr “CHECK_INTERVAL=60” -unsupported”

See the ora.ssm and not ora.asm?

Just a tought, cheers.

Reply

Leave a Reply

Your email address will not be published. Required fields are marked *