No arguments here – use xargs for file management performance

Posted in: Open Source, Oracle, Technical Track

 

Database Administrators and System Administrators have this in common: managing a large number of log files is just part of the job on Linux systems.

Tools such as logrotate significantly simplify the file management task for routinely created log files. Even so, there are still many ‘opportunities’ to exercise your command line fu to manage thousands or millions of files.  These may be files that need to be moved, removed or searched.

When the files span multiple directories the find command is often used. The following command for instance will find all log files of a certain age and size and remove them.

find . -name "*.log" -size +1M -exec rm  {} \;

 

For a few files this will work just fine, but what happens if the number of files to be processed is several thousands, or even millions?

The xargs Difference

Let’s first create 200k files to use for testing. These files will all be empty, there is no need for any content for these tests.

The script create.sh can be used to create the directories and empty files.

As it takes some time to create the files, we will not use the rm command here, but rather just the file command. The command will be timed as well.

#  time find . -type f -name file_\* -exec file {} \; >/dev/null
real    1m24.764s
user    0m4.624s
sys     0m12.581s

Perhaps 1 minute and 24 seconds seems to be a reasonable amount of time to process so many files.

It isn’t.

Let’s use a slightly different method to process these files, this time by adding xargs in a command pipe.

 time find . -type f -name file_\* | xargs file >/dev/null
real    0m0.860s
user    0m0.456s
sys     0m0.432s

Wait, what?!  0.8 seconds? Can that be correct?

Yes, it is correct. Using xargs with find can greatly reduce the resources needed to iterate through files.

How then, is is possible for the command that used xargs to complete so much faster than the command that did not use xargs?

When iterating through a list of files with the -exec  argument to the find command, a new shell is forked for each execution of find.

For a large number of files this requires a lot of resources.

For demonstration purposes I will be using the ‘file’ command rather than ‘rm’.

Could it be that the xargs method may have benefited from the caching effects of running the first find command?

Could be – let’s run find … -exec again and see if it benefits from caching.

# time find . -type f -name file_\* -exec file {} \; >/dev/null
real    1m25.722s
user    0m3.900s
sys     0m11.893s

Clearly any caching didn’t help find … -exec.

Why Is xargs Fast?

Why is the use of xargs so much faster than find? In short it is due to find starting a new process for each file it finds when the -exec option is used.

The command ‘find | xargs’ was wrapped in a shell script find-xargs.sh to facilitate the use of strace.

The find-xargs.sh script takes 2 arguments; the number of files to pipe to xargs and the number files that xargs should send to the file command for each invocation of file.

The number of files to process is controlled by piping the output of find to head.

The xargs –max-args argument is used to control how many arguments are sent to each invocation of find.

We can now use strace with the -c option; -c accumulates a count of all calls along with timing information.

Calling the script to run for the first 10000 files, with 1000 files sent to each invocation of find:

# strace -c -f  ./find-xargs.sh 10000 1000
MAX_FILES: 10000
MAX_ARGS: 1000
Process 11268 attached
Process 11269 attached
…
Process 11267 resumed
Process 11269 detached
% time     seconds  usecs/call     calls    errors syscall
—— ———– ———– ——— ——— —————-
 99.55    0.080017        5001        16         2 wait4
  0.35    0.000280           0     12372           newfstatat
  0.09    0.000074           0       208           getdents
  0.01    0.000006           0     10000           lstat
  0.00    0.000000           0       199           read
  0.00    0.000000           0       276         1 write
  0.00    0.000000           0       384        91 open
  0.00    0.000000           0       313         4 close
  0.00    0.000000           0        68        42 stat
  0.00    0.000000           0       189           fstat
  0.00    0.000000           0         5         1 lseek
  0.00    0.000000           0       209           mmap
  0.00    0.000000           0        71           mprotect
  0.00    0.000000           0        37           munmap
  0.00    0.000000           0        72           brk
  0.00    0.000000           0        41           rt_sigaction
  0.00    0.000000           0        80           rt_sigprocmask
  0.00    0.000000           0         2           rt_sigreturn
  0.00    0.000000           0        13        12 ioctl
  0.00    0.000000           0        77        77 access
  0.00    0.000000           0         2           pipe
  0.00    0.000000           0         6           dup2
  0.00    0.000000           0         1           getpid
  0.00    0.000000           0        14           clone
  0.00    0.000000           0        14           execve
  0.00    0.000000           0         2           uname
  0.00    0.000000           0         4         1 fcntl
  0.00    0.000000           0       206           fchdir
  0.00    0.000000           0         5           getrlimit
  0.00    0.000000           0         1           getuid
  0.00    0.000000           0         1           getgid
  0.00    0.000000           0         1           geteuid
  0.00    0.000000           0         1           getegid
  0.00    0.000000           0         1           getppid
  0.00    0.000000           0         1           getpgrp
  0.00    0.000000           0        14           arch_prctl
  0.00    0.000000           0         2         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
—— ———– ———– ——— ——— —————-
100.00    0.080377                 24910       232 total

The largest chunk of time was spent in the wait4 system call. These are waits on execve, of which there were 14.

Of the 14 calls to execve, there was 1 each for the use of bash (the script itself), find, head and xargs, leaving 10 calls to be consumed by file.

The following command can be used if you would like to try this yourself:

strace  -f -e trace=execve  ./find-xargs.sh 10000 1000  2>&1 | grep execve

What happens when the same type of test is run against find with the -exec argument?

There is no method (that I can find in the man page anyway) by which we can limit the number of files that are sent to the program specified in the -exec argument of find.

We can still learn what is going on, it is just necessary to wait 1.5 minutes for the command to complete.

# strace -c -f find . -type f -name file_\*  -exec file {} \; >/dev/null
% time     seconds  usecs/call     calls    errors syscall
—— ———– ———– ——— ——— —————-
 96.80    4.101094          21    200000           wait4
  0.69    0.029305           0    200000           clone
  0.46    0.019278           0   2602351   1400007 open
  0.44    0.018833           0    600001           munmap
  0.31    0.013108           0   3200017           mmap
  0.30    0.012715           0   1401173           fstat
  0.16    0.006979           0   1200006   1200006 access
  0.15    0.006543           0   1202345           close
  0.15    0.006288           0   1000004    600003 stat
  0.13    0.005632           0   1000004           read
  0.12    0.004981           0    200000           lstat
  0.09    0.003704           0    600026           brk
  0.07    0.003016           0   1000009           mprotect
  0.07    0.002776           0    200001    200000 ioctl
  0.03    0.001079           0    201169           newfstatat
  0.02    0.000806           0      2347           getdents
  0.01    0.000600           0    200000           write
  0.00    0.000003           0    200001           arch_prctl
  0.00    0.000002           0    202341           fchdir
  0.00    0.000000           0         3           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0    400001    200000 execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           fcntl
  0.00    0.000000           0         2           getrlimit
  0.00    0.000000           0         2         1 futex
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
—— ———– ———– ——— ——— —————-
100.00    4.236742              15811808   3600017 total

 

You may have noticed there are twice as many calls to execve than there were files to process.

This is due to something referenced in the comments of find-xarg.sh. Unless a full path name is specified when running a command, the PATH variable is searched for that command. If the command is not found by the first invocation of execve, then another attempt is made the next directory in PATH.

The following example shows the difference between using the command name only, and then using the fully pathed name of the file command.

# strace -e trace=execve -f find -maxdepth 1 -type f -name \*.sh  -exec file {} \;  2>&1 | grep execve
execve("/usr/bin/find", ["find", "-maxdepth", "1", "-type", "f", "-name", "*.sh", "-exec", "file", "{}", ";"], [/* 83 vars */]) = 0
[pid  9267] execve("/usr/local/bin/file", ["file", "./find-xargs.sh"], [/* 83 vars */]) = -1 ENOENT (No such file or directory)
[pid  9267] execve("/usr/bin/file", ["file", "./find-xargs.sh"], [/* 83 vars */]) = 0
[pid  9268] execve("/usr/local/bin/file", ["file", "./create.sh"], [/* 83 vars */]) = -1 ENOENT (No such file or directory)
[pid  9268] execve("/usr/bin/file", ["file", "./create.sh"], [/* 83 vars */]) = 0
[pid  9269] execve("/usr/local/bin/file", ["file", "./distribution.sh"], [/* 83 vars */]) = -1 ENOENT (No such file or directory)
[pid  9269] execve("/usr/bin/file", ["file", "./distribution.sh"], [/* 83 vars */]) = 0
# strace -e trace=execve -f find -maxdepth 1 -type f -name \*.sh  -exec /usr/bin/file {} \;  2>&1 | grep execve
execve("/usr/bin/find", ["find", "-maxdepth", "1", "-type", "f", "-name", "*.sh", "-exec", "/usr/bin/file", "{}", ";"], [/* 83 vars */]) = 0
[pid  9273] execve("/usr/bin/file", ["/usr/bin/file", "./find-xargs.sh"], [/* 83 vars */]) = 0
[pid  9274] execve("/usr/bin/file", ["/usr/bin/file", "./create.sh"], [/* 83 vars */]) = 0
[pid  9275] execve("/usr/bin/file", ["/usr/bin/file", "./distribution.sh"], [/* 83 vars */]) = 0

Too Much Space

Regardless of how bad a practice it may be, there will be times that file and directory names may contain space characters. Literal spaces, newlines and tabs can all play havoc with file name processing;  xargs has you covered.

Two files are created to demonstrate:

 

# touch ‘this filename has spaces’ this-filename-has-no-spaces
# ls -l
total 0
-rw-r–r– 1 jkstill dba 0 Apr 15 09:28 this filename has spaces
-rw-r–r– 1 jkstill dba 0 Apr 15 09:28 this-filename-has-no-spaces

What happens when the output of find it piped to xargs?

 

 find . -type f | xargs file
./this-filename-has-no-spaces: empty
./this:                        ERROR: cannot open `./this’ (No such file or directory)
filename:                      ERROR: cannot open `filename’ (No such file or directory)
has:                           ERROR: cannot open `has’ (No such file or directory)
spaces:                        ERROR: cannot open `spaces’ (No such file or directory)

The spaces in one of the filenames causes xargs to treat each word in the filename as a separate file.

Because of this it is a good idea to use the -print0 and -0 args as seen in the following example. These arguments change the output terminator of find to the null character, as well as changing the input terminator of xargs to the null character to deal with space characters in file and directory names.

 

 find . -type f -print0  | xargs -0 file
./this-filename-has-no-spaces: empty
./this filename has spaces:    empty

There is quite a bit more to xargs than this, I would encourage you to read the man page and experiment with the options to better learn how to make use of it.

Hope For find

For many versions of GNU find there is an easy modification that can be made to the command line that will cause the -exec option to emulate the method xargs uses pass input to a command.

Simply by changing -exec command {} \; to  -exec command {} +, the find command will execute much faster than previously.

Here the find command has matched the performance of xargs when processing 200k files:

 

# time find . -type f -name file_\*  -exec file {} +  | wc
 200000  400000 8069198
real    0m0.801s
user    0m0.436s
sys     0m0.404s

This may mean a quick and simple change to maintenance scripts can yield a very large increase in performance.

Does this mean there is no longer a need for xargs?  Not really, as xargs offers levels of control over the input to piped commands that simply are not available in the find command.

If you’ve never used xargs, you should consider doing so, as it can reduce the resource usages on your systems and decrease the runtime for maintenance tasks.

email

Author

Interested in working with Jared? Schedule a tech call.

About the Author

Oracle experience: started with Oracle 7.0.13 Programming Experience: Perl, PL/SQL, Shell, SQL Also some other odds and ends that are no longer useful Systems: Networking, Storage, OS to varying degrees. Have fond memories of DG/UX

No comments

Leave a Reply

Your email address will not be published. Required fields are marked *