Using xargs to do parallel processing

Previously posted on blog.labrat.info on April 29, 2010 and found by Hacker News on December 14, 2012.

Going through some log files the other day I came to a realization. Most modern machines are multi-processor machines, and they are rarely used as such. I had a boat-load of archived log files that I had to go through. It was taking forever to uncompress them one by one. At first I thought I would just make a loop, send all the gunzip processes to the background and wait for them to be done.

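# Let the shell expand the glob directly; parsing ls output breaks on odd file names
for i in *.gz; do
  gunzip "$i" &
done
# Block until every background gunzip has exited
wait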

The problem with this approach is you can very quickly overwhelm the system if the number of compressed files is larger than the number of cores on the machine.

My next idea was to add a counter to the loop and then count to the number of cores (4 in this case) and then wait for all processes to be done.

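COUNT=0
for i in *.gz; do
  gunzip "$i" &
  COUNT=$((COUNT + 1))
  # Once 4 jobs have been started, wait for the whole batch to finish
  if [ "$COUNT" -eq 4 ]; then
    COUNT=0
    wait
  fi
done
# Catch the last batch, which may hold fewer than 4 jobs
wait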

It does the trick. It only lets 4 processes run at the same time. It does have its problems though. All four of the gunzip processes have to finish before the next batch begins. This means that if there is one big file in a batch, all the other cores will sit idle until it’s done. That’s no good, plus it just seems like a mess of code.

Yes, the next step would be to capture the PIDs of the background processes, hold them in an array, and keep track of which are still running, but now this has turned into a CS 101 project.
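For what it's worth, that bookkeeping can be kept fairly short in a newer shell. The sketch below assumes bash 4.3 or later, which has wait -n to block until any one background job exits. gunzip_limited is a made-up helper name for illustration, not something from the scripts above.

```shell
# Hypothetical helper: gunzip every file given, with at most $1 jobs at a time.
# Assumes bash 4.3+ (for `wait -n`).
gunzip_limited() {
  local max_jobs=$1
  shift
  local f
  for f in "$@"; do
    # While the running-job count is at the limit, wait for any one job to exit
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]; do
      wait -n
    done
    gunzip "$f" &
  done
  # Wait for whatever is still running at the end
  wait
}
```

It would be called as gunzip_limited 4 *.gz. Unlike the batch loop, a new gunzip starts as soon as any slot frees up.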

Can we do this with one line and 2 commands? I believe so, but first a little history. xargs is a command that was designed to overcome a limitation that UNIX programs have. The limitation is that the arguments passed to a program (together with its environment) cannot exceed a fixed total size in bytes. This is set by the OS and there’s not much you can say or do about it. To get the value run getconf ARG_MAX.

On my 32 bit Linux machine the limit = 131072
On my 32 bit OSX 10.6.3 Macbook Pro the limit = 262144
On my 64 bit FreeBSD machine the limit = 262144
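To check the value on your own machine, something like the following should do. The variable name arg_max is only for illustration; the number reported is a size in bytes.

```shell
# Ask the system for the maximum combined size of arguments and environment
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this machine: ${arg_max} bytes"
```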

These limits were much lower in the “good-old-days”, and that’s when xargs came in handy. I really can’t think of an example nowadays that needs that many arguments without being contrived, so I will leave that as an exercise for the reader.

xargs takes the parameters you want to feed into a program and chops them up into pieces that fit within the limit set by the system. This means the program called by xargs may be run several times, once per piece; by default those runs happen one after another. The user can control the size of the pieces used by xargs, and how many of them run at once. That’s where this tool comes in handy.

xargs can take the -P flag to tell it how many processes to run at the same time. It can also take -n, which sets the maximum number of arguments from the input that go into each invocation.
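A quick way to see the chunking is to let xargs call echo. This is just a sketch with made-up input; -P is left out so the output order stays predictable.

```shell
# Five arguments, at most two per invocation: echo ends up running three times
chunks=$(printf '%s\n' a b c d e | xargs -n 2 echo)
printf '%s\n' "$chunks"
# a b
# c d
# e
```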

So now we can do what we were trying to do with the previous scripts in one line. All that is needed is to tell xargs to run as many processes at once as there are cores in the system and to hand one argument (here, one file name) to each process.

ls *.gz | xargs -P 4 -n 1 gunzip

The beauty here is xargs takes care of the scheduling and everything, so the entire system is loaded, but not overloaded, while it uncompresses all the files. That’s very cool!
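If your file names might contain spaces, or you don't want to hard-code the core count, a more defensive variant could look like the sketch below. It assumes an xargs that understands -0 (GNU and BSD xargs both do) and that getconf _NPROCESSORS_ONLN works on your system, falling back to 4 if it doesn't. gunzip_parallel is a hypothetical wrapper name.

```shell
# Assumption: getconf _NPROCESSORS_ONLN reports the online cores; otherwise use 4
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)

# Hypothetical wrapper: gunzip all files given as arguments, one per process
gunzip_parallel() {
  # Nothing to do when no files were passed
  [ "$#" -gt 0 ] || return 0
  # NUL-separate the names so spaces in file names survive the pipe
  printf '%s\0' "$@" | xargs -0 -P "$cores" -n 1 gunzip
}
```

Called as gunzip_parallel *.gz, this does the same job as the one-liner above.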

After finding this out I keep finding uses for this command. If there is a sed transformation I need to run over many files, or anything else computationally expensive, I do it this way. Or if you need to grep through a large directory structure, it can be parallelized! Run up to 4 processes at a time, each one looking at 10 files.

find /home/CVSROOT -type f | xargs -P 4 -n 10 grep -H 'string-to-search'

The possibilities are really endless here.