Running a limited number of child processes in parallel in bash? [duplicate]

I have a large set of files that need some heavy processing. The processing is single-threaded, uses a few hundred MiB of RAM (on the machine used to start the job), and takes a few minutes to run. My current use case is starting a Hadoop job on the input data, but I've had this same problem in other cases before.

In order to fully utilize the available CPU power, I want to be able to run several of those tasks in parallel.

However, a very simple shell script like this will thrash the system due to excessive load and swapping, because it starts every task at once:

find . -type f | while read -r name
do
   some_heavy_processing_command "${name}" &
done

So what I want is essentially similar to what "gmake -j4" does.

I know bash supports the "wait" command, but that only waits until all child processes have completed. In the past I've written scripts that run 'ps' and then grep the child processes out by name (yes, I know ... ugly).

What is the simplest/cleanest/best solution to do what I want?

Edit: Thanks to Frederik: yes, this is indeed a duplicate of "How to limit number of threads/sub-processes used in a function in bash". The "xargs --max-procs=4" works like a charm. (So I voted to close my own question.)
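
For reference, a minimal sketch of the xargs approach mentioned in the edit. It assumes an xargs that supports -P (the short form of --max-procs in GNU xargs) and -0, which pairs with find -print0 so that file names containing spaces survive. process_one is a hypothetical stand-in for some_heavy_processing_command, and the scratch directory exists only to make the sketch runnable.

```shell
#!/usr/bin/env bash
# Sketch: run at most 4 workers at a time with xargs.
# process_one is a hypothetical stand-in for some_heavy_processing_command.

workdir=$(mktemp -d)            # scratch directory with a few sample inputs
touch "$workdir/a.dat" "$workdir/b.dat" "$workdir/file with space.dat"

process_one() {
    # Pretend to do heavy work, then leave a marker next to the input.
    sleep 1
    touch "$1.done"
}
export -f process_one           # make the function visible to the bash started by xargs

# -print0 / -0 : NUL-separated names, safe for spaces and newlines
# -n 1         : one file per invocation
# -P 4         : at most 4 invocations running at once (same as --max-procs=4)
find "$workdir" -type f -name '*.dat' -print0 |
    xargs -0 -n 1 -P 4 bash -c 'process_one "$1"' _

done_count=$(find "$workdir" -type f -name '*.done' | wc -l)
echo "processed $done_count files"
```

In real use the bash -c wrapper is not needed; the command name can follow the xargs options directly, as in xargs -0 -n 1 -P 4 some_heavy_processing_command.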

Solution 1:

I know I'm late to the party with this answer, but I thought I would post an alternative that, IMHO, makes the body of the script cleaner and simpler. (Clearly you can change the values 2 and 5 to suit your scenario.)

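function max2 {
   # Block while two or more background jobs are still running.
   while [ "$(jobs -r | wc -l)" -ge 2 ]
   do
      sleep 5
   done
}

# Feed the loop through process substitution instead of a pipe, so the loop
# runs in the current shell and the final wait can see the background jobs.
while IFS= read -r name
do
   max2; some_heavy_processing_command "${name}" &
done < <(find . -type f)
wait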
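
The polling loop above only wakes up every 5 seconds. As an alternative sketch, assuming bash 4.3 or newer, wait -n returns as soon as any one background job finishes, so a free slot is refilled right away. This is a different technique from the one in this answer; process_one, max_jobs, and the sample list of names are hypothetical stand-ins.

```shell
#!/usr/bin/env bash
# Sketch: cap concurrency with wait -n (requires bash 4.3+).

max_jobs=2
logfile=$(mktemp)

process_one() {
    # Hypothetical stand-in for some_heavy_processing_command.
    sleep 1
    echo "done $1" >> "$logfile"
}

for name in one two three four five
do
    # If max_jobs jobs are already running, block until one of them exits.
    while [ "$(jobs -rp | wc -l)" -ge "$max_jobs" ]
    do
        wait -n
    done
    process_one "$name" &
done
wait    # wait for the remaining jobs

finished=$(wc -l < "$logfile")
echo "finished $finished jobs"
```

The for loop runs in the current shell, so the final wait sees every job. With input coming from find, the same loop body would sit inside a while read loop fed by process substitution.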

Solution 2:

#! /usr/bin/env bash

set -o monitor
# enable job control, so each background job runs in its own process group
trap add_next_job CHLD
# run add_next_job whenever a child process exits (SIGCHLD)

todo_array=($(find . -type f)) # places output into an array (splits on whitespace, so file names with spaces break)

index=0
max_jobs=2

function add_next_job {
    # if there are still jobs to do, start the next one
    if [[ $index -lt ${#todo_array[*]} ]]
    # note: the hash above is not a comment; ${#array[*]} is bash's syntax for the array length
    then
        echo "adding job ${todo_array[$index]}"
        do_job "${todo_array[$index]}" &
        # replace do_job above with the command you want
        index=$(($index+1))
    fi
}

function do_job {
    echo "starting job $1"
    sleep 2
}

# add initial set of jobs
while [[ $index -lt $max_jobs ]]
do
    add_next_job
done

# wait for all jobs to complete
wait
echo "done"

Having said that, Fredrik makes the excellent point that xargs does exactly what you want...

Solution 3:

With GNU Parallel it becomes simpler:

find . -type f | parallel  some_heavy_processing_command {}

Learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Solution 4:

I think I found a handier solution using make:

#!/usr/bin/make -f

THIS := $(lastword $(MAKEFILE_LIST))
TARGETS := $(shell find . -name '*.sh' -type f)

.PHONY: all $(TARGETS)

all: $(TARGETS)

$(TARGETS):
	some_heavy_processing_command $@

$(THIS): ; # Prevent make from trying to remake this makefile

Save it as e.g. 'test.mak' and make it executable. If you call ./test.mak, it runs some_heavy_processing_command on the files one by one. If you call ./test.mak -j 4, it runs four subprocesses at once. You can also use it in a more sophisticated way: with ./test.mak -j 5 -l 1.5 it runs at most 5 subprocesses while the system load is under 1.5, and holds back new processes when the load exceeds 1.5.

It is more flexible than xargs, and make is part of the standard distribution, unlike parallel.