How to parallelize for-loop in bash limiting number of processes
I have a bash script similar to:
NUM_PROCS=$1
NUM_ITERS=$2
for ((i=0; i<$NUM_ITERS; i++)); do
  python foo.py $i arg2 &
done
What's the most straightforward way to limit the number of parallel processes to NUM_PROCS? I'm looking for a solution that doesn't require packages/installations/modules (like GNU Parallel) if possible.
When I tried Charles Duffy's latest approach, I got the following trace from bash -x:
+ python run.py args 1
+ python run.py ... 3
+ python run.py ... 4
+ python run.py ... 2
+ read -r line
+ python run.py ... 1
+ read -r line
+ python run.py ... 4
+ read -r line
+ python run.py ... 2
+ read -r line
+ python run.py ... 3
+ read -r line
+ python run.py ... 0
+ read -r line
... continuing with other numbers between 0 and 5, until too many processes were started for the system to handle and the bash script was shut down.
Solution 1:
Bash 4.4 will have an interesting new type of parameter expansion that simplifies Charles Duffy's answer.
#!/bin/bash
num_procs=$1
num_iters=$2
num_jobs="\j"   # the prompt escape for number of jobs currently running
for ((i=0; i<num_iters; i++)); do
  while (( ${num_jobs@P} >= num_procs )); do
    wait -n
  done
  python foo.py "$i" arg2 &
done
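If the @P expansion is unfamiliar: ${var@P} expands the value of var as if it were a prompt string, and \j is the prompt escape for the number of jobs the shell is currently managing. A quick sanity check, assuming bash 4.4+ (the sleep commands here are just stand-in jobs, not from the answer):

num_jobs="\j"
sleep 5 & sleep 5 &
echo "${num_jobs@P}"   # prints 2 while both background sleeps are running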
Solution 2:
GNU, macOS/OSX, FreeBSD and NetBSD can all do this with xargs -P, no new bash version or package install required. Here are 4 processes at a time:
printf "%s\0" {1..10} | xargs -0 -I @ -P 4 python foo.py @ arg2
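To plug this into the question's num_procs / num_iters interface, a minimal adaptation (a sketch, not part of the answer above; it keeps the 0..NUM_ITERS-1 numbering from the original loop):

num_procs=$1
num_iters=$2
# with -I, xargs runs one command per input line; -P caps how many run concurrently
seq 0 "$((num_iters - 1))" | xargs -I @ -P "$num_procs" python foo.py @ arg2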
Solution 3:
As a very simple implementation, depending on a version of bash new enough to have wait -n (to wait until just the next job exits, as opposed to waiting for all jobs):
#!/bin/bash
# ^^^^ - NOT /bin/sh!
num_procs=$1
num_iters=$2
declare -A pids=( )
for ((i=0; i<num_iters; i++)); do
  while (( ${#pids[@]} >= num_procs )); do
    wait -n   # block until any one background job exits
    for pid in "${!pids[@]}"; do
      # prune table entries for jobs that have already exited
      kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
    done
  done
  python foo.py "$i" arg2 & pids["$!"]=1
done
If running on a shell without wait -n, one can (very inefficiently) replace it with a command such as sleep 0.2, to poll every 1/5th of a second.
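Concretely, only the wait -n line changes; a sketch of the polling variant of the inner loop above:

while (( ${#pids[@]} >= num_procs )); do
  sleep 0.2   # poll every 1/5th of a second instead of blocking on wait -n
  for pid in "${!pids[@]}"; do
    kill -0 "$pid" &>/dev/null || unset "pids[$pid]"
  done
done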
Since you're actually reading input from a file, another approach is to start N subprocesses, each of which processes only lines where (linenum % N == threadnum):
num_procs=$1
infile=$2
for ((i=0; i<num_procs; i++)); do
  (
    while read -r line; do
      echo "Thread $i: processing $line"
    done < <(awk -v num_procs="$num_procs" -v i="$i" \
               'NR % num_procs == i { print }' <"$infile")
  ) &
done
wait # wait for all the $num_procs subprocesses to finish
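To illustrate the striping: awk's NR is 1-based, so with num_procs=3, thread 0 gets lines 3, 6, 9, ...; thread 1 gets lines 1, 4, 7, ...; and thread 2 gets lines 2, 5, 8, .... A quick check of thread 0's share (toy input, not from the question):

printf '%s\n' one two three four five six | awk -v num_procs=3 -v i=0 'NR % num_procs == i { print }'
# prints "three" and "six" -- the lines thread 0 would process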
Solution 4:
Are you aware that if you are allowed to write and run your own scripts, then you can also use GNU Parallel? In essence, it is a Perl script in a single file.
From the README:
= Minimal installation =
If you just need parallel and do not have 'make' installed (maybe the system is old or Microsoft Windows):
wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel
cp parallel sem
mv parallel sem dir-in-your-$PATH/bin/
seq $2 | parallel -j$1 python foo.py {} arg2
parallel --embed (available since 20180322) even makes it possible to distribute GNU Parallel as part of a shell script (i.e. no extra files needed):
parallel --embed >newscript
Then edit the end of newscript.
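For example, after parallel --embed >newscript, everything above the tail of newscript is the generated copy of GNU Parallel; only the sample command at the bottom is yours to replace. One possible ending, adapted to this question (a sketch, assuming the same $1/$2 arguments as the other solutions):

num_procs=$1
num_iters=$2
seq "$num_iters" | parallel -j"$num_procs" python foo.py {} arg2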