How to parallelize the scp command?

I need to scp files from machineB and machineC to machineA. I am running the shell script below from machineA, and I have set up the ssh keys properly.

If a file is not present on machineB, then it should be on machineC. I need to move all the PARTITION1 and PARTITION2 files into the respective folders on machineA, as shown in my shell script below -

#!/bin/bash

readonly PRIMARY=/export/home/david/dist/primary
readonly SECONDARY=/export/home/david/dist/secondary
readonly FILERS_LOCATION=(machineB machineC)
readonly MAPPED_LOCATION=/bat/data/snapshot
PARTITION1=(0 3 5 7 9)
PARTITION2=(1 2 4 6 8)

dir1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} ls -dt1 "$MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)
dir2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} ls -dt1 "$MAPPED_LOCATION"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9] | head -n1)

length1=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[0]} "ls '$dir1' | wc -l")
length2=$(ssh -o "StrictHostKeyChecking no" david@${FILERS_LOCATION[1]} "ls '$dir2' | wc -l")

if [ "$dir1" = "$dir2" ] && [ "$length1" -gt 0 ] && [ "$length2" -gt 0 ]
then
    rm -r "$PRIMARY"/*
    rm -r "$SECONDARY"/*
    for el in "${PARTITION1[@]}"
    do
        scp david@${FILERS_LOCATION[0]}:"$dir1"/t1_weekly_1680_"$el"_200003_5.data "$PRIMARY"/. || scp david@${FILERS_LOCATION[1]}:"$dir2"/t1_weekly_1680_"$el"_200003_5.data "$PRIMARY"/.
    done
    for sl in "${PARTITION2[@]}"
    do
        scp david@${FILERS_LOCATION[0]}:"$dir1"/t1_weekly_1680_"$sl"_200003_5.data "$SECONDARY"/. || scp david@${FILERS_LOCATION[1]}:"$dir2"/t1_weekly_1680_"$sl"_200003_5.data "$SECONDARY"/.
    done
fi

Currently I have 5 files each in PARTITION1 and PARTITION2, but in general there will be around 420 files, so the script will be copying the files one by one, which I think might be pretty slow. Is there any way to speed up the process?

I am running Ubuntu 12.04.


Solution 1:

Parallelizing scp is counterproductive unless both sides are running on SSDs. The slowest part of scp is either the network, in which case parallelizing won't help at all, or the disks on either side, which you'll make worse by parallelizing: seek time is going to kill you.

You say machineA is on SSD, so parallelizing per machine should be enough. The simplest way to do that is to wrap the first for loop in a subshell and background it:

( for el in "${PARTITION1[@]}"
do
    scp david@${FILERS_LOCATION[0]}:"$dir1"/t1_weekly_1680_"$el"_200003_5.data "$PRIMARY"/. || scp david@${FILERS_LOCATION[1]}:"$dir2"/t1_weekly_1680_"$el"_200003_5.data "$PRIMARY"/.
done ) &
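
The second for loop then runs in the foreground at the same time. If anything later in the script depends on the PARTITION1 copies having finished, add a wait after the second loop so the script blocks until the backgrounded subshell is done.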

Solution 2:

You could use GNU Parallel to help you run multiple tasks in parallel.
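
For example, here is a minimal sketch, assuming GNU parallel is installed and reusing the variables from the question's script (the machineC fallback is left out for brevity):

# copy the PARTITION1 files with up to 4 concurrent scp processes;
# {} is replaced by each partition number in turn
parallel -j 4 scp david@${FILERS_LOCATION[0]}:"$dir1"/t1_weekly_1680_{}_200003_5.data "$PRIMARY"/. ::: "${PARTITION1[@]}"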

However, in your situation it appears that you're establishing a separate secure connection for each file transfer, which is likely quite inefficient, especially if the other machines are not on a local network.
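
If you do stick with per-file scp, you can also reduce that per-connection overhead with OpenSSH connection multiplexing (ControlMaster), so each scp reuses one master connection instead of doing a full handshake every time. A minimal sketch for ~/.ssh/config on machineA (adjust the Host patterns to your machine names):

# reuse one ssh connection per host for all ssh/scp invocations
Host machineB machineC
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 10m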

The best approach would be to use a tool that specifically does batch file transfer — for example, rsync, which can work over plain ssh, too.
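
A minimal sketch of the rsync approach, reusing the question's variables; only the machineB-to-PRIMARY copy is shown, and the machineC fallback and the SECONDARY copy would follow the same pattern:

# build the PARTITION1 file list and transfer the whole batch
# over a single ssh connection
for el in "${PARTITION1[@]}"; do
    echo "t1_weekly_1680_${el}_200003_5.data"
done | rsync -av --files-from=- david@${FILERS_LOCATION[0]}:"$dir1"/ "$PRIMARY"/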

If rsync is not available, you could alternatively use zip, or even tar with gzip or bzip2, and then scp the resulting archives (then connect with ssh and do the unpacking).
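
A minimal sketch of the tar-and-scp variant, again reusing the question's variables; the archive path /tmp/partition1.tar.gz is just a placeholder, and since the files are being pulled to machineA the unpacking here happens locally:

# pack the PARTITION1 files into one archive on machineB,
# copy that single archive, then unpack it into PRIMARY
files=$(printf 't1_weekly_1680_%s_200003_5.data ' "${PARTITION1[@]}")
ssh david@${FILERS_LOCATION[0]} "cd '$dir1' && tar czf /tmp/partition1.tar.gz $files"
scp david@${FILERS_LOCATION[0]}:/tmp/partition1.tar.gz "$PRIMARY"/
tar xzf "$PRIMARY"/partition1.tar.gz -C "$PRIMARY" && rm "$PRIMARY"/partition1.tar.gz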