GNU Parallel usage - how to get the string currently passed to parallel?

I am using GNU parallel and want to understand how I can get the individual string passed to each parallel command.

As an example, GNU Parallel documentation shows how to move files from the current directory to another:

ls | parallel mv {} destdir

So is there a way to get or print each individual file name that was passed to parallel?

Case for parallel processing

I need to check multiple sites in parallel and record:

  • The HTTP return code (2xx, 4xx, 5xx)
  • The source URL
  • The ultimate destination URL
  • The curl exit code

Here is the code which does this:

    unset return_code_array
    unset destination_url_array
    unset exit_code_array

    while read -r return_code_var destination_url_var exit_code_var; do

        destination_url_array+=("$destination_url_var")
        exit_code_array+=("$exit_code_var")
        return_code_array+=("$return_code_var")

    done < <(printf '%s\n' "${all_valid_URLs_array[@]}" | parallel -j 20 -k 'curl --max-time 20 -sL -o /dev/null -w "%{response_code} %{url_effective} " {}; printf "%s %s\n" "$?" ')

As a result, I have three arrays that hold the HTTP return code, the ultimate destination URL, and the curl exit code for each corresponding entry of all_valid_URLs_array. At the same time I need to do some processing for each destination_url_var, like checking whether it matches the source URL, but I have no idea how to get the string that was passed to parallel.

Currently, I run a second loop after the one above for such processing, but I want to know if what I want to accomplish is possible.

Thanks.


In your example 'curl … {}; printf "%s %s\n" "$?" ' (why the second %s?) is a single-quoted piece of shell code. In it you can use {} more than once:

curl … {}; printf "%s %s\n" "$?" {}

Alternatively, create a variable and use it as many times as you want. One advantage is that the name of the variable can be descriptive. Another: in general, what gets substituted for {} can be a long string, and substituting it many times may bloat the code parallel passes to each shell. IMO it's better to substitute once and let the shell save the string and reuse it:

source_URL={}; curl … "$source_URL"; printf "%s %s\n" "$?" "$source_URL"

In the case of GNU parallel it's safe to embed {} in the shell code, because by default parallel shell-quotes the string it substitutes. It's an exception explicitly mentioned in this answer: Never embed {} in the shell code!. You probably already know this; the remark is for a general audience.

Note that you need to adjust read in the main loop: it now has to read into four variables. This way you will transfer the source URL from the inside of parallel to the main loop, where you can compare it to destination_url_var or do whatever you want.

Still, in this approach "whatever you want" is not parallelized.

If you capture the output from curl in variables inside the shell code run by parallel (instead of just printing it to be captured outside of parallel), then you will be able to do the comparison (or whatever you want) there, in parallel, and e.g. printf conditionally. It's up to you where you implement the desired logic, as long as the inside of parallel generates output in the form expected by the outside read.

The shell code passed to parallel still needs to be single-quoted. As it grows, you may need to use (embed) single-quotes in this very code; then quoting will get somewhat complicated and less readable. In such a situation consider moving the code to a separate script where you can quote independently. You will invoke it from the main script like this:

while read … ; do … ; done < <( … | parallel -j 20 -k 'path/to/separate_script {}' )

Inside separate_script, the string substituted for {} will be available as $1 (don't forget to double-quote it).