Compute checksums for a random selection of files whose names are listed in a file

Let's say I have a file named list_of_files.txt where each line corresponds to a file on the disk. For example:

dir1/fileA.ext1
dir1/subdir1/fileB.ext2
fileC.ext3
dir2/fileD.ext4
fileE.ext5

I want to randomly select a number of files from that list and compute cksum or md5sum for them.

I know that I can randomly select say 3 files with shuf -n 3 list_of_files.txt, but how do I make cksum treat them as file names instead of text content?


Solution 1:

If paths in the file are newline-terminated and provided as-is, i.e. if each line is a separate verbatim path, then a shell loop will do:

shuf -n 3 list_of_files.txt | while IFS= read -r pth; do
   cksum "$pth"
done
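To try the loop end to end, one can recreate the example layout in a scratch directory (the file contents below are made up; only the names come from the question):

```shell
# Scratch setup mirroring the example list from the question.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
mkdir -p dir1/subdir1 dir2
for f in dir1/fileA.ext1 dir1/subdir1/fileB.ext2 \
         fileC.ext3 dir2/fileD.ext4 fileE.ext5; do
    printf 'sample content\n' > "$f"     # arbitrary content
    printf '%s\n' "$f" >> list_of_files.txt
done

# Randomly pick 3 paths and checksum each one.
shuf -n 3 list_of_files.txt | while IFS= read -r pth; do
    cksum "$pth"
done
```

Each run prints three cksum lines, for a different random subset of the five files.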

There are also xargs (see the POSIX specification; GNU xargs is more advanced) and GNU parallel (note that a non-GNU parallel exists and I'm not referring to it). With the right tool and the proper options you can make one cksum process handle more than one path (spawning fewer cksum processes is beneficial in general) or run two or more cksum processes in parallel.

To process as few as three files I would stick with our shell loop for portability, unless the files are big and I expect three cksum processes running in parallel to be substantially faster than one cksum at a time. I'm not an expert in GNU parallel, but a solution seems to be as simple as:

 shuf -n 3 list_of_files.txt | parallel cksum

By default GNU parallel limits the number of simultaneous jobs to the number of CPU cores. Three or more cores are common nowadays, so the command will probably run three cksum processes in parallel. Formally this is not portable though. Also note that processing three files in parallel means reading three files in parallel; I/O may be a bottleneck, which can reduce the benefit of parallel jobs or even make things worse.

Even then, parallel may be useful. Use -j 1 to limit the number of jobs to 1:

 shuf -n 3 list_of_files.txt | parallel -j 1 cksum

The files will be processed sequentially, like in our shell loop, but the syntax is simpler. With the shell loop you need to know to write IFS= read -r pth, not just read pth; and you need to know that (in many shells) you want cksum "$pth", not cksum $pth. The solution with GNU parallel is less error-prone. KISS.
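To see concretely why the careful form matters, compare both variants on a path containing a space (a contrived name for illustration):

```shell
# A file whose name contains a space (contrived example).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'data\n' > 'my file.txt'
printf '%s\n' 'my file.txt' > list.txt

# Careful version: the line survives as one verbatim path.
while IFS= read -r pth; do
    cksum "$pth"                          # one argument: "my file.txt"
done < list.txt

# Sloppy version: unquoted $pth is field-split into "my" and "file.txt",
# neither of which exists, so cksum fails.
while read pth; do
    cksum $pth 2>/dev/null || echo 'sloppy version failed'
done < list.txt
```

The careful version prints one checksum line; the sloppy version prints the failure message.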

Note that xargs by default interprets quotes and backslashes, and it treats blanks as delimiters. This means shuf -n 3 list_of_files.txt | xargs cksum is probably not what you want. Your example will work, but in general you either need additional quotes and/or backslashes in the file, or you need xargs -d '\n', where -d is a non-portable option of GNU xargs. My assumption was that paths in the file are newline-terminated and provided as-is. Under this assumption GNU parallel works out of the box (i.e. without additional options); xargs doesn't. With GNU xargs you can do this:

shuf -n 3 list_of_files.txt | xargs -d '\n' cksum

If you can use GNU xargs (to save the day with -d '\n') then probably you can use GNU parallel. If you forget -j 1 when using GNU parallel, the command may perform worse but it will still work. If you forget -d '\n' when using GNU xargs and the pathnames are provided as-is, then it's a bug. That's why I recommended GNU parallel first.

GNU parallel is capable of processing null-terminated strings (the option is -0), as are GNU xargs (-0 instead of -d '\n') and GNU shuf (with -z). Your input file uses newline-terminated lines, but if you ever need to work with pathnames that (may) contain newline characters, then changing the terminator in the file and adding the proper options is the way to go.
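A sketch of that null-terminated variant with GNU tools (the list file list0.txt and both file names are hypothetical; one name deliberately contains a newline):

```shell
# Build a NUL-terminated list, including a pathname with an embedded newline.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
awkward='weird
name.txt'
printf 'x\n' > "$awkward"
printf 'y\n' > plain.txt
printf '%s\0' "$awkward" plain.txt > list0.txt

# GNU shuf -z treats NUL as the record separator on input and output;
# GNU xargs -0 consumes NUL-terminated arguments.
shuf -z -n 2 list0.txt | xargs -0 cksum
# The GNU parallel equivalent: shuf -z -n 2 list0.txt | parallel -0 cksum
```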