Why are not all files compressed and how to improve the solution

I have a folder with about 20K files. The files are named according to the pattern xy_\d{1,5}_\d{4}\.abc, e.g. xy_12345_1234.abc. I wanted to compress the first 10K of them using this command:

ls | sort -n -k1.4,1.9 | head -n10000 | xargs tar -czf xy_0_10000.tar.gz

However, the resulting archive had only about 2K files inside.

ls | sort -n -k1.4,1.9 | head -n10000 | wc -l, however, returns 10000, as expected.

It seems to me that I am misunderstanding something basic here...

I am using zsh 5.0.2 on Linux Mint 17.1, GNU tar 1.27.1

EDIT:

Forking, as suggested by @Archemar, sounds very plausible, with the last fork overwriting the resulting file: the archive contains the 'tail' of the files, 7773 to 9999.

Result of xargs --show-limit:

  Your environment variables take up 3973 bytes
  POSIX upper limit on argument length (this system): 2091131
  POSIX smallest allowable upper limit on argument length (all systems): 4096
  Maximum length of command we could actually use: 2087158
  Size of command buffer we are actually using: 131072

Replacing -c with -r or -u did not work in my case. The error message was tar: Cannot update compressed archives

Using both -r and -u is invalid and fails with tar: You may not specify more than one '-Acdtrux', '--delete' or '--test-label' option

Replacing -c with -a seems to be invalid as well and fails with a similar message, tar: You must specify one of the '-Acdtrux', '--delete' or '--test-label' options, though I don't understand the problem: azf and Acdtrux seem disjoint to me.

EDIT 2:

-T looks like a good way, I have also found an example here.

However, when I try

ls | sort -n -k1.4,1.9 | head -n10000 | tar -czf xy_0_10000.tar.gz -T - I get tar: option requires an argument -- 'T'

Well, perhaps the filenames don't reach tar? But it looks like they do, because when I execute

ls | sort -n -k1.4,1.9 | head -n10000 | tar --null -czf xy_0_10000.tar.gz -T - I get tar: xy_0_.ab\nxy_1_...<the rest of filenames separated by literal \n>...998.ab Cannot stat: File name too long

So why is tar not seeing the filenames?


Have you hit the xargs limit? You can check it with

xargs --show-limit

Try:

  • Create a dummy .tgz file: tar czf xy_0_10000.tar.gz /hello/world
  • Replace -czf by -Azf

When xargs hits its limit, it splits the argument list and runs the command once per chunk, so the commands you ultimately ran were

  tar czf xy_0_10000.tar.gz file1 file2 .... file666
  tar czf xy_0_10000.tar.gz file667 file668 ... file1203
  tar czf xy_0_10000.tar.gz file1204 ... file2000

As each tar c run overwrites the archive written by the previous one, you should be getting only the files from the last run.
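
The overwriting can be reproduced on a small scale. The following is a minimal sketch, assuming GNU tar and a throwaway directory created with mktemp; the file names and the -n 4 batch size are made up for illustration, with -n standing in for the size limit that made xargs split the real list.

```shell
#!/bin/sh
# Hypothetical scratch directory and file names, for illustration only.
workdir=$(mktemp -d)
mkdir "$workdir/files"
cd "$workdir/files" || exit 1
for i in 1 2 3 4 5 6 7 8 9 10; do
    : > "file$i"
done

# -n 4 makes xargs start a new tar for every 4 names (batches of 4, 4 and 2),
# imitating what happens when its command buffer fills up.
ls | xargs -n 4 tar -czf "$workdir/out.tar.gz"

# Each "tar -c" recreated the archive, so only the last batch is left.
overwritten_count=$(tar -tzf "$workdir/out.tar.gz" | wc -l | tr -d ' ')
echo "members in archive: $overwritten_count"
```

With ten files and batches of four, the archive ends up holding only the two names from the final batch.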

Edit:

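1) According to man tar on Ubuntu, -r (--append) appends files to an existing archive, while -A (--catenate, --concatenate) appends other tar archives. Lowercase -a is --auto-compress and is not an operation at all. None of them can modify a compressed archive, which explains the errors reported in the question.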

2) zip (not gzip) can add files to an existing archive, so it copes with several xargs runs; maybe a gzip option will do the trick as well. (Use | xargs zip -qr xy_0_10000.zip; this results in a zip file, not a .tar.gz, however.)

3) To use @rsanchez's solution, it is important to pass the option to tar in the proper way. Try

ls | sort -n -k1.4,1.9 | head -n10000 | tar -czf xy_0_10000.tar.gz -T -

where -T - means: use option -T, with - (standard input) as the argument to -T. (You could also have generated the list of files in /tmp/foo.lst and then used -T /tmp/foo.lst.)
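
The list-file variant might look like the sketch below. It assumes GNU tar; the scratch directory, the 20 sample files and the list file name are hypothetical stand-ins for the real folder and /tmp/foo.lst.

```shell
#!/bin/sh
# Hypothetical scratch directory with 20 sample files named like the real ones.
listdir=$(mktemp -d)
mkdir "$listdir/files"
cd "$listdir/files" || exit 1
i=1
while [ "$i" -le 20 ]; do
    : > "xy_${i}_1234.abc"
    i=$((i + 1))
done

# Write the selected names to a list file, one name per line.
ls | sort -n -k1.4,1.9 | head -n 10 > "$listdir/foo.lst"

# Here -T gets the list file as its argument instead of "-" (standard input).
tar -czf "$listdir/xy_first10.tar.gz" -T "$listdir/foo.lst"

list_member_count=$(tar -tzf "$listdir/xy_first10.tar.gz" | wc -l | tr -d ' ')
echo "members in archive: $list_member_count"
```

Keeping the list in a file also makes it easy to inspect exactly which names tar was given.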


There's no need for xargs. If you directly give tar the -T - option it will read the filenames from standard input.

For instance:

... | tar -T - -czf xy_0_10000.tar.gz
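
As a way to check this on a small scale, here is a minimal sketch assuming GNU tar; the scratch directory, the 30 sample files and the archive name are invented for the example and only mimic the naming pattern from the question.

```shell
#!/bin/sh
# Hypothetical scratch directory with 30 sample files.
workdir=$(mktemp -d)
mkdir "$workdir/files"
cd "$workdir/files" || exit 1
i=1
while [ "$i" -le 30 ]; do
    : > "xy_${i}_1234.abc"
    i=$((i + 1))
done

# A single tar process reads all selected names from standard input,
# so nothing is split into batches and nothing gets overwritten.
ls | sort -n -k1.4,1.9 | head -n 12 | tar -czf "$workdir/xy_first12.tar.gz" -T -

stdin_member_count=$(tar -tzf "$workdir/xy_first12.tar.gz" | wc -l | tr -d ' ')
echo "members in archive: $stdin_member_count"
```

Listing the archive with tar -tzf afterwards is a quick way to confirm that every selected name made it in.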

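I want to complement the two other answers with a zsh solution, which neither parses ls nor needs xargs. Note, however, that tar is still an external command here, so the file list remains subject to the system's limit on command-line length (roughly 2 MB according to the xargs --show-limit output in the question).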

  1. Define a function which generates your desired sorting key by modifying $REPLY.

    sortkey() { REPLY=${REPLY[4,9]} }
    

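    This selects the same characters as the key in your sort -n -k1.4,1.9. Note that zsh compares these keys as strings by default, whereas sort -n compares them numerically.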

  2. Generate an array $files with the filenames sorted with the above function:

    files=(*(o+sortkey))
    

    This is equivalent to ls | sort -n -k1.4,1.9

  3. Select the first 10 000 files (zsh arrays are indexed from 1) with

    ${files[1,10000]}
    

    This is equivalent to ls | sort -n -k1.4,1.9 | head -n10000

So, all in all this should do the trick:

sortkey() { REPLY=${REPLY[4,9]} }
files=(*(o+sortkey))
tar -czf xy_0_10000.tar.gz ${files[1,10000]}
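
Whether characters 4 to 9 are compared as text or as numbers changes the resulting order, which is worth checking before relying on any sort key. The sketch below illustrates this with sort itself, using two hypothetical file names and LC_ALL=C so that the text comparison is byte-wise.

```shell
#!/bin/sh
# Two hypothetical names that follow the pattern from the question.
names='xy_12345_1234.abc
xy_1_1234.abc'

# Text comparison of characters 4-9: "12345_" sorts before "1_1234",
# because "2" comes before "_" in byte order.
text_first=$(printf '%s\n' "$names" | LC_ALL=C sort -k1.4,1.9 | head -n 1)

# Numeric comparison (-n) of the same key: 1 sorts before 12345.
numeric_first=$(printf '%s\n' "$names" | LC_ALL=C sort -n -k1.4,1.9 | head -n 1)

echo "first by text key:    $text_first"
echo "first by numeric key: $numeric_first"
```

If the two orders differ for your file names, the numeric order is the one that matches the original pipeline.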