Delete all but 1000 random files in a directory

I let a data generation script run too long and now have 200,000+ files, which I need to whittle down to around 1000. From the Linux command line, is there an easy way to delete all but 1000 of these files, where the choice of files to keep does not depend on filename or any other attribute?



Code:

find /path/to/dir -type f -print0 | sort -zR | tail -zn +1001 | xargs -0 rm

Explanation:

  1. List all files in /path/to/dir with find;
    • -type f: regular files only; note that find also descends into subdirectories
    • -print0: terminate each path with \0 (the null character) instead of a newline, so that file paths containing spaces or newlines don't break the pipeline
  2. Shuffle the file list with sort;
    • -z: use \0 (null character) as the delimiter, instead of \n (a newline)
    • -R: sort by a random hash of each entry, which amounts to a random order here because every path is unique
  3. Skip the first 1000 entries of the randomized list with tail;
    • -z: treat the list as zero-delimited (same as with sort)
    • -n +1001: output entries starting from the 1001st (i.e. omit the first 1000, which are the ones that survive)
  4. Remove the remaining files with xargs -0 rm;
    • -0: zero-delimited, again
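
As a rough sketch of the steps above, the following script tries the pipeline on a throwaway directory before it is pointed at real data. The scratch directory, the sample filenames, and the keep variable are made up for the demo. The -r flag (GNU xargs: run nothing on empty input) and the -- after rm are my own additions, not part of the one-liner. It assumes GNU find, sort, tail, and xargs.

```shell
set -eu

# Hypothetical scratch directory and count, chosen only for this demo.
dir=$(mktemp -d)
keep=10

# A literal newline, written portably, for an awkward test filename.
nl='
'

# Create 25 sample files, including names with a space and a newline.
for i in $(seq 1 23); do touch "$dir/file$i"; done
touch "$dir/with space" "$dir/with${nl}newline"

# Same pipeline as above, with the count parameterised.
# -r skips rm entirely when nothing is left to delete (GNU xargs).
find "$dir" -type f -print0 | sort -zR | tail -zn +"$((keep + 1))" | xargs -0 -r rm --

# Count the survivors by counting NUL terminators, which is newline-safe.
remaining=$(find "$dir" -type f -print0 | tr -cd '\0' | wc -c)
echo "remaining: $remaining"

# Clean up the scratch directory.
rm -rf "$dir"
```

To preview what would be deleted without removing anything, one option is to replace the final xargs stage with tr '\0' '\n' and read the list first.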

Why it's better than quixotic's solution*:

  1. Works with filenames containing spaces/newlines.
  2. Doesn't try to create any directories (which may already exist, btw.)
  3. Doesn't move any files, doesn't even touch the 1000 "lucky files" besides listing them with find.
  4. Avoids missing a file in case the output of find doesn't end with \n (newline) for some reason.

* - credit to quixotic for | sort -R | head -1000, which gave me a starting point.
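
To illustrate the first point, here is a small sketch of how a newline inside a filename throws off a newline-delimited listing while a NUL-delimited one stays correct. The scratch directory and filenames are hypothetical, and the counting trick with tr and wc is just one way to check; it assumes GNU find and tr.

```shell
set -eu

# Hypothetical scratch directory for the demo.
demo=$(mktemp -d)

# A literal newline, written portably.
nl='
'

# Two files: one ordinary name, one name containing a newline.
touch "$demo/plain" "$demo/two${nl}lines"

# Newline-delimited listing: the second file is split across two lines.
by_newline=$(find "$demo" -type f | wc -l)

# NUL-delimited listing: one terminator per file, whatever the name contains.
by_nul=$(find "$demo" -type f -print0 | tr -cd '\0' | wc -c)

echo "newline-delimited entries: $by_newline"
echo "NUL-delimited entries: $by_nul"

rm -rf "$demo"
```

With only two files on disk, the newline-delimited count comes out one too high, which is exactly the kind of mismatch that makes a line-based rm pipeline target paths that don't exist.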


Use a temporary directory, then find all your files, randomize the list with sort, and move the top 1000 of the list into the temporary directory. Delete the rest, then move the files back from the temporary directory.

$ mkdir ../tmp-dir
$ find . -type f | sort -R | head -1000 | xargs -I {} mv {} ../tmp-dir/
$ find . -maxdepth 1 -type f -delete    # rm ./* can fail with "Argument list too long" at 200,000+ files
$ mv ../tmp-dir/* .

If xargs complains about line length, use a smaller number with head and repeat the command as needed (i.e. change -1000 to -500 and run it twice, or change it to -200 and run it 5 times).

These commands can also mishandle filenames that contain spaces, quotes, or newlines; as @rld's answer shows, you can use find's -print0 option, the -z options to sort and head, and -0 with xargs to ensure proper filename handling.

Finally, if ../tmp-dir already exists, substitute a directory name that doesn't.
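
Putting those caveats together, a minimal sketch of the move-aside approach might look like the following. It uses mktemp -d to get a directory that is guaranteed to be new, and NUL-delimited lists throughout. The src and keepdir names, the sample files, and the count of 5 are placeholders for the demo; mv -t, head -z, and xargs -r are GNU extensions, so this assumes GNU coreutils and findutils.

```shell
set -eu

# Hypothetical stand-in for the real data directory, with 20 sample files
# whose names contain spaces.
src=$(mktemp -d)
for i in $(seq 1 20); do touch "$src/data $i.txt"; done

# Fresh holding directory; mktemp -d never reuses an existing name.
keepdir=$(mktemp -d)

cd "$src"

# Move a random selection of 5 files aside, NUL-delimited end to end.
find . -maxdepth 1 -type f -print0 | sort -zR | head -zn 5 | xargs -0 -r mv -t "$keepdir" --

# Delete everything left behind without building one huge argument list.
find . -maxdepth 1 -type f -delete

# Move the kept files back, then drop the (now empty) holding directory.
find "$keepdir" -maxdepth 1 -type f -exec mv -t . -- {} +
rmdir "$keepdir"

# Count what survived, again by counting NUL terminators.
kept=$(find . -maxdepth 1 -type f -print0 | tr -cd '\0' | wc -c)
echo "kept: $kept"

# Clean up the demo directory.
cd /
rm -rf "$src"
```

For real use, src would be the data directory and 5 would become 1000. The rmdir doubles as a sanity check, since it fails if any kept file did not make it back.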