Delete all but 1000 random files in a directory
I let a data generation script run too long and now have 200,000+ files, which I need to whittle down to around 1000. From the Linux command line, is there an easy way to delete all but 1000 of these files, where the choice of files to retain does not depend on filename or any other attribute?
Code:
find /path/to/dir -type f -print0 | sort -zR | tail -zn +1001 | xargs -0 rm
Explanation:
- List all files in /path/to/dir with find:
  - -print0: use \0 (null character) as the line delimiter, so file paths containing spaces/newlines don't break the script
- Shuffle the file list with sort:
  - -z: use \0 (null character) as the delimiter, instead of \n (a newline)
  - -R: random order
- Strip the first 1000 lines from the randomized list with tail:
  - -z: treat the list as zero-delimited (same as with sort)
  - -n +1001: show lines starting from 1001 (i.e. omit the first 1000 lines)
- xargs -0 rm: remove the remaining files:
  - -0: zero-delimited, again
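The steps above can be tried safely on a scratch directory before pointing them at real data. The following is only a sketch: keep_random is a hypothetical helper name, the -r and -- options on the last stage are my additions (so rm is not run on an empty list and odd paths are not read as options), and GNU find, sort, tail and xargs are assumed.

```shell
#!/usr/bin/env bash
# keep_random DIR COUNT: delete all but COUNT randomly chosen regular files under DIR.
# Hypothetical helper; assumes GNU find, sort, tail and xargs.
keep_random() {
    local dir=$1 keep=$2
    find "$dir" -type f -print0 |          # null-delimited list of files
        sort -zR |                         # shuffle the list
        tail -zn "+$((keep + 1))" |        # skip the first COUNT entries (the survivors)
        xargs -0 -r rm --                  # delete everything that is left in the list
}

# Try it on a throwaway directory with 50 empty files.
demo_dir=$(mktemp -d)
for i in $(seq 1 50); do : > "$demo_dir/file_$i.dat"; done
keep_random "$demo_dir" 10
remaining=$(find "$demo_dir" -type f | wc -l)
echo "files left: $remaining"
```

For the situation in the question, the call would be keep_random /path/to/dir 1000.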
Why it's better than quixotic's solution*:
- Works with filenames containing spaces/newlines.
- Doesn't try to create any directories (which may already exist, btw.)
- Doesn't move any files, doesn't even touch the 1000 "lucky files" besides listing them with find.
- Avoids missing a file in case the output of find doesn't end with \n (newline) for some reason.
* - credit to quixotic for | sort -R | head -1000, which gave me a starting point.
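To illustrate the first point, here is a small sketch that compares a newline-delimited listing with a null-delimited one. The directory and file names are made up for the demonstration, and GNU find and tr are assumed.

```shell
#!/usr/bin/env bash
# Scratch directory with three files; two of the names contain a space or a newline.
demo_dir=$(mktemp -d)
: > "$demo_dir/plain.dat"
: > "$demo_dir/with space.dat"
: > "$demo_dir/with"$'\n'"newline.dat"

# Newline-delimited listing: the embedded newline is counted as an extra entry.
by_newline=$(find "$demo_dir" -type f | wc -l)

# Null-delimited listing: count the \0 terminators instead, one per file.
by_null=$(find "$demo_dir" -type f -print0 | tr -cd '\0' | wc -c)

echo "newline-delimited entries: $by_newline"
echo "null-delimited entries:    $by_null"
```

With GNU find this should report 4 newline-delimited entries but 3 null-delimited ones, which is why every stage of the pipeline is told to use \0.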
Use a temporary directory, then find all your files, randomize the list with sort, and move the top 1000 of the list into the temporary directory. Delete the rest, then move the files back from the temporary directory.
$ mkdir ../tmp-dir
$ find . -type f | sort -R | head -1000 | xargs -I "I" mv I ../tmp-dir/
$ rm ./*
$ mv ../tmp-dir/* .
If xargs complains about line length, use a smaller number with head and repeat the command as needed (i.e., change -1000 to -500 and run it twice, or change to -200 and run it 5 times).
It will also fail to handle filenames that include spaces; as @rld's answer shows, you can use find's -print0 argument, the -z arguments to sort and head, and -0 with xargs to ensure proper filename handling.
Finally, if the tmp-dir already exists, you should substitute a directory name that doesn't exist.
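Putting those two caveats together, a null-safe variant of the temporary-directory approach might look like the sketch below. It is not the exact commands above: keep_random_via_tmp is a hypothetical name, mktemp -d stands in for picking an unused directory name by hand, -maxdepth 1 assumes all files sit directly in the directory, and GNU find, sort, head, xargs, mv and mktemp are assumed.

```shell
#!/usr/bin/env bash
# keep_random_via_tmp DIR COUNT: move COUNT random files aside, delete the rest,
# then move the kept files back. Hypothetical helper; assumes GNU tools.
keep_random_via_tmp() {
    local dir=$1 keep=$2 tmp
    # mktemp -d creates a fresh, uniquely named directory next to DIR,
    # so it cannot clash with an existing tmp-dir.
    tmp=$(mktemp -d -p "$dir/..") || return 1
    find "$dir" -maxdepth 1 -type f -print0 |
        sort -zR |
        head -zn "$keep" |
        xargs -0 -r mv -t "$tmp" --
    # Delete what is left in DIR, then restore the kept files.
    find "$dir" -maxdepth 1 -type f -delete
    find "$tmp" -maxdepth 1 -type f -print0 | xargs -0 -r mv -t "$dir" --
    rmdir "$tmp"
}

# Try it on a throwaway directory whose file names contain spaces.
demo_dir=$(mktemp -d)
for i in $(seq 1 30); do : > "$demo_dir/file $i.dat"; done
keep_random_via_tmp "$demo_dir" 5
kept=$(find "$demo_dir" -type f | wc -l)
echo "files left: $kept"
```

Using find with -delete instead of rm ./* also sidesteps the shell's argument-length limit, which a glob over 200,000+ files could plausibly hit.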