How to delete smallest file if names are duplicate
I would like to clean up a folder with videos. I have a bunch of videos that were downloaded with different resolutions, so each file will start with the same name and then end with "_480p" or "_720p" etc.
I just want to keep the largest file of each such set.
So I am looking for a way to delete files based on
- check if name before "_" is identical
- if true, then delete all files except largest one
Thinking of a flexible and fast way to approach the problem, you can gather a list of files ending in "[[:digit:]]+p"
and then a quick way to parse the names is to provide them on stdin
to awk
and let awk
index an array with the file prefix (path + part of name before '_'
) so it will be unique for files allowing the different format size to be obtained and stored at that index.
Then it's a simply matter of comparing the stored resolution number for the file against the current file number and deleting the lesser of the two.
Your find command to locate all files in the directory below the current, recursively, could be:
find ./tmp -type f -regex "^.*[0-9]+p$"
What I would do is then pipe the filename output to a short awk
script where an array stores the last seen number for a given file prefix, and then if the current record (line) resolution number if bigger than the value stored in the array, a filename using the array number is created and that file deleted with system()
using rm filename
. If the current line resolution number is less than what is already stored in the array for the file, you simply delete the current file.
You can do that as:
#!/usr/bin/awk -f
BEGIN { FS = "/" }
{
num = $NF # last field holds number up to 'p'
prefix = $0 # prefix is name up to "_[[:digit:]]+p
sub (/^.*_/, "", num) # isolate number
sub (/p$/, "", num) # remove 'p' at and
sub (/_[[:digit:]]+p$/, "", prefix) # isolate path and name prefix
if (prefix in a) { # current file in array a[] ?
rmfile = $0 # set file to remove to current
if (num + 0 > a[prefix] + 0) { # current number > array number
rmfile = prefix "_" a[prefix] "p" # for remove filename from array
a[prefix] = num # update array with higher num
}
system ("rm " rmfile); # delete the file
}
else
a[prefix] = num # if no num for prefix in array, store first
}
(note: the field-separator splits the fields using the directory separator so you have all file components to work with.)
Example Use/Output
With a representative set of files in a tmp/
directory below the current, e,g.
$ ls -1 tmp
a_480p
a_720p
b_1080p
b_480p
c_1080p
c_720p
Running the find
command piped to the awk script named awkparse.sh
would be as follows (don't forget to make the awk script executable):
$ find ./tmp -type f -regex "^.*[0-9]+p$" | ./awkparse.sh
Looking at the directory after piping the results of find
to the awk
script, the tmp/
directory now only contains the highest resolution (largest) files for any given filename, e.g.
$ ls -1
a_720p
b_1080p
c_1080p
This would be highly efficient. It could also handle all files in a nested directory structure where multiple directory levels hold files you need to clean out. Look things over and let me know if you have questions.