How to delete smallest file if names are duplicate

I would like to clean up a folder with videos. I have a bunch of videos that were downloaded with different resolutions, so each file will start with the same name and then end with "_480p" or "_720p" etc.

I just want to keep the largest file of each such set.

So I am looking for a way to delete files based on

check if name before "_" is identical
if true, then delete all files except largest one

A flexible and fast way to approach this is to gather the list of files whose names end in "[[:digit:]]+p" and feed those names to awk on stdin. awk can index an array by the file prefix (the path plus the part of the name before the final '_'), which is the same for every copy of a given video, and store the resolution number seen for that prefix at that index.

Then it's simply a matter of comparing the resolution number stored for the prefix against the number of the current file and deleting the file with the lesser of the two.

A find command to locate all such files, recursively, in the tmp/ directory below the current one could be:

find ./tmp -type f -regex "^.*[0-9]+p$"

I would then pipe the filenames to a short awk script in which an array stores the highest number seen so far for each file prefix. If the resolution number of the current record (line) is bigger than the value stored in the array, the filename is rebuilt from the stored number, that file is deleted with system() using rm filename, and the array is updated with the new number. If the current number is less than what is already stored in the array for the prefix, you simply delete the current file.

You can do that as:

#!/usr/bin/awk -f

BEGIN { FS = "/" }
{
  num = $NF        # last field is the file name, ending in "_<digits>p"
  prefix = $0      # whole path, trimmed below to the part before "_<digits>p"
  
  sub (/^.*_/, "", num)                 # isolate number
  sub (/p$/, "", num)                   # remove 'p' at end
  sub (/_[[:digit:]]+p$/, "", prefix)   # isolate path and name prefix
  
  if (prefix in a) {                    # current file in array a[] ?
    rmfile = $0                         # set file to remove to current
    if (num + 0 > a[prefix] + 0) {      # current number > array number
      rmfile = prefix "_" a[prefix] "p" # form filename to remove from array
      a[prefix] = num                   # update array with higher num
    }
    system ("rm \"" rmfile "\"")        # delete the file (quoted for spaces)
  }
  else
    a[prefix] = num     # if no num for prefix in array, store first
}

(Note: setting the field separator to the directory separator splits each path into its components, so $NF is the bare file name and you have all path components to work with.)
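To see what the three sub() calls leave behind for a given path, here is a minimal sketch. split_name is a hypothetical helper introduced only for illustration, not part of the script above, and it uses [0-9] in place of [[:digit:]] on the assumption that some older awk versions lack POSIX character classes.

```shell
#!/bin/sh
# split_name (hypothetical helper): print the prefix and the resolution
# number that the sub() calls would extract from one path.
split_name() {
  printf '%s\n' "$1" | awk '{
    num = $0                        # start from the full path
    prefix = $0
    sub(/^.*_/, "", num)            # drop everything up to the last "_"
    sub(/p$/, "", num)              # drop the trailing "p"
    sub(/_[0-9]+p$/, "", prefix)    # drop the "_<digits>p" suffix
    print prefix, num
  }'
}

split_name ./tmp/a_480p             # prints: ./tmp/a 480
```

Because the prefix keeps the directory part, two files with the same name in different directories end up under different array indexes.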

Example Use/Output

With a representative set of files in a tmp/ directory below the current one, e.g.

$ ls -1 tmp
a_480p
a_720p
b_1080p
b_480p
c_1080p
c_720p
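If you want to try this without touching real videos, a sketch like the following recreates the same six names as empty placeholder files. The throwaway directory from mktemp is an assumption made for safety and is not part of the original example.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real can be deleted by accident.
cd "$(mktemp -d)" || exit 1
mkdir tmp

# Create empty placeholder files carrying the sample names.
for name in a_480p a_720p b_1080p b_480p c_1080p c_720p; do
  : > "tmp/$name"
done

ls -1 tmp
```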

Piping the find output to the awk script, here named awkparse.sh, looks as follows (don't forget to make the awk script executable, e.g. with chmod +x awkparse.sh):

$ find ./tmp -type f -regex "^.*[0-9]+p$" | ./awkparse.sh

Looking at the directory after piping the results of find to the awk script, the tmp/ directory now only contains the highest resolution (largest) files for any given filename, e.g.

$ ls -1 tmp
a_720p
b_1080p
c_1080p

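This is efficient: find walks the tree once, a single awk process reads the names in one pass, and rm is only invoked for files that are actually deleted. Because the array is indexed by path plus name prefix, it also handles a nested directory structure where multiple directory levels hold files you need to clean out. Note that the comparison uses the resolution number in the name, on the assumption that the higher-resolution file is also the larger one. Look things over and let me know if you have questions.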
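Since rm is not reversible, it may be worth previewing the result first. The following is a minimal dry-run sketch that applies the same comparison logic as the awk script above but only prints the names that would be removed. list_removals is a hypothetical name, and [0-9] is used in place of [[:digit:]] for older awk versions.

```shell
#!/bin/sh
# list_removals (hypothetical helper): read file names on stdin and print
# the ones the cleanup would delete, without deleting anything.
list_removals() {
  awk '{
    num = $0
    prefix = $0
    sub(/^.*_/, "", num)              # isolate "<digits>p"
    sub(/p$/, "", num)                # strip the trailing "p"
    sub(/_[0-9]+p$/, "", prefix)      # path plus name before "_<digits>p"

    if (prefix in a) {
      victim = $0                     # default: current file loses
      if (num + 0 > a[prefix] + 0) {  # current file has the higher number
        victim = prefix "_" a[prefix] "p"
        a[prefix] = num
      }
      print victim
    } else
      a[prefix] = num                 # first file seen for this prefix
  }'
}

printf '%s\n' ./tmp/a_480p ./tmp/a_720p ./tmp/b_1080p ./tmp/b_480p | list_removals
# prints:
# ./tmp/a_480p
# ./tmp/b_480p
```

In real use you would feed it the same find output as the deleting script, and switch to the deleting version only once the printed list looks right.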
