Remove all but every 12th file

bash rm

I have a few thousand files in the format filename.12345.end . I only want to keep every 12th file, so file.00012.end, file.00024.end ... file.99996.end and delete everything else.

The files may also have numbers earlier in their filename, and are normally of the form: file.00064.name.99999.end

I use the Bash shell and can't figure out how to loop over the files, extract the number, and check whether number % 12 == 0, deleting the file if not. Can anyone help me?

Thank you, Dorina

Solution 1:

Here's a Perl solution. This should be much faster for thousands of files:

perl -e '@bad=grep{/(\d+)\.end/ && $1 % 12 != 0}@ARGV; unlink @bad' *

Which can be further condensed into:

perl -e 'unlink grep{/(\d+)\.end/ && $1 % 12 != 0}@ARGV;' *

If you have too many files and can't use the simple *, you can do something like:

perl -e 'opendir($d,"."); unlink grep{/(\d+)\.end/ && $1 % 12 != 0} readdir($d)'

As for speed, here's a comparison of this approach and the shell one provided in one of the other answers:

$ touch file.{01..64}.name.{00001..01000}.end
$ ls | wc
  64000   64000 1472000
$ time for f in ./* ; do file="${f%.*}"; if [[ $((10#${file##*.} % 12)) -ne 0 ]]; then rm "$f"; fi; done

real    2m44.258s
user    0m9.183s
sys     1m7.647s

$ touch file.{01..64}.name.{00001..01000}.end
$ time perl -e 'unlink grep{/(\d+)\.end/ && $1 % 12 != 0}@ARGV;' *

real    0m0.610s
user    0m0.317s
sys     0m0.290s

As you can see, the difference is enormous, as expected.

Explanation

The -e is simply telling perl to run the script given on the command line.
@ARGV is a special variable containing all the arguments given to the script. Since we're giving it *, it will contain all the files (and directories) in the current directory.
The grep will search through the list of file names and look for any that match a string of numbers, a dot and end (/(\d+)\.end/).
Because the numbers (\d) are in a capture group (parentheses), they are saved as $1. So the grep will then check whether that number is a multiple of 12 and, if it isn't, the file name will be returned. In other words, the array @bad holds the list of files to be deleted.
The list is then passed to unlink(), which removes files (but not directories).
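
If you want to preview the selection from Bash before letting unlink loose, the same test can be sketched with Bash's own regex operator. This is only a sketch: the function name would_delete is made up for illustration, [0-9]+ stands in for Perl's \d+, and 10# is needed because Bash (unlike Perl) would otherwise read the zero-padded digits as octal.

```bash
#!/bin/bash

# would_delete NAME...: print each name whose number before ".end"
# is not a multiple of 12 (hypothetical helper, prints only, deletes nothing).
would_delete() {
  local name
  for name in "$@"; do
    # The first capture group lands in BASH_REMATCH[1], like $1 in Perl.
    if [[ $name =~ ([0-9]+)\.end ]] && (( 10#${BASH_REMATCH[1]} % 12 != 0 )); then
      printf '%s\n' "$name"
    fi
  done
}

# Only file.00013.end is printed: 12 and 24 are multiples of 12,
# and notes.txt does not match the pattern at all.
would_delete file.00012.end file.00013.end file.00064.name.00024.end notes.txt
```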

Solution 2:

Given that your filenames are in the format file.00064.name.99999.end, we first need to trim away everything except our number. We'll use a for loop to do this.

We also need to tell the Bash shell to use base 10, because Bash arithmetic treats numbers beginning with a 0 as base 8 (octal), which will mess things up for us.
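
A quick way to see the problem, as a sketch with made-up sample values: a zero-padded number that happens to be valid octal silently gets the wrong value, and one containing an 8 or 9 is rejected outright.

```bash
#!/bin/bash

# Without a base prefix, a leading 0 means octal: 00024 is read as 2*8 + 4 = 20.
echo $(( 00024 % 12 ))      # prints 8, because 20 % 12 = 8

# With 10#, the same digits are read as decimal 24.
echo $(( 10#00024 % 12 ))   # prints 0

# Digits 8 and 9 are not valid octal, so Bash reports an error instead.
# The subshell keeps that error from affecting the rest of the script.
( echo $(( 00018 % 12 )) ) 2>/dev/null || echo "00018 is rejected without 10#"

echo $(( 10#00018 % 12 ))   # prints 6
```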

As a script, to be run from the directory containing the files:

#!/bin/bash

for f in ./*
do
  if [[ -f "$f" ]]; then
    file="${f%.*}"
    if [[ $((10#${file##*.} % 12)) -ne 0 ]]; then
      rm "$f"
    fi
  else
    echo "$f is not a file, skipping."
  fi
done

Or you can use this very long ugly command to do the same thing:

for f in ./* ; do if [[ -f "$f" ]]; then file="${f%.*}"; if [[ $((10#${file##*.} % 12)) -ne 0 ]]; then rm "$f"; fi; else echo "$f is not a file, skipping."; fi; done

To explain all of the parts:

for f in ./* means for everything in the current directory, do.... This sets each file or directory found as the variable $f.
if [[ -f "$f" ]] checks whether the item found is a file. If not, we skip to the echo "$f is not... part, which means we don't start deleting directories accidentally.
file="${f%.*}" sets the $file variable to the filename with everything after the last dot (the extension) trimmed off.
if [[ $((10#${file##*.} % 12)) -ne 0 ]] is where the main arithmetic kicks in. ${file##*.} trims everything up to and including the last . in our filename without extension. $(( $num % $num2 )) is the Bash arithmetic syntax for the modulo operation, and the 10# at the start tells Bash to use base 10, to deal with those pesky leading 0s. $((10#${file##*.} % 12)) then leaves us the remainder of our filename's number divided by 12. -ne 0 checks whether the remainder is "not equal" to zero.
If the remainder is not equal to 0, the file is deleted with the rm command. You may want to replace rm with echo when first running this, to check that you get the expected files to delete.
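
The two expansions can be traced on a single name. This is a small sketch; the sample filename and the helper name trailing_number are only for illustration.

```bash
#!/bin/bash

# trailing_number NAME: print the last dot-separated field before the extension
# (hypothetical helper that packages the two expansions explained above).
trailing_number() {
  local f=$1
  local file="${f%.*}"          # drop the shortest suffix matching ".*": the extension
  printf '%s\n' "${file##*.}"   # drop the longest prefix matching "*.": all but the number
}

f=./file.00064.name.00018.end
num=$(trailing_number "$f")
echo "number field: $num"                  # number field: 00018
echo "remainder:    $(( 10#$num % 12 ))"   # remainder:    6
```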

This solution is non-recursive, meaning that it will only process files in the current directory; it won't go into any sub-directories.

The if statement with the echo command to warn about directories is not really necessary, as rm on its own will complain about directories and not delete them, so:

#!/bin/bash

for f in ./*
do
  file="${f%.*}"
  if [[ $((10#${file##*.} % 12)) -ne 0 ]]; then
    rm "$f"
  fi
done

for f in ./* ; do file="${f%.*}"; if [[ $((10#${file##*.} % 12)) -ne 0 ]]; then rm "$f"; fi; done

Will work correctly too.
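
One thing the loop takes for granted is that every name in the directory ends in a numeric field followed by an extension; a stray notes.txt would make the arithmetic fail, and file.00013.txt would be deleted too. Here is a more defensive sketch, tried on throwaway files in a temporary directory. The function name prune_dir and its "dry run unless a command is given" behaviour are my own assumptions, not part of the scripts above.

```bash
#!/bin/bash

# prune_dir DIR [CMD...]: run CMD (default: echo, i.e. a dry run) on every
# regular file DIR/*.end whose last numeric field is not a multiple of 12.
# Hypothetical helper; anything that does not fit the pattern is left alone.
prune_dir() {
  local dir=$1; shift
  local -a cmd=("$@")
  (( ${#cmd[@]} )) || cmd=(echo)        # no command given: dry run with echo
  local f file num
  for f in "$dir"/*; do
    [[ -f $f ]] || continue             # skip directories and unmatched globs
    [[ $f == *.end ]] || continue       # only touch names with the .end extension
    file="${f%.*}"
    num="${file##*.}"
    case $num in
      ''|*[!0-9]*) continue ;;          # field is not purely digits: skip
    esac
    if (( 10#$num % 12 != 0 )); then
      "${cmd[@]}" "$f"
    fi
  done
}

tmp=$(mktemp -d)
touch "$tmp"/file.000{10..14}.end "$tmp"/notes.txt
prune_dir "$tmp"          # dry run: lists the files numbered 10, 11, 13 and 14
prune_dir "$tmp" rm --    # actually delete them
ls "$tmp"                 # file.00012.end and notes.txt remain
rm -r "$tmp"
```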

Solution 3:

You can use Bash brace expansion to generate names containing every 12th number. Let's create some test data:

$ touch file.{0..9}{0..9}{0..9}{0..9}{0..9}.end # create test data
$ mv file.00024.end file.00024.end.name.99999.end # testing this form of filenames

Then we can use the following:

$ ls 'file.'{00012..100..12}* # print these with numbers less than 100
file.00012.end                 file.00036.end  file.00060.end  file.00084.end
file.00024.end.name.99999.end  file.00048.end  file.00072.end  file.00096.end
$ rm 'file.'{00012..99996..12}* # do the job (a 6-digit upper bound would pad the names to 6 digits)

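This is hopelessly slow for a large number of files, though: it takes time and memory to generate thousands of names, so it's more of a trick than an actual efficient solution. Note also that the pattern matches the multiples of 12 themselves, which are the files the question wants to keep.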
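
To keep every 12th file, as the question asks, the generated names can serve as a keep list instead, with everything else removed. A minimal sketch on throwaway data, assuming the simple file.NNNNN.end naming and Bash 4 or newer; the keep array and the small range 00001..00060 are made up for the example.

```bash
#!/bin/bash

tmp=$(mktemp -d)
touch "$tmp"/file.{00001..00060}.end       # small test set

# Mark the names to keep in an associative array (needs Bash 4 or newer).
declare -A keep
for name in file.{00012..00060..12}.end; do
  keep[$name]=1
done

for f in "$tmp"/file.*.end; do
  name=${f##*/}                            # strip the directory part
  [[ ${keep[$name]} ]] || rm -- "$f"       # delete anything not marked
done

remaining=$(ls "$tmp")
rm -r "$tmp"
echo "$remaining"     # the five multiples of 12, from file.00012.end to file.00060.end
```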

Solution 4:

A little bit long, but it's what came to my mind.

 for num in $(seq 1 1 11) ; do                        # the remainders 1..11
     for sequence in $(seq -f %05g $num 12 99999) ; do
         rm "file.$sequence.end"                       # assumes names like file.00013.end
     done
 done

Explanation: Delete every 12th file eleven times, starting at 1, 2, ... 11, so that only the multiples of 12 are left.
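
To see what the two loops generate, here is a sketch that prints the names instead of deleting anything. It assumes a seq that supports -f (GNU coreutils does) and uses a small upper bound of 40 just to keep the output short; the function name names_to_delete is made up.

```bash
#!/bin/bash

# names_to_delete MAX: print file.NNNNN.end for every N in 1..MAX that is
# not a multiple of 12 (hypothetical helper, dry run only).
names_to_delete() {
  local max=$1 num sequence
  for num in $(seq 1 1 11); do                        # the 11 non-zero remainders
    for sequence in $(seq -f %05g "$num" 12 "$max"); do
      printf 'file.%s.end\n' "$sequence"
    done
  done
}

names_to_delete 40 | sort | head -n 3   # file.00001.end, file.00002.end, file.00003.end
names_to_delete 40 | wc -l              # 37 names: 40 numbers minus 12, 24 and 36
```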