How to remove duplicate files in a directory?

I downloaded a lot of images into a directory.
The downloader renamed files that already existed.
I also renamed some of the files manually.

a.jpg
b.jpg
b(2).jpg
hello.jpg      <-- manually renamed `b(3).jpg`
c.jpg
c(2).jpg
world.jpg      <-- manually renamed `d.jpg`
d(2).jpg
d(3).jpg

How can I remove the duplicates? The result should be:

a.jpg
b.jpg
c.jpg
world.jpg

Note: the name doesn't matter. I just want unique files.


bash 4.x

#!/bin/bash
# Requires bash 4.x (associative arrays, globstar).
declare -A arr
shopt -s globstar

for file in **; do
  [[ -f "$file" ]] || continue

  # The first field of md5sum's output is the checksum.
  read cksm _ < <(md5sum "$file")
  if ((arr[$cksm]++)); then
    # Checksum already seen: this file is a duplicate.
    echo "rm $file"
  fi
done

This is both recursive and handles any file name. The downside is that it requires bash 4.x for associative arrays and recursive searching (globstar). Remove the echo if you like the results.
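
For instance, if you save it as dedupe.sh (a hypothetical name), you can do a dry run first and only then delete:

# Dry run: prints an "rm <file>" line for every duplicate, deletes nothing.
./dedupe.sh

# Once the list looks right, remove the echo and run again, or pipe the
# printed commands to a shell (only safe if no file name contains spaces
# or shell metacharacters, since the names are not quoted).
./dedupe.sh | bash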

gawk version

gawk '
  {
    # Checksum the current file (q is a double quote, passed in below).
    cmd="md5sum " q FILENAME q
    cmd | getline cksm
    close(cmd)
    sub(/ .*$/,"",cksm)
    # Checksum already seen: print an rm command for the duplicate.
    if(a[cksm]++){
      cmd="echo rm " q FILENAME q
      system(cmd)
      close(cmd)
    }
    nextfile
  }' q='"' *

Note that this will still break on files that have double quotes in their name. There is no real way to get around that with awk. Remove the echo if you like the results.
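
If the double-quote caveat matters for your files, one workaround (not pure awk; a sketch that assumes GNU coreutils and file names without newlines) is to hash everything in one pass and let a shell loop do the deleting:

# Print the names of files whose checksum has already been seen,
# then remove them one by one.
md5sum ./* | awk 'seen[$1]++ { sub(/^[^ ]+  /, ""); print }' |
  while IFS= read -r dup; do
    echo rm -v "$dup"    # drop the echo to actually delete
  done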


fdupes is the tool of choice. To find all duplicate files (by content, not by name) in the current directory, recursively:

fdupes -r .

To manually confirm deletion of duplicated files:

fdupes -r -d .

To automatically delete all copies but the first of each duplicated file (be warned: this actually deletes files, as requested):

fdupes -r -f . | grep -v '^$' | xargs rm -v

I'd recommend manually checking the files before deletion:

fdupes -rf . | grep -v '^$' > files
... # check files
xargs -a files rm -v

You can try FSLint. It has both a command-line and a GUI interface.
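
For example, on many distributions the FSLint command-line tools live under /usr/share/fslint/fslint rather than on $PATH (the path may differ on your system); findup is the duplicate finder:

# Adjust the path for your distribution.
/usr/share/fslint/fslint/findup /path/to/images    # list duplicate files

It also has options for deleting or hard-linking the duplicates it finds (see its help output), but the plain listing above is a safe starting point.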


How can we test whether two files have the same content?

if diff "$file1" "$file2" > /dev/null; then
    ...

How can we get the list of files in a directory?

files="$( find ${files_dir} -type f )"

We can take any two files from that list and check whether their names are different but their content is the same.

#!/bin/bash
# removeDuplicates.sh

files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined"
    exit 1
fi

# Note: word splitting below means file names containing whitespace are not handled.
files="$( find "${files_dir}" -type f )"
for file1 in $files; do
    for file2 in $files; do
        # echo "checking $file1 and $file2"
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done

For example, suppose we have this directory:

$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)

So there are only 3 unique files.

Let's run the script:

$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'

And we are left with only 3 files:

$> ls .tmp/ -1
all.txt
file
text(2)

I wrote this tiny script to delete duplicate files:

https://gist.github.com/crodas/d16a16c2474602ad725b

Basically, it uses a temporary file (/tmp/list.txt) to create a map of files and their hashes. Later, I use that file and the magic of Unix pipes to do the rest.

The script won't delete anything but will print the commands to delete files.

mfilter.sh ./dir | bash
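
If you just want the general idea without fetching the gist, the same hash-then-pipe approach can be sketched roughly like this (this is not the gist's exact code, and it assumes GNU md5sum plus file names without double quotes or newlines):

#!/bin/bash
# Hash every file under the given directory, keep the first occurrence
# of each checksum, and print an rm command for every later occurrence.
find "$1" -type f -exec md5sum {} + |
  sort |
  awk 'seen[$1]++ { sub(/^[^ ]+  /, ""); printf "rm -v \"%s\"\n", $0 }'

Review the printed commands, then pipe them to bash exactly as above.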

Hope it helps