Find duplicates of a file by content

Solution 1:

First find the md5 hash of your file:

$ md5sum path/to/file
e740926ec3fce151a68abfbdac3787aa  path/to/file

(the first line is the command to execute; the second line shows the md5 hash followed by the file name)

Then copy the hash (it will be different in your case) and paste it into the next command:

$ find . -type f -print0 | xargs -0 md5sum | grep e740926ec3fce151a68abfbdac3787aa
e740926ec3fce151a68abfbdac3787aa  ./path/to/file
e740926ec3fce151a68abfbdac3787aa  ./path/to/other/file/with/same/content
....

If you want to get fancy, you can combine the two into a single command:

$ find . -type f -print0 | xargs -0 md5sum | grep `md5sum path/to/file | cut -d " " -f 1`
e740926ec3fce151a68abfbdac3787aa  ./path/to/file
e740926ec3fce151a68abfbdac3787aa  ./path/to/other/file/with/same/content
....

You could use sha1sum or any of the other hash utilities if you want.
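
For example, a sketch of the combined command above using sha256sum instead; the output format is the same, so only the utility name changes:

$ find . -type f -print0 | xargs -0 sha256sum | grep "$(sha256sum path/to/file | cut -d " " -f 1)"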

Edit

If the use case is to search through "several multi-gigabyte MP4s or iso-files" to find a "4 KB jpg" (as per @Tijn's answer), then specifying the file size will speed things up dramatically.

If the size of the file you are looking for is exactly 3952 bytes (you can see that using ls -l path/to/file), then this command will perform much faster:

$ find . -type f -size 3952c -print0 | xargs -0 md5sum | grep e740926ec3fce151a68abfbdac3787aa
e740926ec3fce151a68abfbdac3787aa  ./path/to/file
e740926ec3fce151a68abfbdac3787aa  ./path/to/other/file/with/same/content

Note the extra c after the size, indicating characters/bytes.
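
If you would rather not read the size off ls -l by eye, stat or wc can print just the byte count (stat -c %s assumes GNU coreutils; on BSD/macOS the equivalent is stat -f %z):

$ stat -c %s path/to/file
3952
$ wc -c < path/to/file
3952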

If you want, you can combine this into a single command:

FILE=./path/to/file && find . -type f -size $(du -b $FILE | cut -f1)c -print0 | xargs -0 md5sum | grep $(md5sum $FILE | cut -f1 -d " ")

Solution 2:

Use the diff command together with the shell's && and || operators:

bash-4.3$ diff /etc/passwd passwd_duplicate.txt > /dev/null && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
SAME CONTENT

bash-4.3$ diff /etc/passwd TESTFILE.txt > /dev/null && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
CONTENT DIFFERS
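
As an aside, cmp -s does the same job here without producing a diff, and it stops at the first differing byte, which can be faster on large files; for example, with the same illustrative files:

bash-4.3$ cmp -s /etc/passwd passwd_duplicate.txt && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
SAME CONTENT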

If you want to go over multiple files in a specific directory, cd there and use a for loop like so:

bash-4.3$ for file in * ; do  diff /etc/passwd "$file" > /dev/null && echo "$file has same contents" || echo "$file has different contents"; done
also-waste.txt has different contents
directory_cleaner.py has different contents
dontdeletethisfile.txt has different contents
dont-delete.txt has different contents
important.txt has different contents
list.txt has different contents
neverdeletethis.txt has different contents
never-used-it.txt has different contents
passwd_duplicate.txt has same contents

For recursive cases, use the find command to traverse the directory and all its subdirectories (mind the quotes; the file name is handed to sh as a positional parameter so names with spaces or other odd characters are handled safely):

bash-4.3$ find . -type f -exec sh -c 'diff /etc/passwd "$1" > /dev/null && echo "$1 same" || echo "$1 differs"' _ {} \;
./reallyimportantfile.txt differs
./dont-delete.txt differs
./directory_cleaner.py differs
./TESTFILE.txt differs
./dontdeletethisfile.txt differs
./neverdeletethis.txt differs
./important.txt differs
./passwd_duplicate.txt same
./this-can-be-deleted.txt differs
./also-waste.txt differs
./never-used-it.txt differs
./list.txt differs
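
If you only care about the matches, echo just the matching file name and drop the || branch; a sketch using cmp -s, with the output based on the files above:

bash-4.3$ find . -type f -exec sh -c 'cmp -s /etc/passwd "$1" && echo "$1"' _ {} \;
./passwd_duplicate.txt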

Solution 3:

You can use filecmp in Python

For example:

import filecmp 
print(filecmp.cmp('file1.png', 'file2.png', shallow=False))

This will print True if the contents are equal, otherwise False. The shallow=False forces a byte-by-byte comparison; by default filecmp.cmp only compares the os.stat() signatures (file type, size and modification time).
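
To find duplicates of one file across a whole directory tree with this approach, a minimal sketch (the reference path and the search root below are placeholders):

import filecmp
import os

reference = 'filename.png'   # file whose duplicates we want (placeholder)
root = '.'                   # directory tree to search (placeholder)

for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        candidate = os.path.join(dirpath, name)
        # shallow=False forces a byte-by-byte comparison instead of
        # only comparing os.stat() signatures
        if filecmp.cmp(reference, candidate, shallow=False):
            print(candidate)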

Solution 4:

Get the md5sum of the file in question and save it in a variable, e.g. md5:

md5=$(md5sum file.txt | awk '{print $1}')

Use find to traverse the desired directory tree and check whether any file has the same hash value; if so, print the file name:

find . -type f -exec sh -c '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] \
                             && echo "$1"' _ {} "$md5" \;
  • find . -type f finds all files under the current directory; change the directory to suit your needs

  • the -exec predicate executes the command sh -c ... on all files found

  • In sh -c, _ is a placeholder for $0, $1 is the file found, $2 is $md5

  • [ $(md5sum "$1"|awk "{print \$1}") = "$2" ] && echo "$1" prints the file name if its hash value is the same as the one we are checking duplicates for

Example:

% md5sum ../foo.txt bar.txt 
d41d8cd98f00b204e9800998ecf8427e  ../foo.txt
d41d8cd98f00b204e9800998ecf8427e  bar.txt

% md5=$(md5sum ../foo.txt | awk '{print $1}')

% find . -type f -exec sh -c '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] && echo "$1"' _ {} "$md5" \;
bar.txt
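
If you end up running this often, the whole thing can be wrapped in a small shell function; a minimal sketch (find_dupes is just an illustrative name, and it assumes the same md5sum/awk tooling as above):

find_dupes() {
    # $1 = reference file, $2 = directory to search (defaults to the current directory)
    local md5
    md5=$(md5sum "$1" | awk '{print $1}') || return
    find "${2:-.}" -type f -exec sh -c \
        '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] && echo "$1"' _ {} "$md5" \;
}

Calling find_dupes ../foo.txt . would then list every file under the current directory whose content matches ../foo.txt.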