Find duplicates of a file by content
Solution 1:
First find the md5 hash of your file:
$ md5sum path/to/file
e740926ec3fce151a68abfbdac3787aa path/to/file
(the first line is the command you need to execute, the second line is the md5 hash of that file)
Then copy the hash (it would be different in your case) and paste it into the next command:
$ find . -type f -print0 | xargs -0 md5sum | grep e740926ec3fce151a68abfbdac3787aa
e740926ec3fce151a68abfbdac3787aa ./path/to/file
e740926ec3fce151a68abfbdac3787aa ./path/to/other/file/with/same/content
....
If you want to get fancy you could combine the two into a single command:
$ find . -type f -print0 | xargs -0 md5sum | grep "$(md5sum path/to/file | cut -d " " -f 1)"
e740926ec3fce151a68abfbdac3787aa ./path/to/file
e740926ec3fce151a68abfbdac3787aa ./path/to/other/file/with/same/content
....
You could use sha1sum, or another hash tool such as sha256sum, in place of md5sum if you want.
Edit
If the use case is to search through "several multi-gigabyte MP4s or iso-files" to find a "4 KB jpg" (as per @Tijn's answer), then specifying the file size would speed things up dramatically.
If the size of the file you are looking for is exactly 3952 bytes (you can see that using ls -l path/to/file), then this command would perform much faster:
$ find . -type f -size 3952c -print0 | xargs -0 md5sum | grep e740926ec3fce151a68abfbdac3787aa
e740926ec3fce151a68abfbdac3787aa ./path/to/file
e740926ec3fce151a68abfbdac3787aa ./path/to/other/file/with/same/content
Note the extra c after the size, indicating characters/bytes.
If you want, you could combine this into a single command (the quotes keep it working for paths that contain spaces):
$ FILE=./path/to/file && find . -type f -size "$(du -b "$FILE" | cut -f1)c" -print0 | xargs -0 md5sum | grep "$(md5sum "$FILE" | cut -f1 -d " ")"
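The same size-then-hash idea can be sketched in Python, for cases where a script is easier to maintain than a pipeline. This is only a minimal sketch, not part of the original answer: the function names file_digest and find_duplicates, the chunk size, and the algorithm parameter are all made up for illustration.

```python
import hashlib
import os


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(target, root=".", algorithm="md5"):
    """Yield paths under root whose content hash matches target's.

    Hypothetical helper: names and defaults are assumptions.
    """
    target_size = os.path.getsize(target)
    target_hash = file_digest(target, algorithm)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file,
            # roughly matching `find -type f`.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            # Cheap size check first; hash only same-sized candidates.
            if os.path.getsize(path) != target_size:
                continue
            if file_digest(path, algorithm) == target_hash:
                yield path
```

Something like `for path in find_duplicates("path/to/file"): print(path)` would list the matches. As with the shell version, the file you searched for is itself listed if it lives under the search root.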
Solution 2:
Use the diff command with the boolean operators && and ||:
bash-4.3$ diff /etc/passwd passwd_duplicate.txt > /dev/null && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
SAME CONTENT
bash-4.3$ diff /etc/passwd TESTFILE.txt > /dev/null && echo "SAME CONTENT" || echo "CONTENT DIFFERS"
CONTENT DIFFERS
If you want to go over multiple files in a specific directory, cd there and use a for loop like so:
bash-4.3$ for file in * ; do diff /etc/passwd "$file" > /dev/null && echo "$file has same contents" || echo "$file has different contents"; done
also-waste.txt has different contents
directory_cleaner.py has different contents
dontdeletethisfile.txt has different contents
dont-delete.txt has different contents
important.txt has different contents
list.txt has different contents
neverdeletethis.txt has different contents
never-used-it.txt has different contents
passwd_dulicate.txt has same contents
For recursive cases, use the find command to traverse the directory and all its subdirectories (mind the quotes and all the appropriate slashes):
bash-4.3$ find . -type f -exec sh -c 'diff /etc/passwd "$1" > /dev/null && echo "$1 same" || echo "$1 differs"' _ {} \;
./reallyimportantfile.txt differs
./dont-delete.txt differs
./directory_cleaner.py differs
./TESTFILE.txt differs
./dontdeletethisfile.txt differs
./neverdeletethis.txt differs
./important.txt differs
./passwd_dulicate.txt same
./this-can-be-deleted.txt differs
./also-waste.txt differs
./never-used-it.txt differs
./list.txt differs
Solution 3:
You can use the filecmp module in Python.
For example:
import filecmp
print(filecmp.cmp('filename.png', 'other_filename.png'))
This prints True if the two files are equal, otherwise False.
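To look for copies of one file across a whole directory tree, the filecmp call can be wrapped in an os.walk loop. The following is a rough sketch under assumptions of my own: the helper name find_identical and its parameters are hypothetical. It passes shallow=False because the default shallow comparison treats two files as equal when their os.stat() signatures (type, size, modification time) match, without reading the contents.

```python
import filecmp
import os


def find_identical(target, root="."):
    """Return files under root whose bytes are identical to target's.

    Hypothetical helper: the name and signature are assumptions.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Ignore broken symlinks, sockets, and other non-regular files.
            if not os.path.isfile(path):
                continue
            # shallow=False forces a content comparison instead of
            # trusting matching os.stat() signatures.
            if filecmp.cmp(target, path, shallow=False):
                matches.append(path)
    return matches
```

Calling `find_identical('filename.png', '.')` would return a list of matching paths, including filename.png itself if it is under the search root.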
Solution 4:
Get the md5sum of the file in question and save it in a variable, e.g. md5:
md5=$(md5sum file.txt | awk '{print $1}')
Use find to traverse the desired directory tree and check whether any file has the same hash value; if so, print the file name:
find . -type f -exec sh -c '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] \
&& echo "$1"' _ {} "$md5" \;
find . -type f finds all files in the current directory; change the directory to meet your need
The -exec predicate executes the command sh -c ... on all files found
In sh -c, _ is a placeholder for $0, $1 is the file found, and $2 is $md5
[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] && echo "$1" prints the filename if the hash value of the file is the same as the one we are checking duplicates for
Example:
% md5sum ../foo.txt bar.txt
d41d8cd98f00b204e9800998ecf8427e ../foo.txt
d41d8cd98f00b204e9800998ecf8427e bar.txt
% md5=$(md5sum ../foo.txt | awk '{print $1}')
% find . -type f -exec sh -c '[ "$(md5sum "$1" | awk "{print \$1}")" = "$2" ] && echo "$1"' _ {} "$md5" \;
./bar.txt