How can I remove duplicate files in a directory?
I downloaded a lot of images into a directory. The downloader renamed files that already existed. I also renamed some of the files manually.
a.jpg
b.jpg
b(2).jpg
hello.jpg <-- manually renamed from `b(3).jpg`
c.jpg
c(2).jpg
world.jpg <-- manually renamed from `d.jpg`
d(2).jpg
d(3).jpg
How can I remove the duplicated ones? The result should be:
a.jpg
b.jpg
c.jpg
world.jpg
Note: the name doesn't matter. I just want unique files.
bash 4.x
#!/bin/bash
declare -A arr
shopt -s globstar

for file in **; do
    [[ -f "$file" ]] || continue
    read cksm _ < <(md5sum "$file")
    if ((arr[$cksm]++)); then
        echo "rm $file"
    fi
done
This is both recursive and handles any file name. The downside is that it requires bash 4.x for associative arrays and recursive searching. Remove the echo if you like the results.
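If you would rather review the commands and then feed them to a shell, a small variant (still bash 4.x, and only a sketch) can shell-quote the paths with printf %q and note which copy is being kept:

declare -A keep
shopt -s globstar
for file in **; do
    [[ -f "$file" ]] || continue
    read -r cksm _ < <(md5sum "$file")
    if [[ -n "${keep[$cksm]}" ]]; then
        # %q quotes the path so the printed command survives spaces etc.
        printf 'rm -- %q   # duplicate of %s\n' "$file" "${keep[$cksm]}"
    else
        keep[$cksm]=$file
    fi
done

Pipe the output to bash once you have looked it over.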
gawk version
gawk '
{
    cmd="md5sum " q FILENAME q
    cmd | getline cksm
    close(cmd)
    sub(/ .*$/,"",cksm)
    if(a[cksm]++){
        cmd="echo rm " q FILENAME q
        system(cmd)
        close(cmd)
    }
    nextfile
}' q='"' *
Note that this will still break on files that have double quotes in their name. No real way to get around that with awk. Remove the echo if you like the results.
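An alternative that sidesteps the shell-quoting problem entirely is to let find/md5sum produce the hashes and use gawk only to pick the duplicate paths, passing them NUL-delimited to xargs (a sketch; GNU findutils/coreutils and gawk assumed, and names containing newlines would still need extra care because md5sum escapes them):

find . -type f -exec md5sum {} + \
    | gawk 'seen[$1]++ { sub(/^[^ ]+  /, ""); printf "%s\0", $0 }' \
    | xargs -0 -r echo rm --

As above, drop the echo once you are happy with what it prints.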
fdupes is the tool of your choice. To find all duplicate files (by content, not by name) in the current directory:
fdupes -r .
To manually confirm deletion of duplicated files:
fdupes -r -d .
To automatically delete all copies but the first of each duplicated file (warning: this actually deletes files, as requested):
fdupes -r -f . | grep -v '^$' | xargs rm -v
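If your fdupes build supports the -N/--noprompt option, it can be combined with -d to keep the first file of each set and delete the rest without the grep/xargs pipeline (check fdupes --help for your version first):

fdupes -r -d -N .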
I'd recommend manually checking the files before deletion:
fdupes -rf . | grep -v '^$' > files
... # check files
xargs -a files rm -v
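If any of the names contain spaces, plain xargs will split them apart; with GNU xargs the last step can be made delimiter-aware:

xargs -a files -d '\n' rm -v   # -d '\n' splits on newlines only, so spaces in names survive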
You can try FSLint. It has both a command-line and a GUI interface.
How can we test whether two files have the same content?
if diff "$file1" "$file2" > /dev/null; then
...
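As an aside, cmp -s does the same check without producing any output and stops at the first differing byte, which can be cheaper for large files:

if cmp -s "$file1" "$file2"; then
    echo "$file1 and $file2 have the same content"
fi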
How can we get the list of files in a directory?
files="$( find ${files_dir} -type f )"
We can take any two files from that list and check whether their names are different but their content is the same.
#!/bin/bash
# removeDuplicates.sh
files_dir=$1
if [[ -z "$files_dir" ]]; then
    echo "Error: files dir is undefined"
    exit 1
fi
files="$( find ${files_dir} -type f )"
for file1 in $files; do
    for file2 in $files; do
        # echo "checking $file1 and $file2"
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if diff "$file1" "$file2" > /dev/null; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done
For example, suppose we have some directory:
$> ls .tmp -1
all(2).txt
all.txt
file
text
text(2)
So there are only 3 unique files.
Let's run that script:
$> ./removeDuplicates.sh .tmp/
.tmp/text(2) and .tmp/text are duplicates
removed `.tmp/text'
.tmp/all.txt and .tmp/all(2).txt are duplicates
removed `.tmp/all(2).txt'
And we are left with only 3 files.
$> ls .tmp/ -1
all.txt
file
text(2)
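One caveat with the script above: the unquoted $files expansion splits on whitespace, so names containing spaces will break it. A rough sketch of the same pairwise idea with a NUL-delimited file list and cmp -s for the comparison (an untested variant, not a drop-in replacement):

#!/bin/bash
# Sketch: same pairwise comparison, but the file list is read
# NUL-delimited so names with spaces survive, and cmp -s does the
# byte-for-byte check quietly.
files_dir=${1:?usage: $0 <files_dir>}
files=()
while IFS= read -r -d '' f; do
    files+=("$f")
done < <(find "$files_dir" -type f -print0)
for file1 in "${files[@]}"; do
    for file2 in "${files[@]}"; do
        if [[ "$file1" != "$file2" && -e "$file1" && -e "$file2" ]]; then
            if cmp -s "$file1" "$file2"; then
                echo "$file1 and $file2 are duplicates"
                rm -v "$file2"
            fi
        fi
    done
done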
I wrote this tiny script to delete duplicated files:
https://gist.github.com/crodas/d16a16c2474602ad725b
Basically, it uses a temporary file (/tmp/list.txt) to create a map of files and their hashes. Later I use that file and the magic of Unix pipes to do the rest.
The script won't delete anything but will print the commands to delete files.
mfilter.sh ./dir | bash
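In outline, the approach is roughly the following (a simplified sketch of the idea described above, not the actual gist):

# Sketch only: hash everything into /tmp/list.txt, then print an rm
# command for every path whose hash has already been seen.
# cut -c35- strips the 32-character md5 hash plus the two separator
# spaces; paths with spaces would need quoting before piping to bash.
find "${1:-.}" -type f -exec md5sum {} + > /tmp/list.txt
awk 'seen[$1]++' /tmp/list.txt | cut -c35- | sed 's/^/rm /'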
Hope it helps