Deleting duplicate lines in a text file?

How can I delete duplicate lines in a text file via command prompt?

For example: I have a 10 MB text file and I want to keep only one occurrence of the line My line, but somewhere in the file My line appears twice.


Using awk

awk '!x[$0]++' infile.txt > outfile.txt

The way it works is that awk keeps a count of each line in an array. If the current count is zero, i.e. this is the first occurrence, it prints the line; otherwise it continues to the next one.
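For instance (a minimal demonstration with made-up input), the counter is tested before it is incremented, so only the first occurrence of each line passes the filter:

```shell
# Each line is used as an array key; the value before the ++ decides
# whether the line is printed (0 on first sight, nonzero afterwards).
printf 'My line\nanother line\nMy line\n' | awk '!x[$0]++'
# prints:
# My line
# another line
```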


There are multiple ways to do this. If ordering is not important, sort and uniq are the easiest to remember. However, if you want to keep the original order of the file while deleting duplicates, awk does the trick. You can also use sed, I believe.

Here is an example

/tmp/debugSys>cat fileWithDupText.txt 
line2
line21
line2
line1
line2
/tmp/debugSys>

/tmp/debugSys>cat fileWithDupText.txt | awk '!a[$0]++' 
line2
line21
line1
/tmp/debugSys>sort fileWithDupText.txt | uniq
line1
line2
line21
/tmp/debugSys>sort -u fileWithDupText.txt 
line1
line2
line21
/tmp/debugSys>
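For completeness, here is the sed route mentioned above. This is the classic one-liner from the well-known "sed one-liners" collection; it accumulates every line seen so far in the hold space, so it is slow on large files, and it assumes lines of printable ASCII and a sed (such as GNU sed) whose `.` matches embedded newlines:

```shell
# -n suppresses default output. G appends the hold space (the history of
# lines already printed); the regex deletes the pattern space when the
# current line already appears in that history; otherwise h saves the
# updated history and P prints the current line.
printf 'line2\nline21\nline2\nline1\nline2\n' |
sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
```

This prints line2, line21, line1 for the sample file, i.e. the same order-preserving result as the awk version.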

Found a nice Perl one-liner for that, using MD5 hashes ;), but it is slow and only worth it if you have very long lines and a huge file, where hashing the lines greatly reduces memory use:

perl -MDigest::MD5 -ne '$seen{Digest::MD5::md5($_)}++ or print' foo

For ordinary files, simply use

perl -ne '$seen{$_}++ or print' foo

Source


Example

cat foo

foo
fii
foo bar
foobar
foobar
foo

perl -ne '$seen{$_}++ or print' foo

foo
fii
foo bar
foobar
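A common follow-up: none of these one-liners can safely read and write the same file in one step, because the shell truncates the output file before the command reads it. To deduplicate in place, write to a temporary file and move it back (a sketch; the file names are placeholders):

```shell
# Deduplicate foo in place while preserving order; the && ensures the
# original file is only replaced if awk succeeded.
awk '!seen[$0]++' foo > foo.tmp && mv foo.tmp foo
```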