How Does One Remove Duplicate Text Lines From Files Larger Than 4GB?

Solution 1:

sort -u file > outfile
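
If the sort you have is GNU sort (e.g. from Cygwin or a coreutils port), a couple of optional flags can make a multi-gigabyte run more manageable; this is only a sketch, and the temp directory path is an illustrative placeholder:

# Byte-wise comparison is faster than locale-aware collation;
# -S caps the in-memory buffer, -T points temp spill files at a drive with room,
# and -u drops duplicates during the merge.
LC_ALL=C sort -u -S 1G -T /path/to/tmpdir file > outfile

Note that the UnxUtils port may not support all of these GNU options.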

A handy native Win32 port of sort is included in UnxUtils.

For more complicated meanings of "remove duplicates", there is Perl (et al.).
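
For example, if "remove duplicates" means keeping only the first occurrence of each line while preserving the original order, a common Perl one-liner does it without sorting. This is just a sketch: it keeps a hash of every distinct line in memory, which may be a problem if most of the 4 GB is unique.

perl -ne 'print unless $seen{$_}++' file > outfile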

Solution 2:

If you have Cygwin or MinGW, you could probably accomplish this with

sort file | uniq > outfile

assuming you want unique lines. I don't know how well this will perform, since sorting a dataset that large will probably take a long time (although if the file is already sorted, you can leave that step out), nor exactly how these commands behave internally (whether or not they will try to hold all 4 GB in RAM).
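
For what it's worth, GNU sort normally does an external merge sort, spilling to temporary files rather than holding the whole file in memory. If you suspect the file is already sorted, you can check first and skip the sort entirely; a sketch, assuming GNU coreutils:

# Exits non-zero at the first out-of-order line.
sort -c file && echo "already sorted"
# If it is sorted, deduplicating adjacent lines is cheap:
uniq file > outfile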

Solution 3:

You can remove duplicate lines in a huge file with PilotEdit.