Remove duplicate lines without sorting [duplicate]
The UNIX Bash Scripting blog suggests:
awk '!x[$0]++'
This command tells awk which lines to print. The variable $0 holds the entire contents of a line, and square brackets are array access. So, for each line of the file, the element of the array x keyed on that line is incremented, and the line is printed if that element was not (!) previously set.
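For example, piping some sample input (an illustrative made-up list, not from the blog) through the command keeps only the first occurrence of each line and preserves the original order:
printf 'apple\nbanana\napple\ncherry\nbanana\n' | awk '!x[$0]++'
This prints apple, banana, cherry.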
A late answer - I just ran into a duplicate of this - but perhaps worth adding...
The principle behind @1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:
cat -n file_name | sort -uk2 | sort -n | cut -f2-
- Use cat -n to prepend line numbers
- Use sort -u to remove duplicate data (-k2 says 'start at field 2 for the sort key')
- Use sort -n to sort by the prepended number
- Use cut to remove the line numbering (-f2- says 'select field 2 till end'); a small worked example follows
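For instance, on a hypothetical file containing the three lines b, a, b (in that order), the stages look like this:
- cat -n: "1 b", "2 a", "3 b"
- sort -uk2: "2 a", "1 b" (one copy of each distinct line is kept; here the earliest numbered copy, "1 b", survives)
- sort -n: "1 b", "2 a"
- cut -f2-: "b", "a"
That is, the first occurrence of each line, in the original order.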
To remove duplicates from 2 files:
awk '!a[$0]++' file1.csv file2.csv
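Note that awk reads the files in the order given, so the first occurrence across the concatenation of file1.csv and file2.csv is kept. To save the result, redirect to a new file (the name here is just an example):
awk '!a[$0]++' file1.csv file2.csv > deduped.csv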
Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian transform approach, which adds an index field with awk and then removes duplicates with multiple rounds of sort and uniq, involves less memory overhead. The following snippet works in bash:
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq --skip-fields 1 | sort -k1,1 -t$'\t' | cut -f2 -d$'\t'
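For readability, here is the same pipeline split across lines with comments; it is functionally equivalent and, like the one-liner, assumes GNU coreutils (for uniq --skip-fields):
awk '{print(NR"\t"$0)}' file_name |   # prepend the input line number, tab-separated
  sort -t$'\t' -k2,2 |                # sort by the original line content
  uniq --skip-fields 1 |              # drop duplicates, ignoring the line-number field
  sort -k1,1 -t$'\t' |                # restore the original order by line number
  cut -f2 -d$'\t'                     # strip the line-number field again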
Now you can check out this small tool written in Rust: uq.
It performs uniqueness filtering without having to sort the input first, so it can be applied to a continuous stream.
There are two advantages of this tool over the top-voted awk solution and other shell-based solutions:
- uq remembers the occurrence of lines using their hash values, so it doesn't use as much memory when the lines are long (the sketch after this list illustrates the idea).
- uq can keep the memory usage constant by setting a limit on the number of entries to store (when the limit is reached, there is a flag to control whether to override or to die), while the awk solution could run into OOM when there are too many lines.
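For the curious, the first point (remembering a fixed-size hash of each line instead of the line itself) can be sketched in plain bash. This is only an illustration of the principle, not how uq is implemented, and it is far slower because it forks md5sum for every line (bash 4+ is needed for associative arrays):
declare -A seen
while IFS= read -r line; do
  h=$(printf '%s' "$line" | md5sum)   # fixed-size digest, no matter how long the line is
  if [[ -z ${seen[$h]+x} ]]; then     # print only the first line seen with this digest
    seen[$h]=1
    printf '%s\n' "$line"
  fi
done
Reading from standard input, this prints each distinct line once, in order of first appearance.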