Command that will only print value once although it appears many times
I have a big text file in which values repeat many times. Is there a command that will go through the file and print each value only once, even if it appears again later?
SO4
HOH
CL
BME
HOH
SO4
HOH
CL
BME
HOH
SO4
HOH
SO4
HOH
CL
BME
HOH
SO4
HOH
CL
BME
HOH
CL
So it should look something like this:
SO4
HOH
CL
BME
The thing is that I have a huge number of different values, so I can't do it manually like here.
Solution 1:
If you want to keep the output lines in the same order as the input lines, use:
$ awk '!a[$0]++' file
SO4
HOH
CL
BME
How it works:
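This uses the associative array a to count the number of times each line has been seen so far. The expression !a[$0]++ is true only while the count for the current line is still zero, that is, the first time the line appears, so each line is printed only on its first occurrence.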
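The one-liner can be unpacked into a longer form that makes the logic visible. This is only a sketch, not the answerer's code: the function name dedup_keep_order, the array name seen, and the sample input are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper (the name is illustrative): print each distinct line
# once, in order of first appearance. Same effect as: awk '!a[$0]++'
dedup_keep_order() {
    awk '{
        # seen[$0] is unset (compares equal to 0) the first time a line shows up
        if (seen[$0] == 0) print $0
        # remember this line so that later copies are skipped
        seen[$0]++
    }' "$@"
}

# Sample input taken from the first lines of the question
result=$(printf 'SO4\nHOH\nCL\nBME\nHOH\nSO4\n' | dedup_keep_order)
printf '%s\n' "$result"
```

With no arguments the function reads standard input; given file names, it passes them to awk, so dedup_keep_order file should behave like the one-liner above.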
Solution 2:
You could use the command sort with the option --unique:
sort -u input-file
If you want to write the result to FILE instead of standard output, use the option --output=FILE:
sort -u input-file -o output-file
The command uniq could also be used. In this case the identical lines must be consecutive, so the input must be sorted first (thanks to @RonJohn for this note):
sort input-file | uniq > output-file
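To see why uniq needs sorted input, it can be compared with and without a preceding sort. This is a small sketch; the sample data and the variable names are made up for illustration.

```shell
#!/bin/sh
# Illustrative sample: SO4 and HOH both repeat, but only the HOH lines are adjacent
sample=$(printf 'SO4\nHOH\nHOH\nSO4\nCL\n')

# uniq alone collapses only adjacent repeats, so the second SO4 survives
unsorted_result=$(printf '%s\n' "$sample" | uniq)

# after sorting, identical lines are adjacent and uniq removes every repeat
sorted_result=$(printf '%s\n' "$sample" | sort | uniq)

# sort -u gives the same result in a single step
sort_u_result=$(printf '%s\n' "$sample" | sort -u)

printf 'uniq only:\n%s\n\n' "$unsorted_result"
printf 'sort | uniq:\n%s\n\n' "$sorted_result"
printf 'sort -u:\n%s\n' "$sort_u_result"
```

Note that both sort-based variants return the lines in sorted order, not in the order they first appeared in the input.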
I like the sort command for cases like this because of its simplicity, but if you work with large inputs, the awk approach from John1024's answer can be considerably faster. Here is a time comparison of the approaches mentioned, applied to a file (based on the above example) with 20 million lines, as reported by wc -l:
$ cat input-file | wc -l
20000000
$ TIMEFORMAT=%R
$ time sort -u input-file | wc -l
64
7.495
$ time sort input-file | uniq | wc -l
64
7.703
$ time awk '!a[$0]++' input-file | wc -l # from John1024's answer
64
1.271
$ time datamash rmdup 1 < input-file | wc -l # from αғsнιη's answer
64
0.770
Another significant difference is the one mentioned by @Ruslan: sort -u will only print the result once the input has ended, while this awk command will print each new result line on the fly (this may matter more for piped input than for a file).
Here is an illustration: the loop shown below generates 500 random combinations, each three characters long, of the letters A-D. These combinations are piped to awk or sort.
for i in {1..500}; do cat /dev/urandom | tr -dc A-D | head -c 3; echo; done
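The streaming difference is easier to observe with a deliberately slow producer. In the sketch below, slow_producer is a made-up function that stands in for the loop; the one-second pause is an arbitrary choice. When run in a terminal, awk would typically show its first line before the pause, though the exact behaviour depends on the awk implementation and on output buffering (output sent into another pipe is usually buffered).

```shell
#!/bin/sh
# Hypothetical slow producer: emits one line, pauses, then emits two more
slow_producer() {
    echo AAA
    sleep 1
    echo AAA
    echo BBB
}

# awk can print AAA as soon as it is read, and BBB after the pause
echo "awk:"
slow_producer | awk '!a[$0]++'

# sort -u has to read all of its input before it can print anything
echo "sort -u:"
slow_producer | sort -u
```

Both pipelines end up printing the same two lines here; only the moment at which the lines appear differs.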