Sort and merge 2 files without duplicate lines, based on the first column
I have a file with all the test names:
$ cat all_tests.txt
test1
test2
test3
test4
test5
test6
And another file containing the test names and the associated results:
$ cat completed_tests.txt
test1 Passed
test3 Failed
test5 Passed
test6 Passed
How can I create a new file containing all the test names with their associated results, without duplicates?
If I execute:
sort all_tests.txt completed_tests.txt
The output contains duplicates:
test1
test1 Passed
test2
test3
test3 Failed
test4
test5
test5 Passed
test6
test6 Passed
The desired output:
test1 Passed
test2
test3 Failed
test4
test5 Passed
test6 Passed
Solution 1:
Seems like you can achieve this very easily with join if the files are both sorted.
$ join -a 1 all_tests.txt completed_tests.txt
test1 Passed
test2
test3 Failed
test4
test5 Passed
test6 Passed
-a 1 means print lines from file 1 that had nothing joined to them.
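For contrast, without -a 1, join performs an inner join and silently drops the test names that have no result:
$ join all_tests.txt completed_tests.txt
test1 Passed
test3 Failed
test5 Passed
test6 Passed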
If your files are not already sorted, you can use this (thanks terdon!):
join -a 1 <(sort all_tests.txt) <(sort completed_tests.txt)
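Since the goal is a new file, you can redirect the output (merged_results.txt is just a placeholder name; note that the <(...) process substitution requires a shell such as bash):
$ join -a 1 <(sort all_tests.txt) <(sort completed_tests.txt) > merged_results.txt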
Solution 2:
The right tool here is join, as suggested by @Zanna, but here's an awk approach:
$ awk 'NR==FNR{a[$1]=$2; next}{print $1,a[$1]}' completed_tests.txt all_tests.txt
test1 Passed
test2
test3 Failed
test4
test5 Passed
test6 Passed
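For readability, the same one-liner can be laid out with comments (functionally identical):
awk '
    NR == FNR {          # true only while reading the first file
        a[$1] = $2       # remember the result keyed by test name
        next             # do not fall through to the print block
    }
    { print $1, a[$1] }  # second file: print name plus stored result (empty if none)
' completed_tests.txt all_tests.txt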
Solution 3:
Perl
Effectively, this is a port of terdon's answer:
$ perl -lane '$t+=1; $h{$F[0]}=$F[1] if $.==$t; print $F[0]," ",$h{$F[0]} if $t!=$.;$.=0 if eof' completed_tests.txt all_tests.txt
test1 Passed
test2
test3 Failed
test4
test5 Passed
test6 Passed
This works by building a hash of test-status pairs from completed_tests.txt and then looking up the lines of all_tests.txt in that hash. The $t variable holds the total number of lines processed across both files, while $. is explicitly reset to 0 upon reaching the end of a file, so it only counts lines within the file currently being read; comparing the two tells us which file is being processed.
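Laid out with comments, the one-liner reads as follows (equivalent behaviour):
perl -lane '
    $t += 1;                                  # total lines seen across both files
    $h{$F[0]} = $F[1] if $. == $t;            # first file only: $. still matches $t
    print $F[0], " ", $h{$F[0]} if $t != $.;  # second file: look up the stored result
    $. = 0 if eof;                            # reset the per-file counter at end of file
' completed_tests.txt all_tests.txt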