Sort and merge 2 files without duplicate lines, based on the first column

I have a file with all the test names:

$ cat all_tests.txt
test1
test2
test3
test4
test5
test6

And another file containing the test names and their associated results:

$ cat completed_tests.txt
test1 Passed
test3 Failed
test5 Passed
test6 Passed

How can I create a new file containing all the test names with their associated results, without duplicates?

If I execute:

sort all_tests.txt completed_tests.txt

The output contains duplicates:

test1 
test1 Passed
test2
test3 
test3 Failed
test4
test5 
test5 Passed
test6 
test6 Passed

The desired output:

test1 Passed
test2
test3 Failed
test4
test5 Passed
test6 Passed

Solution 1:

It seems you can achieve this very easily with join if both files are already sorted.

$ join -a 1 all_tests.txt completed_tests.txt
test1 Passed
test2
test3 Failed
test4
test5 Passed
test6 Passed

-a 1 means also print unpairable lines from file 1, i.e. tests that had nothing joined to them.

If your files are not already sorted, you can use this (thanks terdon!):

join -a 1 <(sort all_tests.txt) <(sort completed_tests.txt)
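If you'd rather mark missing results explicitly instead of leaving an empty field, join's -e and -o options can fill in a placeholder. This is a sketch; the marker string NORESULT is just an arbitrary choice for illustration:

```shell
# -o '0,2.2' prints the join field plus the result column from file 2;
# -e supplies the text used when that field is empty (unpaired lines).
join -a 1 -e 'NORESULT' -o '0,2.2' <(sort all_tests.txt) <(sort completed_tests.txt)
```

With the sample files this prints `test2 NORESULT` and `test4 NORESULT` for the tests that have no recorded result.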

Solution 2:

The right tool here is join as suggested by @Zanna, but here's an awk approach:

$ awk 'NR==FNR{a[$1]=$2; next}{print $1,a[$1]}' completed_tests.txt all_tests.txt 
test1 Passed
test2 
test3 Failed
test4 
test5 Passed
test6 Passed
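The first pass (NR==FNR, true only while reading the first file) caches each test's result in the array a; the second pass prints each test name followed by its cached result. If you prefer a visible marker over a trailing empty field, a small variation with the `in` operator works; the "-" placeholder here is just an illustrative choice:

```shell
# Pass 1 (NR==FNR): cache result per test name from completed_tests.txt.
# Pass 2: look each test up, falling back to "-" when no result exists.
awk 'NR==FNR{a[$1]=$2; next}{print $1, (($1 in a) ? a[$1] : "-")}' completed_tests.txt all_tests.txt
```

This prints `test2 -` and `test4 -` for the tests without results.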

Solution 3:

Perl

Effectively, this is a port of terdon's answer:

$ perl -lane '$t+=1; $h{$F[0]}=$F[1] if $.==$t; print $F[0]," ",$h{$F[0]} if $t!=$.;$.=0 if eof' completed_tests.txt all_tests.txt          
test1 Passed
test2 
test3 Failed
test4 
test5 Passed
test6 Passed

This works by building a hash of test-status pairs from completed_tests.txt and then looking up the lines of all_tests.txt in that hash. The $t variable counts the total lines processed across both files, while $. (which is reset upon reaching the end of each file) counts lines in the current file; comparing them tells us which file is currently being read.