Combine two files with awk

File1.txt

item1   carA
item2   carB
item3   carC
item4   platD
item5   carE

File2.txt

carA  platA
carB  platB
carC  platC
carE  platE

Wanted output:

item1   platA
item2   platB
item3   platC
item4   platD
item5   platE

How can I do it?


Solution 1:

The below answer is based on a similar Q&A in SO with some relevant modifications:

$ awk 'FNR==NR {dict[$1]=$2; next} {$2=($2 in dict) ? dict[$2] : $2}1' file2.txt file1.txt 
item1 platA
item2 platB
item3 platC
item4 platD
item5 platE

The idea is to create a hash-map with index, and use it as dictionary.

For the 2nd question you asked in your comment (what should be changed if the second column of file1.txt will be the sixth column):

If the input file will be like file1b.txt :

item1 A5 B C D carA
item2 A4 1 2 3 carB
item3 A3 2 3 4 carC
item4 A2 4 5 6 platD
item5 A1 7 8 9 carE

The following command will do it:

$ awk 'FNR==NR {dict[$1]=$2; next} {$2=($6 in dict) ? dict[$6] : $6;$3="";$4="";$5="";$6=""}1' file2.txt file1b.txt 
item1 platA    
item2 platB    
item3 platC    
item4 platD    
item5 platE    

Solution 2:

I know you said awk, but there is a join command for this purpose...

{
  join -o 1.1,2.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt)     
  join -v 1 -o 1.1,1.2 -1 2 -2 1 <(sort -k 2 File1.txt) <(sort -k 1 File2.txt) 
} | sort -k 1

It'd be sufficient with the first join command if it wasn't for this line:

item4   platD

The command basically says: join based on the second column of the first file (-1 2), and the first column of the second file (-2 1), and output the first column of the first file and the second column of the second file (-o 1.1,2.2). That only shows the lines that paired. The second join command says almost the same thing, but it says to show the lines from the first file that couldn't be paired (-v 1) , and output the first column of the first file and the second column of the first file (-o 1.1,1.2). Then we sort the output of both combined. sort -k 1 means sort based on the first column, and sort -k 2 means to sort based on the second. It's important to sort the files based on the join column before passing them to join.

Now, I wrote the sorting twice, because I don't like to litter my directories with files if I can help it. However, like David Foerster said, depending on the size of the files, you might want to sort the files and save them first to not have wait to sort each twice. To give an idea of sizes, here is the time it takes to sort 1 million and 10 million lines on my computer:

$ ruby -e '(1..1000000).each {|i| puts "item#{i}   plat#{i}"}' | shuf > 1million.txt 
$ ruby -e '(1..10000000).each {|i| puts "item#{i}   plat#{i}"}' | shuf > 10million.txt 
$ head 10million.txt 
item530284   plat530284
item7946579   plat7946579
item1521735   plat1521735
item9762844   plat9762844
item2289811   plat2289811
item6878181   plat6878181
item7957075   plat7957075
item2527811   plat2527811
item5940907   plat5940907
item3289494   plat3289494
$ TIMEFORMAT=%E
$ time sort 1million.txt >/dev/null
1.547
$ time sort 10million.txt >/dev/null
19.187

That's 1.5 seconds for 1 million lines, and 19 seconds for 10 million lines.