Joining multiple fields in text files on Unix

Here is the correct answer (in terms of using standard GNU coreutils tools, and not writing custom script in perl/awk you name it).

$ join -j1 -o1.2,1.3,1.4,1.5,2.5 <(<file1 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1) <(<file2 awk '{print $1"-"$2"-"$3" "$0}' | sort -k1,1)
bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5

OK, how does it work:

  1. First of all we will use a great tool join which can merge two lines. join has two requirements:

    • We can join only by a single field.
    • Both files must be sorted by key column!
  2. We need to generate keys in input files and for that we use a simple awk script:

    $ cat file1
    foo 1 scaf 3
    bar 2 scaf 3.3    
    
    $ <file1 awk '{print $1"-"$2"-"$3" "$0}'
    foo-1-scaf foo 1 scaf 3
    bar-2-scaf bar 2 scaf 3.3
    

    You see, we added 1st column with some key like "foo-1-scaf". We do the same with file2. BTW. <file awk, is just fancy way of writing awk file, or cat file | awk.

    We also should sort our files by the key, in our case this is column 1, so we add to the end of the command the | sort -k1,1 (sort by text from column 1 to column 1)

  3. At this point we could just generate files file1.with.key and file2.with.key and join them, but suppose those file are huge, we don't want to copy them over filesystem. Instead we can use something called bash process substitution to generate output into named pipe (this will avoid any unnecessary intermediate file creation). For more info please read the provided link.

    Our target syntax is: join <( some command ) <(some other command)

  4. The last thing is to explain fancy join arguments: -j1 -o1.2,1.3,1.4,1.5,2.5

    • -j1 - join by key in 1st column (in both files)
    • -o - output only those fields 1.2 (1st file field2), 1.3 (1st file column 3), etc.

      This way we joined lines, but join outputs only the necessary columns.

The lessons learned from this post should be:

  • you should master the coreutils package, those tools are very powerful when combined and you almost never need to write custom program to deal with such cases,
  • core utils tools are also blazing fast and heavily tested, so they are always best choice.

The join command is hard to use and only joins on one column

Extensive experimentation plus close scrutiny of the manual pages indicates that you cannot directly join multiple columns - and all my working examples of join, funnily enough, use just one joining column.

Consequently, any solution will require the columns-to-be-joined to be concatenated into one column, somehow. The standard join command also requires its inputs to be in the correct sorted order - there's a remark in the GNU join (info coreutils join) about it not always requiring sorted data:

However, as a GNU extension, if the input has no unpairable lines the sort order can be any order that considers two fields to be equal if and only if the sort comparison described above considers them to be equal.

One possible way to do it with the given files is:

awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file1 |
sort > sort1
awk '{printf("%s:%s:%s %s %s %s %s\n", $1, $2, $3, $1, $2, $3, $4);}' file2 |
sort > sort2
join -1 1 -2 1 -o 1.2,1.3,1.4,1.5,2.5 sort1 sort2

This creates a composite sort field at the start, using ':' to separate the sub-fields, and then sorts the file - for each of two files. The join command then joins on the two composite fields, but prints out only the non-composite (non-join) fields.

The output is:

bar 2 scaf 3.3 1.00
foo 1 scaf 3 4.5

Failed attempts to make join do what it won't do

join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2

On MacOS X 10.6.3, this gives:

$ cat file1
foo 1 scaf 3 
bar 2 scaf 3.3
$ cat file2
foo 1 scaf 4.5
foo 1 boo 2.3
bar 2 scaf 1.00
$ join -1 1 -2 1 -1 2 -2 2 -1 3 -2 3 -o 1.1,1.2,1.3,1.4,2.4 file1 file2
foo 1 scaf 3 4.5 
bar 2 scaf 3.3 4.5 
$

This is joining on field 3 (only) - which is not what is wanted.

You do need to ensure that the input files are in the correct sorted order.