merging tab delimited txt files based on column (which is header ) in bash?

Im having two text files which contains million records , all the records are tab delimited , how we can merge these two files based on same header(column)?

file:1

    LogEntryTime              nameId       PartnerId        
    2021-06-05T15:00:53 07    5lsddf        qyutxwr 
        
        

file:2

        nameId  GroupId  compnayId
        5lsddf  l4buafm   0rd33cs               
    

output like this:

    LogEntryTime              nameId       PartnerId    GroupId  compnayId
    2021-06-05T15:00:53 07    5lsddf        qyutxwr     l4buafm   0rd33cs

Tried this but not working:

paste file1.txt file2.txt | nameId -s $'\t' -t

and

cat file1.txt file2.txt |  awk -F '\t' '{print $ list the all columns name here}'

awk one which is working but need to mention all the column numbers there.

is there any other solution help me out.

thanks in advance


Solution 1:

If your files are properly constructed tab separated (TSV) files, then you can use csvjoin from the Python-based csvkit package.

Ex. given:

$ head file1.tsv file2.tsv | cat -A
==> file1.tsv <==$
LogEntryTime^InameId^IPartnerId$
2021-06-05T15:00:53 07^I5lsddf^Iqyutxwr$
$
==> file2.tsv <==$
nameId^IGroupId^IcompnayId$
5lsddf^Il4buafm^I0rd33cs$

(cat -A to make the tabs visible, as ^I) then

$ csvjoin -I -t -c nameId file1.tsv file2.tsv
LogEntryTime,nameId,PartnerId,GroupId,compnayId
2021-06-05T15:00:53 07,5lsddf,qyutxwr,l4buafm,0rd33cs

To get the output back in TSV format, use csvformat from the same package:

$ csvjoin -I -t -c nameId file1.tsv file2.tsv | csvformat -T
LogEntryTime    nameId  PartnerId       GroupId compnayId
2021-06-05T15:00:53 07  5lsddf  qyutxwr l4buafm 0rd33cs

Note that -I disables type inference - which can sometimes behave unexpectedly, especially with datetime fields.


Even simpler, using Miller (available from the universe repository, as package miller):

$ mlr --tsv join -f file1.tsv -j nameId then reorder -f LogEntryTime file2.tsv
LogEntryTime    nameId  PartnerId       GroupId compnayId
2021-06-05T15:00:53 07  5lsddf  qyutxwr l4buafm 0rd33cs

The reorder is necessary because by default mlr join outputs the common field first (just like the system join command). Note that for unsorted input, the whole of file1.tsv will be loaded into memory.

Solution 2:

Loop one of the files into an array and replace the first field of the second file (which is nameId) with the array index that correlates to the common field.

awk -F \\t+ -vOFS=\\t 'NR==FNR{a[$2]=$0;next} {$1=a[$1]}1' file{1,2}.txt

Solution 3:

With this particular set of data:

awk '
    BEGIN {FS = OFS = "\t"}
    NR == FNR {f1[$2] = $0; next}
    {$1 = f1[$1]; print}
' file{1,2}.txt

Only the join field ($2 in file1, $1 in file2) is mentioned.

Produces the tab-separated output

LogEntryTime    nameId  PartnerId   GroupId compnayId
2021-06-05T15:00:53 07  5lsddf  qyutxwr l4buafm 0rd33cs

For pretty output, pipe into | column -t -s $'\t' to get

LogEntryTime            nameId  PartnerId  GroupId  compnayId
2021-06-05T15:00:53 07  5lsddf  qyutxwr    l4buafm  0rd33cs