merging tab delimited txt files based on column (which is header ) in bash?
Im having two text files which contains million records , all the records are tab delimited , how we can merge these two files based on same header(column)?
file:1
LogEntryTime nameId PartnerId
2021-06-05T15:00:53 07 5lsddf qyutxwr
file:2
nameId GroupId compnayId
5lsddf l4buafm 0rd33cs
output like this:
LogEntryTime nameId PartnerId GroupId compnayId
2021-06-05T15:00:53 07 5lsddf qyutxwr l4buafm 0rd33cs
Tried this but not working:
paste file1.txt file2.txt | nameId -s $'\t' -t
and
cat file1.txt file2.txt | awk -F '\t' '{print $ list the all columns name here}'
awk one which is working but need to mention all the column numbers there.
is there any other solution help me out.
thanks in advance
Solution 1:
If your files are properly constructed tab separated (TSV) files, then you can use csvjoin
from the Python-based csvkit
package.
Ex. given:
$ head file1.tsv file2.tsv | cat -A
==> file1.tsv <==$
LogEntryTime^InameId^IPartnerId$
2021-06-05T15:00:53 07^I5lsddf^Iqyutxwr$
$
==> file2.tsv <==$
nameId^IGroupId^IcompnayId$
5lsddf^Il4buafm^I0rd33cs$
(cat -A
to make the tabs visible, as ^I
) then
$ csvjoin -I -t -c nameId file1.tsv file2.tsv
LogEntryTime,nameId,PartnerId,GroupId,compnayId
2021-06-05T15:00:53 07,5lsddf,qyutxwr,l4buafm,0rd33cs
To get the output back in TSV format, use csvformat
from the same package:
$ csvjoin -I -t -c nameId file1.tsv file2.tsv | csvformat -T
LogEntryTime nameId PartnerId GroupId compnayId
2021-06-05T15:00:53 07 5lsddf qyutxwr l4buafm 0rd33cs
Note that -I
disables type inference - which can sometimes behave unexpectedly, especially with datetime fields.
Even simpler, using Miller (available from the universe repository, as package miller
):
$ mlr --tsv join -f file1.tsv -j nameId then reorder -f LogEntryTime file2.tsv
LogEntryTime nameId PartnerId GroupId compnayId
2021-06-05T15:00:53 07 5lsddf qyutxwr l4buafm 0rd33cs
The reorder
is necessary because by default mlr join
outputs the common field first (just like the system join
command). Note that for unsorted input, the whole of file1.tsv
will be loaded into memory.
Solution 2:
Loop one of the files into an array and replace the first field of the second file (which is nameId
) with the array index that correlates to the common field.
awk -F \\t+ -vOFS=\\t 'NR==FNR{a[$2]=$0;next} {$1=a[$1]}1' file{1,2}.txt
Solution 3:
With this particular set of data:
awk '
BEGIN {FS = OFS = "\t"}
NR == FNR {f1[$2] = $0; next}
{$1 = f1[$1]; print}
' file{1,2}.txt
Only the join field ($2 in file1, $1 in file2) is mentioned.
Produces the tab-separated output
LogEntryTime nameId PartnerId GroupId compnayId
2021-06-05T15:00:53 07 5lsddf qyutxwr l4buafm 0rd33cs
For pretty output, pipe into | column -t -s $'\t'
to get
LogEntryTime nameId PartnerId GroupId compnayId
2021-06-05T15:00:53 07 5lsddf qyutxwr l4buafm 0rd33cs