Filter file when a column is in whitelist

There are many ways of doing this. Here are a few, but use the Perl one; it is by far the fastest. I include the others for the sake of completeness:

  1. Perl and hashes, ridiculously fast

    perl -e 'open(A,"fileB"); while(<A>){chomp; $k{$_}++} 
     while(<>){@a=split(/,/); print if defined $k{$a[0]}}' fileA
    
  2. gawk and associative arrays, much slower

     gawk '{if(FILENAME==ARGV[1]){array[$1]=1}
      else{if($1 in array){print}}}' fileA fileB
    
  3. grep, ridiculously slow. You will need to modify your fileB slightly so that each pattern matches only at the beginning of the line (illustrated just after this list)

    sed 's/\(.\)/^\1/' fileB > fileC
    grep -f fileC fileA  
    

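For reference, the sed step in option 3 just anchors each whitelist entry to the start of the line. Assuming fileB starts with the two GO keys shown in the benchmark below, the generated patterns would look like this:

    $ sed 's/\(.\)/^\1/' fileB | head -2
    ^GO:0032513_GO:0050129
    ^GO:0050129_GO:0051712
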
I created a couple of test files, and it turns out that the Perl solution is much faster than the rest:

    $ head -2 fileA
    GO:0032513_GO:0050129
    GO:0050129_GO:0051712
    $ head -2 fileB
    GO:0032513_GO:0050129
    GO:0050129_GO:0051712
    $ wc -l fileA fileB
      1500000 fileA
     20000000 fileB
    $ time perl -e 'open(A,"fileB"); while(<A>){chomp; $k{$_}++} 
     while(<>){@a=split(/,/); print if defined $k{$a[0]}}' fileA > /dev/null

    real    0m41.354s
    user    0m37.370s
    sys     0m3.960s
    $ time gawk '{if(FILENAME==ARGV[1]){array[$1]=1}
       else{if($1 in array){print}}}' fileA fileB

    real    2m30.963s
    user    1m23.857s
    sys     0m9.385s
    $ time (join -t, <(sort -n fileA) <(sort -n fileB) >/dev/null)

    real    8m29.532s
    user    13m52.576s
    sys     1m22.029s

So, the Perl scriptlet can load 20 million whitelist keys into a hash, filter a 1.5 million line file against them, and finish in ~40 seconds. Not bad. The others are much slower: gawk took 2.5 minutes, the join took about 8.5, and the grep one has been running for more than 15 minutes. Perl wins hands down.
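
For readability, here is a commented sketch of the same hash-lookup idea as the winning one-liner. The hard-coded fileB and the comma split mirror the one-liner; the script name whitelist.pl is just a placeholder:

    #!/usr/bin/env perl
    # whitelist.pl -- expanded form of the one-liner above: load every line of
    # fileB as a hash key, then print the lines read from the command-line
    # file(s) whose first comma-separated field is one of those keys.
    use strict;
    use warnings;

    my %seen;
    open(my $wl, '<', 'fileB') or die "cannot open fileB: $!";
    while (my $key = <$wl>) {
        chomp $key;
        $seen{$key} = 1;           # one hash lookup per line is what makes this fast
    }
    close $wl;

    while (my $line = <>) {        # e.g. perl whitelist.pl fileA
        my ($first) = split /,/, $line;
        chomp $first;              # in case the line has no comma at all
        print $line if exists $seen{$first};
    }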


This should do the trick:

    join -t, <(sort A) <(sort B)
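
One caveat, assuming GNU join: both inputs must be sorted lexicographically on the join field (plain sort, not sort -n), and -t, makes the first comma-separated field the key. With the file names used in the benchmark above it would look like this (filtered is just a hypothetical output name):

    join -t, <(sort fileA) <(sort fileB) > filtered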