How to extract multiple bits of information that appear on different lines within the same text file

I am trying to extract the sequence ID and cluster number that occur on different lines within the same text file.

The input looks like this:

>Cluster 72
0   319aa, >O311_01007... *
>Cluster 73
0   318aa, >1494_00753... *
1   318aa, >1621_00002... at 99.69%
2   318aa, >1622_00575... at 99.37%
3   318aa, >1633_00422... at 99.37%
4   318aa, >O136_00307... at 99.69%
>Cluster 74
0   318aa, >O139_01028... *
1   318aa, >O142_00961... at 99.69%
>Cluster 75
0   318aa, >O300_00856... *

The desired output is the sequence ID in one column and the corresponding cluster number in the second.

>O311_01007  72
>1494_00753  73
>1621_00002  73
>1622_00575  73
>1633_00422  73
>O136_00307  73
>O139_01028  74
>O142_00961  74
>O300_00856  75

Can anyone help with this?


With awk:

awk -F '[. ]+' 'NF == 2 {id = $2; next} {print $3, id}' input-file
  • -F '[. ]+' splits fields on runs of spaces or periods
  • lines with exactly two fields (the >Cluster lines) save the second field, the cluster number, as id and skip to the next line
  • every other line prints its third field (the sequence ID, whose trailing periods were consumed as a separator) followed by the saved cluster number
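
To see what the field splitting produces, here is a minimal sketch that prints the field count and the fields awk sees for one line of each kind. The helper name show_fields is made up for illustration, and the separator '[. ]+' means one or more spaces or periods.

```shell
# show_fields is a hypothetical helper, not part of the answer above:
# it prints the number of fields and then each field in brackets.
show_fields() {
  awk -F '[. ]+' '{
    printf "NF=%d:", NF
    for (i = 1; i <= NF; i++) printf " [%s]", $i
    print ""
  }'
}

printf '%s\n' '>Cluster 73' '0   318aa, >1494_00753... *' | show_fields
# NF=2: [>Cluster] [73]
# NF=4: [0] [318aa,] [>1494_00753] [*]
```

Only the >Cluster line has exactly two fields, which is why NF == 2 can tell the two line types apart, and the trailing periods are swallowed by the separator, so the third field is already the bare sequence ID.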

You can use awk for this:

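awk '/>Cluster/{
      c=$2;
      next
    }{
      print substr($3,1,length($3)-3), c
    }' file

The first block captures the cluster number from each >Cluster line. The second block (which runs for every other line) takes the third field, uses substr to drop its three trailing periods so that >O311_01007... becomes >O311_01007, and prints it followed by the cluster number.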


Here's an alternative with Ruby as a one-liner:

ruby -ne 'case $_; when /^>Cluster (\d+)/;id = $1;when /, (>\w{4}_\w{5})\.\.\./;puts "#{$1} #{id}";end' input_file

or spread on multiple lines:

ruby -ne 'case $_
when /^>Cluster (\d+)/
  id = $1
when /, (>\w{4}_\w{5})\.\.\./
  puts "#{$1} #{id}"
end' input_file

This is probably only more readable than the awk versions if you already know Ruby and regular expressions. As a bonus, it may be a bit more robust than simply splitting the lines, because it matches the text surrounding each value and skips any line that fits neither pattern.
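
For comparison, the same pattern-matching idea can be sketched in awk. This is only an illustration, assuming that every sequence ID starts with > and is followed by three periods, as in the sample input; the function name extract_ids is made up.

```shell
# extract_ids is a hypothetical helper; it reads the cluster listing on stdin.
# Usage: extract_ids < input_file
extract_ids() {
  awk '
    # Cluster header: remember the cluster number.
    /^>Cluster [0-9]+/ { cluster = $2; next }
    # Member line: find ">ID..." and print the ID without the three periods.
    match($0, />[^.]+\.\.\./) {
      print substr($0, RSTART, RLENGTH - 3), cluster
    }
  '
}
```

Lines that match neither pattern are skipped rather than printed in a mangled form.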


Perl:

$ perl -ne 'if(/^>.*?(\d+)/){$n=$1;}else{ s/.*(>[^.]+).*/$1 $n/; print}' file 
>O311_01007 72
>1494_00753 73
>1621_00002 73
>1622_00575 73
>1633_00422 73
>O136_00307 73
>O139_01028 74
>O142_00961 74
>O300_00856 75

Explanation

  • perl -ne: read the input file line by line (-n) and apply the script given by -e to each line.
  • if(/^>.*?(\d+)/){$n=$1;} : if the line starts with a >, capture the first run of digits that follows it (the cluster number) and save it as $n.
  • else{ s/.*(>[^.]+).*/$1 $n/; print} : if the line doesn't start with >, replace the whole line with the stretch of non-. characters that begins at the last > (>[^.]+), i.e. the sequence name (available as $1 because the parentheses capture it), followed by the current value of $n, and print the result.

Or, for a more awk-like approach:

$ perl -lane 'if($#F==1){$n=$F[1]}else{$F[2]=~s/\.+$//; print "$F[2] $n"}' file 
>O311_01007 72
>1494_00753 73
>1621_00002 73
>1622_00575 73
>1633_00422 73
>O136_00307 73
>O139_01028 74
>O142_00961 74
>O300_00856 75

This is just a slightly more cumbersome way of implementing the same basic idea as the various awk approaches. I am including it for the sake of completeness and for the Perl fans. If you need an explanation, just use the awk solutions :).