How to remove lines with a number less than 60 in column 3?
I have a large file. I need to remove all lines in a file which have a number less than 60 in column 3.
Example file:
35110 Bacteria(100) Proteobacteria(59) Alphaproteobacteria(59)
12713 Bacteria(100) Bacteroidetes(100) Bacteroidia(100)
Desired output:
12713 Bacteria(100) Bacteroidetes(100) Bacteroidia(100)
No need for Gawk extensions:
awk -F '[()]' '$4 >= 60'
Here the field separator given via -F is a regex bracket expression, [()]: fields are split at either an opening or a closing parenthesis, which is why the number from your 3rd column ends up in the 4th awk field.
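To see why the number lands in the 4th field, it can help to print the first few fields of one sample line. This is only an illustrative sketch; the variable names `line` and `fields` are made up for the demonstration.

```shell
# Split one sample line on "(" or ")" and show what awk puts in fields 1 to 4.
line='35110 Bacteria(100) Proteobacteria(59) Alphaproteobacteria(59)'
fields=$(printf '%s\n' "$line" |
  awk -F '[()]' '{ for (i = 1; i <= 4; i++) printf "$%d=[%s]\n", i, $i }')
printf '%s\n' "$fields"
# $1=[35110 Bacteria]
# $2=[100]
# $3=[ Proteobacteria]
# $4=[59]
```

Note that this relies on column 2 also carrying exactly one parenthesized number, as in the example file; otherwise the field index would shift.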
You can use awk (more precisely, it must be the GNU AWK implementation gawk, not mawk, which has fewer features; you might have to install it with sudo apt install gawk) for this job:
gawk '{match($3,/\((.+)\)/,m);if(m[1]>=60){print $0}}' MY_FILE
Now, although this admittedly looks like black magic to the untrained eye, the logic is simple:
- For every line, run the stuff inside the outermost curly braces:
  - First, match($3, /\((.+)\)/, m) matches the regular expression \((.+)\) (which matches an opening and a closing round bracket, storing the content between the brackets as the first capture group) against the third column $3 of the processed line of input, and stores the resulting match array in the variable m.
  - Then, check the condition if (m[1] >= 60), i.e. whether the value of the first capture group of the match (whatever is between the brackets in the input) is greater than or equal to 60. If that is true, do {print $0}, which simply prints the whole currently processed line.
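For comparison, the same two steps can be sketched without gawk's three-argument match(). This is a swapped-in portable variant, not the command above: it uses the standard two-argument match(), which sets RSTART and RLENGTH, plus substr(). The function name `keep_ge_60` is hypothetical.

```shell
# Portable sketch of the same per-line logic (no gawk-only features).
keep_ge_60() {
  awk '{
    # Step 1: look for "(...)" in column 3; match() sets RSTART and RLENGTH.
    if (match($3, /\([^)]+\)/)) {
      n = substr($3, RSTART + 1, RLENGTH - 2)  # text between the brackets
      # Step 2: n + 0 forces a numeric comparison; print the line if it passes.
      if (n + 0 >= 60) print
    }
  }' "$@"
}
```

It reads from the files given as arguments, or from standard input if there are none, e.g. keep_ge_60 MY_FILE.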
Here's a Perl alternative:
perl -alne 'print unless $F[2] =~ /\((\d+)\)$/ && $1 < 60'
- match and capture a parenthesized sequence of decimal digits at the end of the 3rd field ($F[2], since the @F array is zero-indexed)
- if a match is found, test the captured group's numerical value and print accordingly
Example:
$ perl -alne 'print unless $F[2] =~ /\((\d+)\)$/ && $1 < 60' file
12713 Bacteria(100) Bacteroidetes(100) Bacteroidia(100)
Note that this implements the logic "remove all lines in a file which have a number less than 60 in column 3" as stated in your question - which is slightly different from printing lines that have a number greater than or equal to 60.
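The difference only shows up on lines whose 3rd column has no parenthesized number at all. A small sketch, assuming a made-up third sample line (the 99999 line is hypothetical and not from the question), compares the Perl command with the awk -F '[()]' one-liner from the other answer:

```shell
# Hypothetical sample: the last line has no "(number)" in column 3.
sample='35110 Bacteria(100) Proteobacteria(59) Alphaproteobacteria(59)
12713 Bacteria(100) Bacteroidetes(100) Bacteroidia(100)
99999 Bacteria(100) unclassified unclassified'

# "Remove lines with a number below 60": the line without a number survives.
removed_lt_60=$(printf '%s\n' "$sample" |
  perl -alne 'print unless $F[2] =~ /\((\d+)\)$/ && $1 < 60')

# "Keep lines with a number of at least 60": the line without a number is dropped.
kept_ge_60=$(printf '%s\n' "$sample" | awk -F '[()]' '$4 >= 60')

printf 'perl:\n%s\n\nawk:\n%s\n' "$removed_lt_60" "$kept_ge_60"
```

Here the Perl command prints the 12713 and 99999 lines, while the awk command prints only the 12713 line.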
If your files really are comma-separated (rather than whitespace-delimited as shown in your question), then you will need to change the delimiter, e.g.
perl -F, -lne 'print unless $F[2] =~ /\((\d+)\)$/ && $1 < 60'