How to remove rows with similar data to keep only highest value in a specific column (tsv file) with awk in bash?
I have a very large .tsv file (80 GB) that I need to edit. It is made up of 5 columns. The last column represents a score. Some positions have multiple "Score" entries, and I need to keep only the row with the highest value for each position.
For example, these positions have multiple entries for each combination:
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
The desired output would look like this:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Doing it in python/pandas is not possible, as the file is too large or it takes too long. Therefore, I am looking for a solution using bash, in particular awk.
The input file has been sorted with the following command:
sort -t$'\t' -k1 -n -o sorted_file original_file
The command would basically need to:
- compare the data in the first 4 columns of the sorted_file
- if all of those are the same, then only the row with the highest value in column 5 should be printed to the output file
I am not very familiar with awk syntax. I have seen relatively similar questions in other forums, but I was unable to adapt them to my particular case. I have tried to adapt one of those solutions like this:
awk -F, 'NR==1 {print; next} NR==2 {key=$2; next}$2 != key {print lastval; key = $2} {lastval = $0} END {print lastval}' sorted_files.tsv > filtered_file.tsv
However, the output file does not look like it should, at all. Any help would be very much appreciated.
Solution 1:
A more robust way is to sort so that, within each group of the first four fields, the highest score comes first, and let awk pick the first value per group. If your fields don't contain spaces, there is no need to specify the delimiter.
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file | awk '!a[$1,$2,$3,$4]++' > max_value_file
As @Fravadona commented, since this stores the keys, it will have a large memory footprint if there are many unique records. One alternative is delegating to uniq to pick the first record over repeated entries.
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file |
awk '{print $5,$1,$2,$3,$4}' |
uniq -f1 |
awk '{print $2,$3,$4,$5,$1}'
We change the order of the fields to skip the value for the comparison and then change it back afterwards. This won't have any memory footprint (aside from sort, which manages its own memory).
If you're not a purist, this should work the same as the previous one:
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file | rev | uniq -f1 | rev
Solution 2:
It's not awk, but using Miller it is very easy and interesting:
mlr --tsv -N sort -f 1,2,3,4 -n 5 then top -f 5 -g 1,2,3,4 -a input.tsv >output.tsv
You will have:
1 861265 C A 1 0.148
1 861265 C G 1 0.108
1 861265 C T 1 0.216
2 193456 G A 1 0.094
2 193456 G C 1 0.152
2 193456 G T 1 0.056
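The extra column of 1s comes from the top verb (presumably its rank/index field). If you don't want it, one option (assuming the output is tab-separated, as --tsv implies) is to drop that column afterwards with cut:
mlr --tsv -N sort -f 1,2,3,4 -n 5 then top -f 5 -g 1,2,3,4 -a input.tsv | cut -f1-4,6 > output.tsv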
Solution 3:
You can try this approach. It also works on a non-sorted last column; only the first 4 columns have to be sorted.
% awk 'NR>1&&str!=$1" "$2" "$3" "$4{print line; m=0}
$5>=m{m=$5; line=$0}
{str=$1" "$2" "$3" "$4} END{print line}' file
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Data
% cat file
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
Solution 4:
Assumptions/Understandings:
- file is sorted by the first field
- no guarantee on the ordering of fields #2, #3 and #4
- must maintain the current row ordering (this would seem to rule out (re)sorting the file as we could lose the current row ordering)
- the complete set of output rows for a given group will fit into memory (aka the awk arrays)
General plan:
- we'll call field #1 the group field; all rows with the same value in field #1 are considered part of the same group
- for a given group we keep track of all output rows via the awk array arr[] (the index will be a combo of fields #2, #3, #4)
- we also keep track of the incoming row order via the awk array order[]
- update arr[] if we see a value in field #5 that's higher than the previous value
- when the group changes, flush the current contents of the arr[] array to stdout
One awk idea:
awk '
function flush() { # function to flush current group to stdout
for (i=1; i<=seq; i++)
print group,order[i],arr[order[i]]
delete arr # reset arrays
delete order
seq=0 # reset index for order[] array
}
BEGIN { FS=OFS="\t" }
$1!=group { flush()
group=$1
}
{ key=$2 OFS $3 OFS $4
if ( key in arr && $5 <= arr[key] )
next
if ( ! (key in arr) )
order[++seq]=key
arr[key]=$5
}
END { flush() } # flush last group to stdout
' input.dat
This generates:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Solution 5:
Updated
Extract from the sort manual:
-k, --key=KEYDEF
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
It means that by using sort -t$'\t' -k1 -n like you did, all the fields of the file have contributed to the numerical sorting.
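For comparison, a key specification that sorts only on the four grouping columns and then on the score in ascending order (which is what the awk below relies on) could look like this; treat it as a sketch and adjust the numeric/lexical flags to your data:
sort -t$'\t' -k1,1n -k2,2n -k3,3 -k4,4 -k5,5n original_file > sorted_file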
Here's probably the fastest awk solution that makes use of that numerical sorting in ascending order:
awk '
BEGIN {
FS = "\t"
if ((getline line) > 0) {
split(line, arr)
prev_key = arr[1] FS arr[2] FS arr[4]
prev_line = line
}
}
{
curr_key = $1 FS $2 FS $4
if (curr_key != prev_key) {
print prev_line
prev_key = curr_key
}
prev_line = $0
}
END {
if (prev_key) print prev_line
}
' file.tsv
Note: As you're handling a file that has around 4 billion lines, I tried to keep the number of operations to a minimum. For example:
- Saving 80 billion operations just by setting FS to "\t". Indeed, why would you allow awk to compare each character of the file with " " when you're dealing with a TSV?
- Saving 4 billion comparisons by processing the first line with getline in the BEGIN block. Some people might say that it's safer/better/cleaner to use (NR == 1) and/or (NR > 1), but that would mean doing 2 comparisons per input line instead of 0.
It may be worth comparing the execution time of this code with @EdMorton's code, which uses the same algorithm without those optimisations. Disk speed will probably flatten the difference though ^^
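A rough way to run such a comparison (the script names are placeholders for the two awk programs; writing to /dev/null keeps output disk writes out of the measurement):
time awk -f this_answer.awk file.tsv > /dev/null
time awk -f edmorton.awk file.tsv > /dev/null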