How to remove rows with similar data to keep only highest value in a specific column (tsv file) with awk in bash?
I have a very large .tsv file (80 GB) that I need to edit. It is made up of 5 columns. The last column represents a score. Some positions have multiple "Score" entries, and I need to keep only the row with the highest value for each position.
For example, these positions have multiple entries for each combination:
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
The desired output would look like this:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Doing it in python/pandas is not possible, as the file is too large or it takes too long. Therefore, I am looking for a solution using bash, in particular awk.
The input file has been sorted with the following command:
sort -t$'\t' -k1 -n -o sorted_file original_file
The command would basically need to:
- compare the data in the first 4 columns of the sorted_file
- if all of those are the same, then only the row with the highest value in column 5 should be printed to the output file
I am not very familiar with awk syntax. I have seen relatively similar questions in other forums, but I was unable to adapt them to my particular case. I have tried to adapt one of those solutions like this:
awk -F, 'NR==1 {print; next} NR==2 {key=$2; next}$2 != key {print lastval; key = $2} {lastval = $0} END {print lastval}' sorted_files.tsv > filtered_file.tsv
However, the output file does not look like it should, at all. Any help would be very much appreciated.
Solution 1:
A more robust way is to sort so that, within each group of the first four fields, the highest score comes first, and let awk pick the first value per group. If your fields don't contain spaces, there is no need to specify the delimiter.
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file | awk '!a[$1,$2,$3,$4]++' > max_value_file
As @Fravadona commented, since this stores the keys, it will have a large memory footprint if there are many unique records. One alternative is delegating to uniq to pick the first record over repeated entries.
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file |
awk '{print $5,$1,$2,$3,$4}' |
uniq -f1 |
awk '{print $2,$3,$4,$5,$1}'
We change the order of the fields to skip the value for the comparison and then change it back afterwards. This won't have any memory footprint (aside from sort, which manages its own memory).
If you're not a purist, this should work the same as the previous one:
$ sort -k1,1n -k2,2n -k3,3 -k4,4 -k5,5nr original_file | rev | uniq -f1 | rev
Solution 2:
It's not awk, but using Miller it is very easy and interesting:
mlr --tsv -N sort -f 1,2,3,4 -n 5 then top -f 5 -g 1,2,3,4 -a input.tsv >output.tsv
You will have:
1 861265 C A 1 0.148
1 861265 C G 1 0.108
1 861265 C T 1 0.216
2 193456 G A 1 0.094
2 193456 G C 1 0.152
2 193456 G T 1 0.056
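The extra column of 1s comes from the top verb (presumably its rank/index field). If you don't want it, one option (assuming the output is tab-separated, as --tsv implies) is to drop that column afterwards with cut:
mlr --tsv -N sort -f 1,2,3,4 -n 5 then top -f 5 -g 1,2,3,4 -a input.tsv | cut -f1-4,6 > output.tsv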
Solution 3:
You can try this approach. It also works on a non-sorted last column; only the first 4 columns have to be sorted.
% awk 'NR>1&&str!=$1" "$2" "$3" "$4{print line; m=0}
$5>=m{m=$5; line=$0}
{str=$1" "$2" "$3" "$4} END{print line}' file
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Data
% cat file
1 861265 C A 0.071
1 861265 C A 0.148
1 861265 C G 0.001
1 861265 C G 0.108
1 861265 C T 0
1 861265 C T 0.216
2 193456 G A 0.006
2 193456 G A 0.094
2 193456 G C 0.011
2 193456 G C 0.152
2 193456 G T 0.003
2 193456 G T 0.056
Solution 4:
Assumptions/Understandings:
- file is sorted by the first field
- no guarantee on the ordering of fields #2, #3 and #4
- must maintain the current row ordering (this would seem to rule out (re)sorting the file as we could lose the current row ordering)
- the complete set of output rows for a given group will fit into memory (aka the awk arrays)
General plan:
- we'll call field #1 the group field; all rows with the same value in field #1 are considered part of the same group
- for a given group we keep track of all output rows via the awk array arr[] (the index will be a combo of fields #2, #3, #4)
- we also keep track of the incoming row order via the awk array order[]
- update arr[] if we see a value in field #5 that's higher than the previous value
- when the group changes, flush the current contents of the arr[] array to stdout
One awk idea:
awk '
function flush() { # function to flush current group to stdout
for (i=1; i<=seq; i++)
print group,order[i],arr[order[i]]
delete arr # reset arrays
delete order
seq=0 # reset index for order[] array
}
BEGIN { FS=OFS="\t" }
$1!=group { flush()
group=$1
}
{ key=$2 OFS $3 OFS $4
if ( key in arr && $5 <= arr[key] )
next
if ( ! (key in arr) )
order[++seq]=key
arr[key]=$5
}
END { flush() } # flush last group to stdout
' input.dat
This generates:
1 861265 C A 0.148
1 861265 C G 0.108
1 861265 C T 0.216
2 193456 G A 0.094
2 193456 G C 0.152
2 193456 G T 0.056
Solution 5:
Updated
Extract from the sort manual:
-k, --key=KEYDEF
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end.
It means that by using sort -t$'\t' -k1 -n like you did, all the fields of the file have contributed to the numerical sorting.
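For comparison, a key specification that sorts only on the four grouping columns and then on the score in ascending order (which is what the awk below relies on) could look like this; treat it as a sketch and adjust the numeric/lexical flags to your data:
sort -t$'\t' -k1,1n -k2,2n -k3,3 -k4,4 -k5,5n original_file > sorted_file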
Here's probably the fastest awk solution that makes use of that numerical sorting in ascending order:
awk '
BEGIN {
FS = "\t"
if ((getline line) > 0) {
split(line, arr)
prev_key = arr[1] FS arr[2] FS arr[4]
prev_line = line
}
}
{
curr_key = $1 FS $2 FS $4
if (curr_key != prev_key) {
print prev_line
prev_key = curr_key
}
prev_line = $0
}
END {
if (prev_key) print prev_line
}
' file.tsv
Note: As you're handling a file that has around 4 billion lines, I tried to keep the number of operations to a minimum. For example:
- Saving 80 billion operations just by setting FS to "\t". Indeed, why would you allow awk to compare each character of the file with " " when you're dealing with a TSV?
- Saving 4 billion comparisons by processing the first line with getline in the BEGIN block. Some people might say that it's safer/better/cleaner to use (NR == 1) and/or (NR > 1), but that would mean doing 2 comparisons per input line instead of 0.
It may be worth comparing the execution time of this code with @EdMorton's code, which uses the same algorithm without those optimisations. Disk speed will probably flatten the difference though ^^
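A rough way to run such a comparison (the script names are placeholders for the two awk programs; writing to /dev/null keeps output disk writes out of the measurement):
time awk -f this_answer.awk file.tsv > /dev/null
time awk -f edmorton.awk file.tsv > /dev/null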