How to filter for columns using awk while keeping header line?
I have a file that looks like this. The file has 26 columns and ~4000 rows. I want to filter to keep the header (id to pc15) as well as filter to obtain the columns $1, $2, $12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26.
**id Id Study Site CancerType Sex Country unexpected_duplicates close_relatives genomic_sex CallPP pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15**
468768 1032 Req Karlsruhe_VN Breast Female Germany NA NA XX 0.999919340995249 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 Req Santiago Breast Female Spain NA NA XX 0.999925545534076 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 Req Santiago Breast Female Spain NA NA XX 0.999935886432121 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 Req Barcelona Breast Female Spain NA NA XX 0.999940022791339 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 Req Maastricht Prostate Male Netherlands NA NA XY 0.999952431868993 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539-0.444824
I have used the following command below but I get no output.
awk 'NR==1 {print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' PCA.covar.txt > PCA.covar2.txt
Can anyone help me to see what I am doing wrong. Thanks.
output
**id Id pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15**
468768 1032 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539-0.444824
It seems your input file is tab separated. If this is really the case then a simple cut
command could do the job:
cut -f1-2,12-26 file >output
did you want something like
awk 'NR==1{print;next;}{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' < input.file >output.file
NR==1{print;next;}
will print the first line unchanged then skip to the next input line
{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}
will print the selected fields for the remaining lines
edit:
if you need the corresponding header to the selected fields then there is no need to do anything special on row 1
awk '{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' < input.file >output.file
I would use GNU AWK
for this task following way, let file.txt
content be
**id Id Study Site CancerType Sex Country unexpected_duplicates close_relatives genomic_sex CallPP pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15**
468768 1032 Req Karlsruhe_VN Breast Female Germany NA NA XX 0.999919340995249 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 Req Santiago Breast Female Spain NA NA XX 0.999925545534076 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 Req Santiago Breast Female Spain NA NA XX 0.999935886432121 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 Req Barcelona Breast Female Spain NA NA XX 0.999940022791339 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 Req Maastricht Prostate Male Netherlands NA NA XY 0.999952431868993 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539-0.444824
then
awk 'NR==1{print}NR!=1{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' file.txt
output
**id Id Study Site CancerType Sex Country unexpected_duplicates close_relatives genomic_sex CallPP pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15**
468768 1032 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539-0.444824
Explanation: for first line (NR==1
) print
(whole line), for not first lines (NR!=1
) print
desired columns.
(tested in gawk 4.2.1)
Assumptions:
- the leading/trailing
**
in the header row is OP's attempt to apply the bold setting (ie, the**
do not exist in the file) - no header fields contain white space and ...
- no data fields contain white space ...
- otherwise OP will need to provide details about the field separator
One awk
idea:
awk '
{ printf $1 OFS $2
for (i=12;i<=26;i++)
printf OFS $i
printf "\n"
}
' PCA.covar.txt
This generates:
id Id pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15
468768 1032 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539 -0.444824