How to filter for columns using awk while keeping header line?

I have a file that looks like this. The file has 26 columns and ~4000 rows. I want to filter to keep the header (id to pc15) as well as filter to obtain the columns $1, $2, $12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26.

**id    Id  Study   Site    CancerType  Sex Country unexpected_duplicates   close_relatives genomic_sex CallPP  pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10    pc11    pc12    pc13    pc14    pc15**
468768  1032    Req Karlsruhe_VN    Breast  Female  Germany NA  NA  XX  0.999919340995249   -0.0731995  -0.0180998  -0.598532   0.0465712   0.152631    1.3425  -0.716615   -1.15831    -0.477422   0.429214    -0.5249 -0.793306   0.274061    0.608845    0.0224554
468769  1405    Req Santiago    Breast  Female  Spain   NA  NA  XX  0.999925545534076   -1.39583    -0.450994   0.156784    2.28138 -0.259947   2.83107 0.335012    0.632872    1.03957 -0.53202    -0.162737   -0.739506   -0.040795   0.249346    0.279228
468770  1564    Req Santiago    Breast  Female  Spain   NA  NA  XX  0.999935886432121   -0.960775   -0.580039   -0.00959004 2.28675 -0.295607   2.43853 -0.102007   1.01575 -0.083289   1.0861  -1.07338    1.2819  -0.132876   -0.303037   0.9752
468771  1610    Req Barcelona   Breast  Female  Spain   NA  NA  XX  0.999940022791339   -1.32007    -0.852952   -0.0532576  2.52405 -0.189117   3.07359 1.31524 0.637381    -1.36214    -0.0246524  0.708741    0.502428    -0.437373   -0.192966   0.331765
468772  RQ56001-9   Req Maastricht  Prostate    Male    Netherlands NA  NA  XY  0.999952431868993   0.13766 -0.3691 0.420061    -0.490546   0.655668    0.547926    -0.614815   0.62115 0.783559    -0.163262   -0.660511   -1.08647    -0.668259   -0.331539-0.444824

I have used the following command below but I get no output.

awk 'NR==1 {print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' PCA.covar.txt > PCA.covar2.txt

Can anyone help me to see what I am doing wrong. Thanks.

output

**id    Id  pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10    pc11    pc12    pc13    pc14    pc15**
468768  1032 -0.0731995 -0.0180998  -0.598532   0.0465712   0.152631    1.3425  -0.716615   -1.15831    -0.477422   0.429214    -0.5249 -0.793306   0.274061    0.608845    0.0224554
468769  1405    -1.39583    -0.450994   0.156784    2.28138 -0.259947   2.83107 0.335012    0.632872    1.03957 -0.53202    -0.162737   -0.739506   -0.040795   0.249346    0.279228
468770  1564 -0.960775  -0.580039   -0.00959004 2.28675 -0.295607   2.43853 -0.102007   1.01575 -0.083289   1.0861  -1.07338    1.2819  -0.132876   -0.303037   0.9752
468771  1610  -1.32007  -0.852952   -0.0532576  2.52405 -0.189117   3.07359 1.31524 0.637381    -1.36214    -0.0246524  0.708741    0.502428    -0.437373   -0.192966   0.331765
468772  RQ56001-9 0.13766   -0.3691 0.420061    -0.490546   0.655668    0.547926    -0.614815   0.62115 0.783559    -0.163262   -0.660511   -1.08647    -0.668259   -0.331539-0.444824

It seems your input file is tab separated. If this is really the case then a simple cut command could do the job:

cut -f1-2,12-26 file >output

did you want something like

awk 'NR==1{print;next;}{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' < input.file >output.file

NR==1{print;next;} will print the first line unchanged then skip to the next input line

{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26} will print the selected fields for the remaining lines

edit:

if you need the corresponding header to the selected fields then there is no need to do anything special on row 1

awk '{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' < input.file >output.file

I would use GNU AWK for this task following way, let file.txt content be

**id    Id  Study   Site    CancerType  Sex Country unexpected_duplicates   close_relatives genomic_sex CallPP  pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10    pc11    pc12    pc13    pc14    pc15**
468768  1032    Req Karlsruhe_VN    Breast  Female  Germany NA  NA  XX  0.999919340995249   -0.0731995  -0.0180998  -0.598532   0.0465712   0.152631    1.3425  -0.716615   -1.15831    -0.477422   0.429214    -0.5249 -0.793306   0.274061    0.608845    0.0224554
468769  1405    Req Santiago    Breast  Female  Spain   NA  NA  XX  0.999925545534076   -1.39583    -0.450994   0.156784    2.28138 -0.259947   2.83107 0.335012    0.632872    1.03957 -0.53202    -0.162737   -0.739506   -0.040795   0.249346    0.279228
468770  1564    Req Santiago    Breast  Female  Spain   NA  NA  XX  0.999935886432121   -0.960775   -0.580039   -0.00959004 2.28675 -0.295607   2.43853 -0.102007   1.01575 -0.083289   1.0861  -1.07338    1.2819  -0.132876   -0.303037   0.9752
468771  1610    Req Barcelona   Breast  Female  Spain   NA  NA  XX  0.999940022791339   -1.32007    -0.852952   -0.0532576  2.52405 -0.189117   3.07359 1.31524 0.637381    -1.36214    -0.0246524  0.708741    0.502428    -0.437373   -0.192966   0.331765
468772  RQ56001-9   Req Maastricht  Prostate    Male    Netherlands NA  NA  XY  0.999952431868993   0.13766 -0.3691 0.420061    -0.490546   0.655668    0.547926    -0.614815   0.62115 0.783559    -0.163262   -0.660511   -1.08647    -0.668259   -0.331539-0.444824

then

awk 'NR==1{print}NR!=1{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' file.txt

output

**id    Id  Study   Site    CancerType  Sex Country unexpected_duplicates   close_relatives genomic_sex CallPP  pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10    pc11    pc12    pc13    pc14    pc15**
468768 1032 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539-0.444824

Explanation: for first line (NR==1) print (whole line), for not first lines (NR!=1) print desired columns.

(tested in gawk 4.2.1)

Assumptions:

the leading/trailing ** in the header row is OP's attempt to apply the bold setting (ie, the ** do not exist in the file)
no header fields contain white space and ...
no data fields contain white space ...
otherwise OP will need to provide details about the field separator

One awk idea:

awk '
{ printf $1 OFS $2
  for (i=12;i<=26;i++)
      printf OFS $i
  printf "\n"
}
' PCA.covar.txt

This generates:

id Id pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15
468768 1032 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539 -0.444824

How to filter for columns using awk while keeping header line?

Related

Recent Posts