How to filter for columns using awk while keeping header line?

I have a file that looks like this. The file has 26 columns and ~4000 rows. I want to filter to keep the header (id to pc15) as well as filter to obtain the columns $1, $2, $12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26.

**id    Id  Study   Site    CancerType  Sex Country unexpected_duplicates   close_relatives genomic_sex CallPP  pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10    pc11    pc12    pc13    pc14    pc15**
468768  1032    Req Karlsruhe_VN    Breast  Female  Germany NA  NA  XX  0.999919340995249   -0.0731995  -0.0180998  -0.598532   0.0465712   0.152631    1.3425  -0.716615   -1.15831    -0.477422   0.429214    -0.5249 -0.793306   0.274061    0.608845    0.0224554
468769  1405    Req Santiago    Breast  Female  Spain   NA  NA  XX  0.999925545534076   -1.39583    -0.450994   0.156784    2.28138 -0.259947   2.83107 0.335012    0.632872    1.03957 -0.53202    -0.162737   -0.739506   -0.040795   0.249346    0.279228
468770  1564    Req Santiago    Breast  Female  Spain   NA  NA  XX  0.999935886432121   -0.960775   -0.580039   -0.00959004 2.28675 -0.295607   2.43853 -0.102007   1.01575 -0.083289   1.0861  -1.07338    1.2819  -0.132876   -0.303037   0.9752
468771  1610    Req Barcelona   Breast  Female  Spain   NA  NA  XX  0.999940022791339   -1.32007    -0.852952   -0.0532576  2.52405 -0.189117   3.07359 1.31524 0.637381    -1.36214    -0.0246524  0.708741    0.502428    -0.437373   -0.192966   0.331765
468772  RQ56001-9   Req Maastricht  Prostate    Male    Netherlands NA  NA  XY  0.999952431868993   0.13766 -0.3691 0.420061    -0.490546   0.655668    0.547926    -0.614815   0.62115 0.783559    -0.163262   -0.660511   -1.08647    -0.668259   -0.331539-0.444824

I have used the following command below but I get no output.

awk 'NR==1 {print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' PCA.covar.txt > PCA.covar2.txt

Can anyone help me to see what I am doing wrong. Thanks.

output

**id    Id  pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10    pc11    pc12    pc13    pc14    pc15**
468768  1032 -0.0731995 -0.0180998  -0.598532   0.0465712   0.152631    1.3425  -0.716615   -1.15831    -0.477422   0.429214    -0.5249 -0.793306   0.274061    0.608845    0.0224554
468769  1405    -1.39583    -0.450994   0.156784    2.28138 -0.259947   2.83107 0.335012    0.632872    1.03957 -0.53202    -0.162737   -0.739506   -0.040795   0.249346    0.279228
468770  1564 -0.960775  -0.580039   -0.00959004 2.28675 -0.295607   2.43853 -0.102007   1.01575 -0.083289   1.0861  -1.07338    1.2819  -0.132876   -0.303037   0.9752
468771  1610  -1.32007  -0.852952   -0.0532576  2.52405 -0.189117   3.07359 1.31524 0.637381    -1.36214    -0.0246524  0.708741    0.502428    -0.437373   -0.192966   0.331765
468772  RQ56001-9 0.13766   -0.3691 0.420061    -0.490546   0.655668    0.547926    -0.614815   0.62115 0.783559    -0.163262   -0.660511   -1.08647    -0.668259   -0.331539-0.444824

It seems your input file is tab separated. If this is really the case then a simple cut command could do the job:

cut -f1-2,12-26 file >output

did you want something like

awk 'NR==1{print;next;}{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' < input.file >output.file

NR==1{print;next;} will print the first line unchanged then skip to the next input line

{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26} will print the selected fields for the remaining lines

edit:

if you need the corresponding header to the selected fields then there is no need to do anything special on row 1

awk '{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' < input.file >output.file

I would use GNU AWK for this task following way, let file.txt content be

**id    Id  Study   Site    CancerType  Sex Country unexpected_duplicates   close_relatives genomic_sex CallPP  pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10    pc11    pc12    pc13    pc14    pc15**
468768  1032    Req Karlsruhe_VN    Breast  Female  Germany NA  NA  XX  0.999919340995249   -0.0731995  -0.0180998  -0.598532   0.0465712   0.152631    1.3425  -0.716615   -1.15831    -0.477422   0.429214    -0.5249 -0.793306   0.274061    0.608845    0.0224554
468769  1405    Req Santiago    Breast  Female  Spain   NA  NA  XX  0.999925545534076   -1.39583    -0.450994   0.156784    2.28138 -0.259947   2.83107 0.335012    0.632872    1.03957 -0.53202    -0.162737   -0.739506   -0.040795   0.249346    0.279228
468770  1564    Req Santiago    Breast  Female  Spain   NA  NA  XX  0.999935886432121   -0.960775   -0.580039   -0.00959004 2.28675 -0.295607   2.43853 -0.102007   1.01575 -0.083289   1.0861  -1.07338    1.2819  -0.132876   -0.303037   0.9752
468771  1610    Req Barcelona   Breast  Female  Spain   NA  NA  XX  0.999940022791339   -1.32007    -0.852952   -0.0532576  2.52405 -0.189117   3.07359 1.31524 0.637381    -1.36214    -0.0246524  0.708741    0.502428    -0.437373   -0.192966   0.331765
468772  RQ56001-9   Req Maastricht  Prostate    Male    Netherlands NA  NA  XY  0.999952431868993   0.13766 -0.3691 0.420061    -0.490546   0.655668    0.547926    -0.614815   0.62115 0.783559    -0.163262   -0.660511   -1.08647    -0.668259   -0.331539-0.444824

then

awk 'NR==1{print}NR!=1{print $1,$2,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22,$23,$24,$25,$26}' file.txt

output

**id    Id  Study   Site    CancerType  Sex Country unexpected_duplicates   close_relatives genomic_sex CallPP  pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10    pc11    pc12    pc13    pc14    pc15**
468768 1032 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539-0.444824 

Explanation: for first line (NR==1) print (whole line), for not first lines (NR!=1) print desired columns.

(tested in gawk 4.2.1)


Assumptions:

  • the leading/trailing ** in the header row is OP's attempt to apply the bold setting (ie, the ** do not exist in the file)
  • no header fields contain white space and ...
  • no data fields contain white space ...
  • otherwise OP will need to provide details about the field separator

One awk idea:

awk '
{ printf $1 OFS $2
  for (i=12;i<=26;i++)
      printf OFS $i
  printf "\n"
}
' PCA.covar.txt

This generates:

id Id pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15
468768 1032 -0.0731995 -0.0180998 -0.598532 0.0465712 0.152631 1.3425 -0.716615 -1.15831 -0.477422 0.429214 -0.5249 -0.793306 0.274061 0.608845 0.0224554
468769 1405 -1.39583 -0.450994 0.156784 2.28138 -0.259947 2.83107 0.335012 0.632872 1.03957 -0.53202 -0.162737 -0.739506 -0.040795 0.249346 0.279228
468770 1564 -0.960775 -0.580039 -0.00959004 2.28675 -0.295607 2.43853 -0.102007 1.01575 -0.083289 1.0861 -1.07338 1.2819 -0.132876 -0.303037 0.9752
468771 1610 -1.32007 -0.852952 -0.0532576 2.52405 -0.189117 3.07359 1.31524 0.637381 -1.36214 -0.0246524 0.708741 0.502428 -0.437373 -0.192966 0.331765
468772 RQ56001-9 0.13766 -0.3691 0.420061 -0.490546 0.655668 0.547926 -0.614815 0.62115 0.783559 -0.163262 -0.660511 -1.08647 -0.668259 -0.331539 -0.444824