step_pca() arguments are not being applied

I'm new to tidymodels, but the step_pca() arguments such as num_comp or threshold do not seem to be applied when the recipe is trained. In the example below, I still get 4 components despite setting num_comp = 2.

library(tidyverse)
library(tidymodels)
rec <- recipe( ~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2)

prep(rec) %>% tidy(number = 2, type = "coef") %>%
  pivot_wider(names_from = component, values_from = value, id_cols = terms)
#> # A tibble: 4 x 5
#>   terms       PC1    PC2    PC3     PC4
#>   <chr>     <dbl>  <dbl>  <dbl>   <dbl>
#> 1 Murder   -0.536  0.418 -0.341  0.649 
#> 2 Assault  -0.583  0.188 -0.268 -0.743 
#> 3 UrbanPop -0.278 -0.873 -0.378  0.134 
#> 4 Rape     -0.543 -0.167  0.818  0.0890

Solution 1:

The full PCA is still computed (so you can still get the variance of each component), and num_comp only specifies how many of the components are retained as predictors. That is why tidy() shows loadings for all four components. If you want to cap the maximal rank of the PCA itself, you can pass that through options:

library(recipes)
rec <- recipe( ~ ., data = USArrests) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))

prep(rec) %>% tidy(number = 2, type = "coef")
#> # A tibble: 8 × 4
#>   terms     value component id       
#>   <chr>     <dbl> <chr>     <chr>    
#> 1 Murder   -0.536 PC1       pca_AoFOm
#> 2 Assault  -0.583 PC1       pca_AoFOm
#> 3 UrbanPop -0.278 PC1       pca_AoFOm
#> 4 Rape     -0.543 PC1       pca_AoFOm
#> 5 Murder    0.418 PC2       pca_AoFOm
#> 6 Assault   0.188 PC2       pca_AoFOm
#> 7 UrbanPop -0.873 PC2       pca_AoFOm
#> 8 Rape     -0.167 PC2       pca_AoFOm

Created on 2022-01-12 by the reprex package (v2.0.1)
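
Because the full decomposition is kept when no rank is set, the variance of every component can still be inspected from the original recipe. A minimal sketch, assuming the type = "variance" option of tidy() for step_pca() and the term label "percent variance" that it returns (the object name pca_variance is only illustrative):

```r
library(recipes)

# Same recipe as in the question: num_comp = 2, no rank. option
rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2)

# Variance summaries for the components of the fitted PCA
pca_variance <- prep(rec) %>%
  tidy(number = 2, type = "variance")

# Keep only the percent of variance explained by each component
pca_variance %>%
  filter(terms == "percent variance")
```

This should list all four components even though only two are kept as predictors, which is the same behaviour seen in the coefficient table from the question.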

You could also control this via the tol argument from stats::prcomp(), also passed in as an option.
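
Since the options are handed to stats::prcomp(), the effect of rank. and tol can be checked in base R without a recipe. The sketch below is only an illustration: tol = 0.5 is an arbitrary value chosen for this data set, and scale() stands in for step_normalize().

```r
# Standardize the columns, as step_normalize() does in the recipe
x <- scale(USArrests)

# Default: all four components are kept
full <- prcomp(x)

# rank. caps the number of components that are returned
ranked <- prcomp(x, rank. = 2)

# tol drops components whose standard deviation is at most
# tol times that of the first component (0.5 is illustrative)
trimmed <- prcomp(x, tol = 0.5)

# Number of components in each rotation matrix
ncol(full$rotation)
ncol(ranked$rotation)
ncol(trimmed$rotation)

# Ratios that tol is compared against
round(full$sdev / full$sdev[1], 3)
```

For USArrests, a tol of 0.5 happens to leave two components, the same as rank. = 2. In the recipe, the equivalent would presumably be options = list(tol = 0.5) in step_pca(); note that the right tol depends on the data, whereas rank. fixes the count directly.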

Solution 2:

If you bake the recipe, it seems to work as intended: only the two requested components are returned. I don't know what you aim to achieve afterward, though.

library(tidyverse)
library(tidymodels)

USArrests <- USArrests %>% 
  rownames_to_column("Countries")

rec <- 
  recipe( ~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2)

prep(rec) %>% 
  bake(new_data = NULL)
#> # A tibble: 50 x 3
#>    Countries       PC1     PC2
#>    <fct>         <dbl>   <dbl>
#>  1 Alabama     -0.976   1.12  
#>  2 Alaska      -1.93    1.06  
#>  3 Arizona     -1.75   -0.738 
#>  4 Arkansas     0.140   1.11  
#>  5 California  -2.50   -1.53  
#>  6 Colorado    -1.50   -0.978 
#>  7 Connecticut  1.34   -1.08  
#>  8 Delaware    -0.0472 -0.322 
#>  9 Florida     -2.98    0.0388
#> 10 Georgia     -1.62    1.27  
#> # ... with 40 more rows

Created on 2022-01-11 by the reprex package (v2.0.1)