How to execute both parallel and serial transformations with sklearn pipeline?
The key point is that ColumnTransformer applies its transformers in parallel to the dataset you pass to it, not one after another. So if you add the transformer that standardizes your numeric data as the second entry in the transformers list, it is not applied to the output of the imputation but to the original dataset.
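To make the parallel behaviour concrete, here is a minimal sketch. It assumes stock scikit-learn transformers (SimpleImputer and StandardScaler) in place of your custom ones, and a made-up one-column frame; the column name `age` is purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical data) with one missing numeric value.
X = pd.DataFrame({"age": [20.0, np.nan, 40.0]})

# Two transformers listed side by side on the same column.
ct = ColumnTransformer([
    ("impute", SimpleImputer(strategy="mean"), ["age"]),
    ("scale", StandardScaler(), ["age"]),
])
out = ct.fit_transform(X)

# Each transformer received the ORIGINAL column, so the result has two
# output columns: an imputed one and a separately scaled one.
print(out.shape)
# The scaled column was never imputed, so the missing value is still there.
print(np.isnan(out[:, 1]).any())
```

The scaler never sees the imputed values here; the two outputs are simply concatenated column-wise.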
One way to solve this is to wrap the transformations for the numeric columns in a Pipeline, whose steps run serially, and pass that Pipeline to the ColumnTransformer as a single transformer:
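from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
# NumericImputation, YourStandardizer and dq are placeholders for your own objects.
preprocessor = ColumnTransformer([
    # Steps inside the nested Pipeline run serially: impute, then standardize.
    ('num_pipe', Pipeline([('numeric_imputation', NumericImputation()),
                           ('standardizer', YourStandardizer())]), dq.numeric_variables),
    ('onehot', OneHotEncoder(handle_unknown="ignore"), dq.categorical_variables)],
    remainder='passthrough')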
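If you want to try the idea without your custom classes, here is a self-contained sketch of the same structure. It is only an illustration under assumptions: SimpleImputer and StandardScaler stand in for your imputer and standardizer, and the toy frame with its `age` and `city` columns is invented.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: a numeric column with a gap and a categorical column.
df = pd.DataFrame({"age": [20.0, np.nan, 40.0], "city": ["a", "b", "a"]})

# Serial part: the scaler receives the output of the imputer.
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Parallel part: numeric and categorical branches run on their own columns.
preprocessor = ColumnTransformer(
    [
        ("num_pipe", num_pipe, ["age"]),
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array, easier to inspect
)

result = preprocessor.fit_transform(df)
print(result.shape)
print(np.isnan(result).any())
```

Since the imputation now happens before scaling inside the numeric branch, the transformed numeric column should no longer contain missing values, while the one-hot columns are appended next to it.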
I would suggest reading the following posts on a similar topic:
- ColumnTransformer & Pipeline with OHE - Is the OHE encoded field retained or removed after ct is performed?
- Pipeline with SimpleImputer and OneHotEncoder - how to do properly?
(you'll find some others linked within them).