Scikit learn GridSearchCV with pipeline with custom transformer

To debug searches in general, set error_score='raise', so that you get a full error traceback.
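As a minimal sketch of what that looks like (the toy data and the deliberately invalid `max_depth=0` candidate are my own, not from your code), the search below re-raises the underlying exception instead of quietly recording a NaN score for the failed fit:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy data, assumed purely for illustration.
X, y = make_classification(n_samples=60, n_features=5, random_state=0)

# max_depth=0 is invalid, so fitting that candidate fails on purpose.
param_grid = {"max_depth": [0, 2]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=3,
    error_score="raise",  # re-raise the failure instead of recording NaN
)

caught = None
try:
    search.fit(X, y)
except ValueError as exc:
    # Caught here only so the script keeps running; normally you would
    # let it propagate and read the traceback.
    caught = exc

print(type(caught).__name__, caught)
```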

Your issue appears to be data-dependent; I can run this just fine on a custom dataset. That suggests to me that the comment by @Sanjar Adylov not only highlights an important issue, but the issue for your data: the train folds sometimes contain different values in some categorical feature(s) than the test folds, and so the one-hot encodings end up with different numbers of features, and the linear model justifiably breaks.
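To make the failure mode concrete, here is a small sketch with made-up category values (not your data). Encoding each fold with its own freshly fitted encoder gives different widths, while fitting once on the training fold and reusing that encoder keeps the width fixed:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical folds: the training fold has a category the test fold lacks.
train = np.array([["red"], ["blue"], ["green"]], dtype=object)
test = np.array([["red"], ["blue"]], dtype=object)

# Refitting on each fold: the number of output columns depends on the fold.
n_train_cols = OneHotEncoder().fit_transform(train).shape[1]
n_test_cols = OneHotEncoder().fit_transform(test).shape[1]

# Fitting once on the training fold and reusing the encoder: same width.
encoder = OneHotEncoder().fit(train)
n_reused_cols = encoder.transform(test).shape[1]

print(n_train_cols, n_test_cols, n_reused_cols)
```

Note that the reverse situation (a category that only shows up in the test fold) still raises with the default settings; `OneHotEncoder(handle_unknown="ignore")` is one way to deal with that.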

So the fix there is also as Sanjar says: instantiate the two transformers, store them as attributes, and fit them in your fit method; then use their transform methods in your transform method.

You will find there is another big issue: all the scores in cv_results_ are the same. This is because you can't actually set the hyperparameters correctly, because in __init__ you've used mismatching names (degree as the parameter but degree_ as the attribute). Read more in the developer guide. (I think you can get around this by editing set_params similar to how you edited get_params, but it would be much easier to actually rely on the BaseEstimator versions of those and just match the parameter names to the attribute names.)
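Here is a stripped-down sketch of that problem. The two classes are hypothetical stand-ins, not your transformer: `Mismatched` stores the parameter under `degree_` and hand-writes `get_params`, while `Matched` just relies on `BaseEstimator`. After `set_params`, only the second one actually sees the new value:

```python
from sklearn.base import BaseEstimator


class Mismatched(BaseEstimator):
    # Hypothetical reconstruction: parameter `degree`, attribute `degree_`.
    def __init__(self, degree=2):
        self.degree_ = degree

    def get_params(self, deep=True):
        # Hand-written override so that the parameter lookup does not fail.
        return {"degree": self.degree_}


class Matched(BaseEstimator):
    # Attribute name matches the parameter name, so the inherited
    # get_params/set_params work without any overrides.
    def __init__(self, degree=2):
        self.degree = degree


# This is essentially what the search does for each candidate.
broken = Mismatched().set_params(degree=5)
fixed = Matched().set_params(degree=5)

print(broken.degree_)  # 2: set_params wrote to `degree`, not `degree_`
print(fixed.degree)    # 5
```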

Also, note that setting a parameter default to a list can have surprising effects. Consider alternatives to the default of poly_features in __init__.
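The surprise is plain Python behavior rather than anything specific to scikit-learn: the default list is created once and shared by every instance that uses it. A quick sketch with two throwaway classes (hypothetical names) shows the problem and one possible alternative, an immutable tuple default:

```python
class WithListDefault:
    def __init__(self, poly_features=["year", "odometer"]):
        self.poly_features = poly_features


class WithTupleDefault:
    def __init__(self, poly_features=("year", "odometer")):
        self.poly_features = poly_features


a = WithListDefault()
b = WithListDefault()
a.poly_features.append("price")  # mutates the one shared default list

print(b.poly_features)  # b sees the change made through a

c = WithTupleDefault()
# Tuples cannot be mutated in place; convert to a list where indexing needs one.
print(list(c.poly_features))
```

Another common choice is a default of `None` that gets resolved to the actual column list inside fit.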

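import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures


class custom_poly_features(TransformerMixin, BaseEstimator):
    def __init__(self, degree=2, poly_features=['year', 'odometer']):
        # Attribute names match the parameter names, so the inherited
        # get_params/set_params work during the search.
        self.degree = degree
        self.poly_features = poly_features

    def fit(self, X, y=None):
        # Create and fit the sub-transformers on the training data only.
        self.poly_feat = PolynomialFeatures(degree=self.degree)
        # On scikit-learn < 1.2, use sparse=False instead.
        self.onehot = OneHotEncoder(sparse_output=False)

        # Every column not listed in poly_features gets one-hot encoded.
        self.not_poly_features_ = list(set(X.columns) - set(self.poly_features))

        self.poly_feat.fit(X[self.poly_features])
        self.onehot.fit(X[self.not_poly_features_])

        return self

    def transform(self, X, y=None):
        # Reuse the fitted transformers; never refit here.
        poly = self.poly_feat.transform(X[self.poly_features])
        onehot = self.onehot.transform(X[self.not_poly_features_])
        return np.hstack([poly, onehot])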

There are some additional things you might want to add, like checks for whether poly_features or not_poly_features_ is empty (which would break the corresponding transformer).
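One way such checks could look, as a rough sketch: `split_columns` is a hypothetical helper of my own (not part of scikit-learn) that fit could call before creating the two transformers.

```python
import pandas as pd


def split_columns(X, poly_features):
    """Hypothetical helper: return (poly_cols, other_cols), raising if either is empty."""
    poly_cols = list(poly_features)
    missing = [c for c in poly_cols if c not in X.columns]
    if missing:
        raise ValueError(f"poly_features not found in X: {missing}")
    other_cols = [c for c in X.columns if c not in poly_cols]
    if not poly_cols:
        raise ValueError("poly_features is empty; nothing to expand")
    if not other_cols:
        raise ValueError("no columns left to one-hot encode")
    return poly_cols, other_cols


# Made-up frame for illustration.
df = pd.DataFrame({"year": [2010, 2015], "odometer": [9e4, 4e4], "fuel": ["gas", "diesel"]})
poly_cols, other_cols = split_columns(df, ["year", "odometer"])
print(poly_cols, other_cols)
```

Whether to raise or to silently skip the empty group is a design choice; raising is the more conservative option.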

Finally, your custom estimator is just doing what a ColumnTransformer is meant to do. I think the only reason to prefer yours is if you need to search over which columns get which treatment; I don't think that's easy to do with a ColumnTransformer.

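from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

custom_poly = ColumnTransformer(
    transformers=[('poly', PolynomialFeatures(), ['year', 'odometer'])],
    remainder=OneHotEncoder(),
)

# 'cpf' is the pipeline step holding custom_poly; 'poly' is the transformer name above.
param_grid = {"cpf__poly__degree": [3, 4, 5]}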
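For completeness, here is how that could be wired into a search end to end. This is only a sketch under several assumptions of mine: synthetic data, a single made-up categorical column `fuel`, `LinearRegression` as the final step, smaller degrees than above to keep the toy problem tame, and `handle_unknown="ignore"` so that a category missing from a training fold does not raise at transform time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

# Synthetic stand-in for the real data (assumed for illustration).
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "year": rng.integers(2000, 2020, size=n),
    "odometer": rng.uniform(1e4, 2e5, size=n),
    "fuel": rng.choice(["gas", "diesel", "hybrid"], size=n),
})
y = rng.normal(size=n)

custom_poly = ColumnTransformer(
    transformers=[("poly", PolynomialFeatures(), ["year", "odometer"])],
    # All remaining columns are one-hot encoded.
    remainder=OneHotEncoder(handle_unknown="ignore"),
)

# The step name "cpf" is what makes the "cpf__poly__degree" key resolve.
pipe = Pipeline([("cpf", custom_poly), ("reg", LinearRegression())])

search = GridSearchCV(
    pipe,
    {"cpf__poly__degree": [1, 2, 3]},
    cv=3,
    error_score="raise",
)
search.fit(X, y)

print(search.best_params_)
```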
