What is the difference between Pipeline and make_pipeline in scikit-learn?
I got this from the sklearn webpage:
- Pipeline: Pipeline of transforms with a final estimator.
- make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor.
But I still do not understand when I have to use each one. Can anyone give me an example?
The only difference is that make_pipeline generates names for steps automatically.
Step names are needed e.g. if you want to use a pipeline with model selection utilities (e.g. GridSearchCV). With grid search you need to specify parameters for various steps of a pipeline:
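from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('vec', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = [{'clf__C': [1, 10, 100, 1000]}]  # '<step name>__<parameter>'
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)  # X: text documents, y: their labels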
Compare it with make_pipeline:
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(CountVectorizer(), LogisticRegression())
param_grid = [{'logisticregression__C': [1, 10, 100, 1000]}]  # auto-generated step name
gs = GridSearchCV(pipe, param_grid)
gs.fit(X, y)
So, with Pipeline:
- names are explicit, you don't have to figure them out if you need them;
- the name doesn't change if you change the estimator/transformer used in a step, e.g. if you replace LogisticRegression() with LinearSVC() you can still use clf__C.
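To make the second point concrete, here is a minimal sketch that runs the same grid over two different classifiers without touching the parameter name. The tiny corpus, the labels, the reduced C values and cv=2 are made up for illustration; they are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus, only for illustration.
X = ["good movie", "great film", "nice plot", "loved it",
     "bad movie", "awful film", "boring plot", "hated it"]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# The key refers to the step name 'clf', not to the estimator class.
param_grid = {"clf__C": [1, 10, 100]}

best_params = {}
for clf in (LogisticRegression(), LinearSVC()):
    pipe = Pipeline([("vec", CountVectorizer()), ("clf", clf)])
    gs = GridSearchCV(pipe, param_grid, cv=2)  # cv=2 because the data set is tiny
    gs.fit(X, y)
    best_params[type(clf).__name__] = gs.best_params_

print(best_params)
```

With make_pipeline the same swap would also require renaming the key from logisticregression__C to linearsvc__C.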
With make_pipeline:
- shorter and arguably more readable notation;
- names are auto-generated using a straightforward rule (the lowercased class name of the estimator, e.g. LogisticRegression() becomes logisticregression).
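
If you are unsure which names make_pipeline picked, you can read them off the steps attribute. The sketch below is only an illustration: the choice of StandardScaler is mine, and it is used twice to show that, as far as I know, repeated estimator types get a numeric suffix.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two steps of the same class, to show how name clashes are resolved.
pipe = make_pipeline(StandardScaler(), StandardScaler(), LogisticRegression())

# pipe.steps is a list of (name, estimator) tuples.
step_names = [name for name, _ in pipe.steps]
print(step_names)
# ['standardscaler-1', 'standardscaler-2', 'logisticregression']
```

These are the prefixes to use in a parameter grid, e.g. logisticregression__C.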
When to use them is up to you :) I prefer make_pipeline for quick experiments and Pipeline for more stable code; a rule of thumb: IPython Notebook -> make_pipeline; Python module in a larger project -> Pipeline. But it is certainly not a big deal to use make_pipeline in a module or Pipeline in a short script or a notebook.