fit_transform() takes 2 positional arguments but 3 were given with LabelBinarizer

The Problem:

The pipeline is assuming LabelBinarizer's fit_transform method is defined to take three positional arguments:

def fit_transform(self, x, y)
    ...rest of the code

while it is defined to take only two:

def fit_transform(self, x):
    ...rest of the code

Possible Solution:

This can be solved by making a custom transformer that can handle 3 positional arguments:

  1. Import and make a new class:

    from sklearn.base import TransformerMixin #gives fit_transform method for free
    class MyLabelBinarizer(TransformerMixin):
        def __init__(self, *args, **kwargs):
            self.encoder = LabelBinarizer(*args, **kwargs)
        def fit(self, x, y=0):
            self.encoder.fit(x)
            return self
        def transform(self, x, y=0):
            return self.encoder.transform(x)
    
  2. Keep your code the same only instead of using LabelBinarizer(), use the class we created : MyLabelBinarizer().


Note: If you want access to LabelBinarizer Attributes (e.g. classes_), add the following line to the fit method:
    self.classes_, self.y_type_, self.sparse_input_ = self.encoder.classes_, self.encoder.y_type_, self.encoder.sparse_input_

I believe your example is from the book Hands-On Machine Learning with Scikit-Learn & TensorFlow. Unfortunately, I ran into this problem, as well. A recent change in scikit-learn (0.19.0) changed LabelBinarizer's fit_transform method. Unfortunately, LabelBinarizer was never intended to work how that example uses it. You can see information about the change here and here.

Until they come up with a solution for this, you can install the previous version (0.18.0) as follows:

$ pip install scikit-learn==0.18.0

After running that, your code should run without issue.

In the future, it looks like the correct solution may be to use a CategoricalEncoder class or something similar to that. They have been trying to solve this problem for years apparently. You can see the new class here and further discussion of the problem here.


I think you are going through the examples from the book: Hands on Machine Learning with Scikit Learn and Tensorflow. I ran into the same problem when going through the example in Chapter 2.

As mentioned by other people, the problem is to do with sklearn's LabelBinarizer. It takes less args in its fit_transform method compared to other transformers in the pipeline. (only y when other transformers normally take both X and y, see here for details). That's why when we run pipeline.fit_transform, we fed more args into this transformer than required.

An easy fix I used is to just use OneHotEncoder and set the "sparse" to False to ensure the output is a numpy array same as the num_pipeline output. (this way you don't need to code up your own custom encoder)

your original cat_pipeline:

cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('label_binarizer', LabelBinarizer())
])

you can simply change this part to:

cat_pipeline = Pipeline([
('selector', DataFrameSelector(cat_attribs)),
('one_hot_encoder', OneHotEncoder(sparse=False))
])

You can go from here and everything should work.


Since LabelBinarizer doesn't allow more than 2 positional arguments you should create your custom binarizer like

class CustomLabelBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, sparse_output=False):
        self.sparse_output = sparse_output
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        enc = LabelBinarizer(sparse_output=self.sparse_output)
        return enc.fit_transform(X)

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy='median')),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scalar', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CustomLabelBinarizer())
])

full_pipeline = FeatureUnion(transformer_list=[
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])

housing_prepared = full_pipeline.fit_transform(new_housing)

I ran into the same problem and got it working by applying the workaround specified in the book's Github repo.

Warning: earlier versions of the book used the LabelBinarizer class at this point. Again, this was incorrect: just like the LabelEncoder class, the LabelBinarizer class was designed to preprocess labels, not input features. A better solution is to use Scikit-Learn's upcoming CategoricalEncoder class: it will soon be added to Scikit-Learn, and in the meantime you can use the code below (copied from Pull Request #9151).

To save you some grepping here's the workaround, just paste and run it in a previous cell:

# Definition of the CategoricalEncoder class, copied from PR #9151.
# Just run this cell, or copy it to your code, do not try to understand it (yet).

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array
from sklearn.preprocessing import LabelEncoder
from scipy import sparse

class CategoricalEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, encoding='onehot', categories='auto', dtype=np.float64,
                 handle_unknown='error'):
        self.encoding = encoding
        self.categories = categories
        self.dtype = dtype
        self.handle_unknown = handle_unknown

    def fit(self, X, y=None):
        """Fit the CategoricalEncoder to X.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_feature]
            The data to determine the categories of each feature.
        Returns
        -------
        self
        """

        if self.encoding not in ['onehot', 'onehot-dense', 'ordinal']:
            template = ("encoding should be either 'onehot', 'onehot-dense' "
                        "or 'ordinal', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.handle_unknown not in ['error', 'ignore']:
            template = ("handle_unknown should be either 'error' or "
                        "'ignore', got %s")
            raise ValueError(template % self.handle_unknown)

        if self.encoding == 'ordinal' and self.handle_unknown == 'ignore':
            raise ValueError("handle_unknown='ignore' is not supported for"
                             " encoding='ordinal'")

        X = check_array(X, dtype=np.object, accept_sparse='csc', copy=True)
        n_samples, n_features = X.shape

        self._label_encoders_ = [LabelEncoder() for _ in range(n_features)]

        for i in range(n_features):
            le = self._label_encoders_[i]
            Xi = X[:, i]
            if self.categories == 'auto':
                le.fit(Xi)
            else:
                valid_mask = np.in1d(Xi, self.categories[i])
                if not np.all(valid_mask):
                    if self.handle_unknown == 'error':
                        diff = np.unique(Xi[~valid_mask])
                        msg = ("Found unknown categories {0} in column {1}"
                               " during fit".format(diff, i))
                        raise ValueError(msg)
                le.classes_ = np.array(np.sort(self.categories[i]))

        self.categories_ = [le.classes_ for le in self._label_encoders_]

        return self

    def transform(self, X):
        """Transform X using one-hot encoding.
        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to encode.
        Returns
        -------
        X_out : sparse matrix or a 2-d array
            Transformed input.
        """
        X = check_array(X, accept_sparse='csc', dtype=np.object, copy=True)
        n_samples, n_features = X.shape
        X_int = np.zeros_like(X, dtype=np.int)
        X_mask = np.ones_like(X, dtype=np.bool)

        for i in range(n_features):
            valid_mask = np.in1d(X[:, i], self.categories_[i])

            if not np.all(valid_mask):
                if self.handle_unknown == 'error':
                    diff = np.unique(X[~valid_mask, i])
                    msg = ("Found unknown categories {0} in column {1}"
                           " during transform".format(diff, i))
                    raise ValueError(msg)
                else:
                    # Set the problematic rows to an acceptable value and
                    # continue `The rows are marked `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    X[:, i][~valid_mask] = self.categories_[i][0]
            X_int[:, i] = self._label_encoders_[i].transform(X[:, i])

        if self.encoding == 'ordinal':
            return X_int.astype(self.dtype, copy=False)

        mask = X_mask.ravel()
        n_values = [cats.shape[0] for cats in self.categories_]
        n_values = np.array([0] + n_values)
        indices = np.cumsum(n_values)

        column_indices = (X_int + indices[:-1]).ravel()[mask]
        row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                                n_features)[mask]
        data = np.ones(n_samples * n_features)[mask]

        out = sparse.csc_matrix((data, (row_indices, column_indices)),
                                shape=(n_samples, indices[-1]),
                                dtype=self.dtype).tocsr()
        if self.encoding == 'onehot-dense':
            return out.toarray()
        else:
            return out