sklearn.LabelEncoder with never seen before values
LabelEncoder is basically a dictionary. You can extract and use it for future encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(X)
le_dict = dict(zip(le.classes_, le.transform(le.classes_)))
Retrieve the label for a single new item; if the item was not seen during fitting, fall back to an unknown value:
le_dict.get(new_item, '<Unknown>')
Retrieve labels for a Dataframe column:
df[your_col] = df[your_col].apply(lambda x: le_dict.get(x, <unknown_value>))
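For illustration, here is a minimal end-to-end sketch of the dictionary approach. The toy data, the column names and the `UNKNOWN_CODE` sentinel of -1 are my own assumptions, not part of the answer above; an integer sentinel is used so that the encoded column stays numeric.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sentinel for values never seen during fit (an assumption for this sketch).
UNKNOWN_CODE = -1

# Toy training values, invented for illustration.
train_values = ["cat", "dog", "fish", "dog"]

le = LabelEncoder()
le.fit(train_values)

# Build a plain dict from label to integer code.
le_dict = {
    str(label): int(code)
    for label, code in zip(le.classes_, le.transform(le.classes_))
}

# New data containing a value ("bird") that the encoder has never seen.
new_df = pd.DataFrame({"animal": ["dog", "bird", "cat"]})
new_df["animal_code"] = new_df["animal"].apply(
    lambda x: le_dict.get(x, UNKNOWN_CODE)
)

print(new_df)
```

Whether -1 is a sensible code depends on the downstream model; tree-based models usually tolerate it, while linear models will treat it as an ordinary number.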
I ended up switching to Pandas' get_dummies due to this problem of unseen data.
- create the dummies on the training data
dummy_train = pd.get_dummies(train)
- create the dummies in the new (unseen data)
dummy_new = pd.get_dummies(new_data)
- re-index the new data to the columns of the training data, filling the missing values with 0
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
In effect, any category that appears only in the new data is dropped and never reaches the classifier. I don't think that causes problems, since the classifier would not know what to do with a category it was never trained on.
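The steps above can be sketched on a toy example. The `color` column and its values are made up for illustration, and `dtype=int` is my own choice so that the dummy columns and the zero-filled columns share one type.

```python
import pandas as pd

# Toy frames invented for this sketch: "purple" appears only in the new data,
# while "green" and "blue" appear only in the training data.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
new_data = pd.DataFrame({"color": ["red", "purple"]})

dummy_train = pd.get_dummies(train, dtype=int)
dummy_new = pd.get_dummies(new_data, dtype=int)

# reindex returns a new DataFrame, so the result has to be assigned.
# Columns missing from the new data are filled with 0; columns that exist
# only in the new data (color_purple) are dropped.
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

print(dummy_new)
```

After the reindex, a row whose category was unseen (here "purple") is all zeros across the `color_*` columns, which is how the classifier ends up seeing it.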
I have created a class to support this. If a new label comes in, it is assigned to the unknown class.
from sklearn.preprocessing import LabelEncoder
import numpy as np
class LabelEncoderExt(object):
    def __init__(self):
        """
        Differs from LabelEncoder by handling unseen classes: they are mapped to an extra 'Unknown' class.
        'Unknown' is added in fit, and transform assigns its id to any item not seen during fitting.
        """
        self.label_encoder = LabelEncoder()

    def fit(self, data_list):
        """
        Fit the encoder on all unique values and add the 'Unknown' class.
        :param data_list: a list of strings
        :return: self
        """
        self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_
        return self

    def transform(self, data_list):
        """
        Transform data_list into a list of ids; values not seen during fitting are assigned to the 'Unknown' class.
        :param data_list: a list of strings
        :return: array of encoded ids
        """
        new_data_list = list(data_list)
        for unique_item in np.unique(data_list):
            if unique_item not in self.label_encoder.classes_:
                new_data_list = ['Unknown' if x == unique_item else x for x in new_data_list]
        return self.label_encoder.transform(new_data_list)
Sample usage:
country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina', 'US']
label_encoder = LabelEncoderExt()
label_encoder.fit(country_list)
print(label_encoder.classes_) # you can see new class called Unknown
print(label_encoder.transform(country_list))
new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
print(label_encoder.transform(new_country_list))
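As a side note, the replacement step in `transform` rebuilds the whole list once per unseen value. A single-pass variant might look like the sketch below. The helper name `transform_with_unknown` and its parameters are hypothetical, and it assumes the encoder was fitted with the 'Unknown' label included, as the class above does.

```python
from sklearn.preprocessing import LabelEncoder

def transform_with_unknown(encoder, values, unknown_label="Unknown"):
    """Encode values in one pass, mapping unseen items to unknown_label.

    Hypothetical helper: assumes `encoder` is a fitted LabelEncoder whose
    classes_ already contain `unknown_label`.
    """
    known = set(encoder.classes_)  # set lookup instead of repeated list scans
    cleaned = [v if v in known else unknown_label for v in values]
    return encoder.transform(cleaned)

# Fit with the extra 'Unknown' class, mirroring LabelEncoderExt.fit.
encoder = LabelEncoder().fit(["Canada", "France", "US"] + ["Unknown"])

codes = transform_with_unknown(encoder, ["US", "India", "Canada"])
print(codes)
```

Note that `classes_` is sorted, so 'Unknown' does not necessarily get the highest id; here it happens to sort after 'US'.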