Passing categorical data to Sklearn Decision Tree
(This is just a reformat of my comment above from 2016...it still holds true.)
The accepted answer for this question is misleading.
As it stands, sklearn decision trees do not handle categorical data - see issue #5442.
The recommended approach of using Label Encoding converts the categories to integers, which DecisionTreeClassifier() will treat as numeric. If your categorical data is not ordinal, this is not good: the tree splits on thresholds over an arbitrary ordering, so you'll end up with splits that do not make sense.
Using a OneHotEncoder is the only current valid way, allowing arbitrary splits not dependent on the label ordering, but it is computationally expensive.
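To make the ordering problem concrete, here is a minimal sketch on made-up data (the colour column and target are hypothetical, not from the question). It uses OrdinalEncoder as a stand-in for label encoding of a feature and limits each tree to a single split, so the effect of the encoding is visible:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the target is 1 only for the "green" category.
colors = np.array([["blue"], ["green"], ["red"]] * 4)
target = (colors.ravel() == "green").astype(int)

# Integer codes follow sorted order: blue=0, green=1, red=2,
# so "green" ends up in the middle of an arbitrary numeric scale.
X_ordinal = OrdinalEncoder().fit_transform(colors)

# One 0/1 column per category (converted from sparse to a dense array).
X_onehot = OneHotEncoder().fit_transform(colors).toarray()

# max_depth=1 allows exactly one threshold split per tree.
stump_ordinal = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_ordinal, target)
stump_onehot = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_onehot, target)

acc_ordinal = stump_ordinal.score(X_ordinal, target)
acc_onehot = stump_onehot.score(X_onehot, target)
print(acc_ordinal, acc_onehot)
```

With integer codes, one threshold cannot separate the middle category from the two on either side of it, so the single-split tree misclassifies part of the training data. With one-hot columns, a single split on the "green" indicator is enough. A deeper tree can work around the ordering, but only by spending extra splits on it.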
(..)
Able to handle both numerical and categorical data.
This only means that you can use:
- the DecisionTreeClassifier class for classification problems
- the DecisionTreeRegressor class for regression problems.
In any case, you need to one-hot encode categorical variables before you fit a tree with sklearn, like so:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: A and B are categorical (strings), C is already numeric
data = pd.DataFrame()
data['A'] = ['a', 'a', 'b', 'a']
data['B'] = ['b', 'b', 'a', 'b']
data['C'] = [0, 0, 1, 0]
data['Class'] = ['n', 'n', 'y', 'n']

tree = DecisionTreeClassifier()

# One-hot encode the string columns; drop_first=True drops one redundant dummy per variable
one_hot_data = pd.get_dummies(data[['A', 'B', 'C']], drop_first=True)
tree.fit(one_hot_data, data['Class'])
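One practical caveat with pd.get_dummies is that the dummy columns depend on whichever values happen to be present, so data seen at prediction time can produce different columns than the training data. A possible alternative, sketched here on the same toy data, is to put sklearn's OneHotEncoder inside a pipeline so the encoding learned during fit is reused in predict. The new_rows frame and the unseen value 'z' are made up for illustration, and handle_unknown="ignore" is one possible choice for handling unseen categories, not the only one:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "A": ["a", "a", "b", "a"],
    "B": ["b", "b", "a", "b"],
    "C": [0, 0, 1, 0],
    "Class": ["n", "n", "y", "n"],
})

# One-hot encode A and B; pass the numeric column C through unchanged.
# handle_unknown="ignore" encodes categories not seen during fit as all zeros.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["A", "B"])],
    remainder="passthrough",
)

model = make_pipeline(encoder, DecisionTreeClassifier(random_state=0))
model.fit(data[["A", "B", "C"]], data["Class"])

# Hypothetical new rows; 'z' never appeared in column A during training.
new_rows = pd.DataFrame({"A": ["b", "z"], "B": ["a", "b"], "C": [1, 0]})
predictions = model.predict(new_rows)
print(predictions)
```

The first new row matches the single 'y' training row, so it is predicted as 'y'. The prediction for the row with the unseen category depends on which feature the tree happened to split on, so it should not be relied on.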