How to get most informative features for scikit-learn classifiers?
Solution 1:
The classifiers themselves do not record feature names; they just see numeric arrays. However, if you extracted your features using a Vectorizer/CountVectorizer/TfidfVectorizer/DictVectorizer, and you are using a linear model (e.g. LinearSVC or Naive Bayes), then you can apply the same trick that the document classification example uses. Example (untested, may contain a bug or two):
import numpy as np

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    # On recent scikit-learn versions, use vectorizer.get_feature_names_out() instead.
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))
This is for multiclass classification; for the binary case, I think you should use clf.coef_[0] only. You may have to sort the class_labels.
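For reference, here is a minimal way to call it (an untested sketch; the dataset choice, TfidfVectorizer/LinearSVC pipeline, and variable names are illustrative assumptions, not part of the original answer):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Fetch a small text classification dataset (illustrative choice).
data = fetch_20newsgroups(subset='train',
                          categories=['rec.autos', 'sci.med', 'sci.space'])

# Vectorize the raw documents and fit a linear classifier.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data.data)
clf = LinearSVC().fit(X, data.target)

# Print the ten highest-weighted features for each class.
print_top10(vectorizer, clf, data.target_names)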
Solution 2:
With the help of larsmans' code, I came up with this code for the binary case:
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    # Pair each coefficient with its feature name and sort by weight.
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    # Walk the most negative and most positive features in parallel.
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))
Solution 3:
To add an update, RandomForestClassifier now supports the .feature_importances_ attribute. This attribute tells you how much each feature contributes to the model's splits (impurity-based importance); the values are normalized, so the sum over all features is at most 1.
I find this attribute very useful when performing feature engineering.
Thanks to the scikit-learn team and contributors for implementing this!
edit: This works for both RandomForest and GradientBoosting, so RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier and GradientBoostingRegressor all support this.
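For example (a minimal sketch on the iris data; the dataset and model settings are just for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(iris.data, iris.target)

# One importance value per feature, normalized across all features.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print("%-25s %.3f" % (name, importance))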
Solution 4:
We've recently released a library (https://github.com/TeamHG-Memex/eli5) which allows you to do that: it handles various classifiers from scikit-learn, binary / multiclass cases, allows highlighting text according to feature values, integrates with IPython, etc.
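A rough sketch of what that looks like in a notebook (the argument names are as I recall them, so treat this as an assumption and check the eli5 docs; it also reuses the fitted clf and vectorizer from the earlier examples):

import eli5

# Display the top-weighted features of a fitted linear classifier,
# mapping feature indices back to names through the vectorizer.
eli5.show_weights(clf, vec=vectorizer, top=20)

# Explain a single prediction, highlighting the contributing terms.
eli5.show_prediction(clf, 'the document text to explain', vec=vectorizer)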