How to get feature importance in xgboost?
I'm using xgboost to build a model and trying to find the importance of each feature using get_fscore(), but it returns {}. My training code is:
import xgboost as xgb

dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
bst = xgb.train(param, dtrain, num_round, watchlist)
So is there any mistake in my training code? How do I get the feature importance in xgboost?
In your code, you can get the importance of each feature as a dict:
bst.get_score(importance_type='gain')
>>{'ftr_col1': 77.21064539577829,
'ftr_col2': 10.28690566363971,
'ftr_col3': 24.225014841466294,
'ftr_col4': 11.234086283060112}
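As an aside on the get_fscore() call from the question: it is effectively shorthand for get_score(importance_type='weight'), so the following sketch (assuming the bst trained above) returns the same mapping either way:

# Assumes `bst` is the trained Booster from the question.
# get_fscore() is shorthand for get_score(importance_type='weight'),
# so both calls return the same {feature: split count} dict.
assert bst.get_fscore() == bst.get_score(importance_type='weight')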
Explanation: The train() API's method get_score() is defined as:
get_score(fmap='', importance_type='weight')
- fmap (str (optional)) – The name of feature map file.
- importance_type (str, default 'weight') – one of:
- ‘weight’ - the number of times a feature is used to split the data across all trees.
- ‘gain’ - the average gain across all splits the feature is used in.
- ‘cover’ - the average coverage across all splits the feature is used in.
- ‘total_gain’ - the total gain across all splits the feature is used in.
- ‘total_cover’ - the total coverage across all splits the feature is used in.
https://xgboost.readthedocs.io/en/latest/python/python_api.html
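To see how the five importance types differ on the same trees, here is a minimal sketch on synthetic data (the feature names, data, and hyperparameters are illustrative, not taken from the question):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X, label=y,
                     feature_names=['ftr_col1', 'ftr_col2', 'ftr_col3', 'ftr_col4'])
bst = xgb.train({'max_depth': 3, 'objective': 'binary:logistic'}, dtrain,
                num_boost_round=20)

# Score the same model five different ways
for imp in ('weight', 'gain', 'cover', 'total_gain', 'total_cover'):
    print(imp, bst.get_score(importance_type=imp))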
Get the table containing scores and feature names, and then plot it.
import pandas as pd
import matplotlib.pyplot as plt

feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize=(20, 10))  # plot top 40 features
plt.show()
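Note that the keys of this dict are whatever feature names the booster was trained with: if the model was fit on a bare numpy array they default to f0, f1, ..., so pass feature_names to DMatrix, or train on a pandas.DataFrame as in the next answer, to get readable labels.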
Using the sklearn API with XGBoost >= 0.81:
clf.get_booster().get_score(importance_type="gain")
or
regr.get_booster().get_score(importance_type="gain")
For this to work correctly, when you call regr.fit (or clf.fit), X must be a pandas.DataFrame.
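For example, a minimal sketch with the classifier (the column names here are invented for illustration): because X is a DataFrame, the scores come back keyed by column name rather than f0, f1, f2:

import numpy as np
import pandas as pd
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((200, 3)), columns=['age', 'income', 'score'])
y = (X['income'] > 0.5).astype(int)

clf = XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)

# Keys are 'age', 'income', 'score' because X was a DataFrame
print(clf.get_booster().get_score(importance_type='gain'))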