How to get feature importance in xgboost?

I'm using xgboost to build a model and trying to find the importance of each feature with get_fscore(), but it returns {}.

My training code is:

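import xgboost as xgb

dtrain = xgb.DMatrix(X, label=Y)
watchlist = [(dtrain, 'train')]
param = {'max_depth': 6, 'learning_rate': 0.03}
num_round = 200
# pass the watchlist by keyword; recent releases no longer accept it positionally
bst = xgb.train(param, dtrain, num_round, evals=watchlist)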

Is there a mistake in my training code? How can I get feature importance in xgboost?


With your code, you can get the importance of each feature as a dict:

bst.get_score(importance_type='gain')

>>{'ftr_col1': 77.21064539577829,
   'ftr_col2': 10.28690566363971,
   'ftr_col3': 24.225014841466294,
   'ftr_col4': 11.234086283060112}

Explanation: xgb.train() returns a Booster, and its get_score() method is defined as:

get_score(fmap='', importance_type='weight')

  • fmap (str (optional)) – The name of feature map file.
  • importance_type
    • ‘weight’ - the number of times a feature is used to split the data across all trees.
    • ‘gain’ - the average gain across all splits the feature is used in.
    • ‘cover’ - the average coverage across all splits the feature is used in.
    • ‘total_gain’ - the total gain across all splits the feature is used in.
    • ‘total_cover’ - the total coverage across all splits the feature is used in.

https://xgboost.readthedocs.io/en/latest/python/python_api.html


Get the table containing scores and feature names, and then plot it.

import pandas as pd

# model is a fitted sklearn-API estimator (e.g. XGBClassifier or XGBRegressor)
feature_important = model.get_booster().get_score(importance_type='weight')
keys = list(feature_important.keys())
values = list(feature_important.values())

data = pd.DataFrame(data=values, index=keys, columns=["score"]).sort_values(by="score", ascending=False)
data.nlargest(40, columns="score").plot(kind='barh', figsize=(20, 10))  # plot top 40 features



Using the sklearn API and XGBoost >= 0.81:

clf.get_booster().get_score(importance_type="gain")

or

regr.get_booster().get_score(importance_type="gain")

For the keys to be your column names, X must be a pandas.DataFrame when you call regr.fit (or clf.fit). If you fit on a NumPy array, the keys are generic names such as f0, f1, and so on.