Scikit-Learn - How to get non-normalized importance score from RandomForestRegressor

Solution 1:

The importance is always normalized to sum up to 1, even when you go down to each decision tree regressor, for example:

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; needs an older version
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

boston = load_boston()

X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rf = RandomForestRegressor(n_estimators=100)
rf.fit(X_train, y_train)

You can see within each tree, the importance values are already normalized:

[x.feature_importances_.sum() for x in rf.estimators_]

[1.0,
 0.9999999999999999,
 1.0,
 1.0,
 0.9999999999999999,
 1.0,
 0.9999999999999998,
 ...]

It is not trivial to pull out the per-tree decrease in MSE and recalculate the importances yourself. One alternative is to use permutation_importance with R² or RMSE as the scoring metric to estimate how important each feature is; these values are not normalized:

from sklearn.inspection import permutation_importance

importance_r2 = permutation_importance(rf, X_test, y_test, scoring="r2")
importance_rmse = permutation_importance(rf, X_test, y_test,
                                         scoring="neg_root_mean_squared_error")

And plotting the results:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 6))
ax1.boxplot(importance_r2.importances.T, vert=False, labels=X_test.columns)
ax1.set_xlabel('Decrease in R2')
ax2.boxplot(importance_rmse.importances.T, vert=False, labels=X_test.columns)
ax2.set_xlabel('Decrease in RMSE')
fig.tight_layout()
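If you do want the raw (un-normalized) impurity decrease rather than a permutation estimate, it can be recovered from each tree's internals. This is only a sketch: it relies on scikit-learn's Tree attributes (children_left, children_right, feature, impurity, weighted_n_node_samples), and uses load_diabetes purely as a stand-in dataset:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

def unnormalized_importances(estimator, n_features):
    """Per-feature weighted impurity decrease, averaged per sample
    but NOT normalized to sum to 1."""
    t = estimator.tree_
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no impurity decrease
            continue
        # weighted impurity decrease contributed by this split
        imp[t.feature[node]] += (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
    return imp / t.weighted_n_node_samples[0]

raw = np.mean(
    [unnormalized_importances(est, X.shape[1]) for est in rf.estimators_],
    axis=0,
)
print(raw)             # on the scale of the target's variance, does not sum to 1
print(raw / raw.sum()) # re-normalizing makes the values sum to 1 again
```

Note that these values are on the scale of the squared target, so they are comparable across features within one model but not across models with differently scaled targets.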
