Variable Importance Dummy Variables R

Solution 1:

According to the caret documentation, variable importance for linear models is the absolute value of the t-statistic of each model parameter. So we can compute it manually, as I do in the code below.

lm() automatically converts categorical variables into dummies, one per non-base level. So, to get the importance of each original covariate, we have to sum over its dummies. I did not find a way to automate this, so if you want to apply my solution to a different set of variables, be careful when choosing which elements of t.stats to sum.

Finally, we can use the results for plotting. I just used the base R barplot() function, but you can customize it as you want (maybe using the ggplot2 package for better visualization).

P.S. When you provide a reproducible example, remember to load all the packages it needs.

P.P.S. Summing over dummies may be sensitive to the base level of the factor (i.e., the level omitted from the regression). I do not know if that could be an issue.
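On the base-level question, here is a minimal sketch that uses the built-in iris data with plain lm() (not the Ames data or the caret model from this answer, so treat it only as an illustration). It refits the same model with a different reference level and compares the summed absolute t-statistics. The helper name sum_abs_t is made up for this example.

```r
# Same model, two different base levels for the factor Species
fit_a <- lm(Sepal.Length ~ Species, data = iris)  # base level: setosa
iris_b <- transform(iris, Species = relevel(Species, ref = "versicolor"))
fit_b <- lm(Sepal.Length ~ Species, data = iris_b)  # base level: versicolor

# Hypothetical helper: sum of |t| over all dummies (row 1, the intercept, is dropped)
sum_abs_t <- function(fit) sum(abs(coef(summary(fit))[-1, "t value"]))

sums <- c(setosa_base = sum_abs_t(fit_a), versicolor_base = sum_abs_t(fit_b))
print(sums)

# For comparison: the F statistic for the factor as a whole
f_a <- anova(fit_a)["Species", "F value"]
f_b <- anova(fit_b)["Species", "F value"]
print(all.equal(f_a, f_b))
```

In this toy case the two sums differ even though the fitted model is the same, so the summed importance does seem to depend on the omitted level, while the F statistic for the whole factor does not.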

library(AmesHousing)
library(caret)
library(dplyr)

df = data.frame(ames_raw)

# convert characters to factors
df = df %>% mutate_if(is.character, as.factor)

# train and split code from caret datacamp
# Get the number of observations
n_obs <- nrow(df)

# Shuffle row indices: permuted_rows (seed fixed so the shuffle is reproducible)
set.seed(42)
permuted_rows <- sample(n_obs)

# Randomly order data: 
df_shuffled <- df[permuted_rows, ]

# Identify row to split on: split
split <- round(n_obs * 0.7)

# Create train
train <- df_shuffled[1:split, ]

# Create test
test <- df_shuffled[(split + 1):n_obs, ]

# Fit on the training split created above
mod_lm <- train(SalePrice ~ Street + Sale.Type,
                data = train,
                method = "lm")

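# Manually computing variable importance from the absolute t-statistics of the model.
t.stats = coef(summary(mod_lm))[, "t value"]
# Element 1 is the intercept, element 2 is the Street dummy, the rest are Sale.Type dummies.
imp.sale = sum(abs(t.stats[-(1:2)]))
imp.street = abs(t.stats[2])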

# Plotting.
barplot(c(imp.sale, imp.street), names.arg = c("Sale", "Street"))
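
The manual indexing of t.stats above can probably be avoided when the model is a plain lm() fit. The sketch below is only an illustration on the built-in iris data, not the Ames model. It uses the "assign" attribute of the design matrix, which records which formula term each dummy column came from. It assumes a model fitted directly with lm(); as far as I can tell, caret's formula interface expands the dummies before calling lm(), so the grouping would not carry over to mod_lm$finalModel unchanged.

```r
# Plain lm() with one numeric predictor and one factor predictor
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Absolute t-statistic of every coefficient (intercept included)
t_abs <- abs(coef(summary(fit))[, "t value"])

# "assign" maps each design-matrix column to its term index (0 = intercept)
mm <- model.matrix(fit)
assign_idx <- attr(mm, "assign")
names(assign_idx) <- colnames(mm)
term_labels <- attr(terms(fit), "term.labels")

# coef(summary()) drops aliased (NA) coefficients, so align by name
assign_idx <- assign_idx[names(t_abs)]

# Sum |t| within each original term, skipping the intercept
keep <- assign_idx > 0
imp <- tapply(t_abs[keep], term_labels[assign_idx[keep]], sum)
print(imp)

barplot(imp)
```

Here imp has one entry per term in the formula, so the Species entry is the sum over its two dummies and no positions need to be picked by hand.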