How to correctly select dependent variable and controls in lm() in R?
What you are trying to do is to create a wrapper function for regression. Ideally you would program this so that the function can take in an arbitrary dataset and a specified response variable, control variable and other model terms, and then produce all the models of interest.
Usually we want the user to enter variable names without quotes, and we extract the name of this input as an "unevaluated expression" using deparse(substitute(...))
. For the other model terms in the regression, since there can be arbitrarily many of them, a reasonable input would be a list of character vectors. When you create a wrapper function it will generally construct an "internal call" for the models of interest, which may differ from the standard call for these models. Consequently, you will usually also want to amend the call for the models so that they look like the standard call. So, using this syntax, you would write something like this:
myreg <- function(response, control, other.terms = NULL, data) {
#Get variable names and other terms
DATA.NAME <- deparse(substitute(data))
RESPONSE.NAME <- deparse(substitute(response))
CONTROL.NAME <- deparse(substitute(control))
if (is.null(other.terms)) {
OTHER <- vector(mode = 'list', length = 1) } else {
OTHER <- other.terms }
#Set formula and model objects
m <- length(OTHER)
FORMULAE <- vector(mode = 'list', length = m)
MODELS <- vector(mode = 'list', length = m)
names(MODELS) <- sprintf('MODEL%s', 1:m)
FORM <- paste(RESPONSE.NAME, '~', CONTROL.NAME)
#Fit models
for (i in 1:m) {
#Set the formula
if (is.null(OTHER[[i]])) {
FORMULAE[[i]] <- FORM } else {
FORMULAE[[i]] <- paste(FORM, '+', paste(OTHER[[i]], collapse = '+')) }
#Fit the model and substitute the call
MODELS[[i]] <- lm(formula(FORMULAE[[i]]), data = data)
CALL <- paste0('lm(formula = ', FORMULAE[[i]], ', data = ', DATA.NAME, ')')
MODELS[[i]]$call <- parse(text = CALL)[[1]] }
#Return the models
MODELS }
You can then use the function to produce a list of multiple regression models with the specified variables. Here is an example where you produce three different models, each with the same response and control variable, but with different additional terms in the models:
(MODELS <- myreg(response = hp,
control = cyl,
other.terms = list('mpg', 'wt', c('mpg', 'wt')),
data = mtcars))
$MODEL1
Call:
lm(formula = hp ~ cyl + mpg, data = mtcars)
Coefficients:
(Intercept) cyl mpg
54.067 23.979 -2.775
$MODEL2
Call:
lm(formula = hp ~ cyl + wt, data = mtcars)
Coefficients:
(Intercept) cyl wt
-51.81 31.39 1.33
$MODEL3
Call:
lm(formula = hp ~ cyl + mpg + wt, data = mtcars)
Coefficients:
(Intercept) cyl mpg wt
115.66 25.03 -4.22 -12.13
You should work with string substitution. The snippet below provides a simple overview of how you could adjust your function. It would also be good practice to pass the dataset df
as an additional parameter in your function.
df <- mtcars
## these would be function inputs
dv <- "mpg"
control <- "cyl"
## this would form the function body
tmpl <- "dv ~ control" # create a template formula
tmpl.dv <- gsub("dv", dv, tmpl) # insert the dv
tmpl.dv.control <- gsub("control", control, tmpl.dv) # insert the control
form <- as.formula(tmpl.dv.control) # create the formula to use in lm
## fit the model
mod <- lm(form, data = df)