Automatically create formulas for all possible linear models

Say I have a training set in a data frame train with columns ColA, ColB, ColC, etc. One of these columns designates a binary class, say column Class, with "yes" or "no" values.

I'm trying out some binary classifiers, e.g.:

library(klaR)
mynb <- NaiveBayes(Class ~ ColA + ColB + ColC, train)

I would like to run the above code in a loop, automatically generating all possible combinations of columns in the formula, i.e.:

mynb <- append(mynb, NaiveBayes(Class ~ ColA, train))
mynb <- append(mynb, NaiveBayes(Class ~ ColA + ColB, train))
mynb <- append(mynb, NaiveBayes(Class ~ ColA + ColB + ColC, train))
...
mynb <- append(mynb, NaiveBayes(Class ~ ColB + ColC + ColD, train))
...

How can I automatically generate formulas for each possible linear model involving columns of a data frame?


Solution 1:

Say we work with this ridiculous example:

DF <- data.frame(Class=1:10,A=1:10,B=1:10,C=1:10)

Then you get the names of the columns

Cols <- names(DF)
Cols <- Cols[! Cols %in% "Class"]
n <- length(Cols)

You construct all possible combinations

id <- unlist(
        lapply(1:n,
              function(i)combn(1:n,i,simplify=FALSE)
        )
      ,recursive=FALSE)

You paste them to formulas

Formulas <- sapply(id,function(i)
              paste("Class~",paste(Cols[i],collapse="+"))
            )

And you loop over them to apply the models.

lapply(Formulas,function(i)
    lm(as.formula(i),data=DF))
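As a variation on the steps above, here is a minimal sketch that builds the formula objects directly with reformulate() instead of pasting strings, and names each fitted model after its predictors so results are easier to look up. The names subsets and Models are illustrative and not part of the original answer.

```r
# Same toy data as above
DF <- data.frame(Class = 1:10, A = 1:10, B = 1:10, C = 1:10)
Cols <- setdiff(names(DF), "Class")

# All non-empty subsets of the predictor names
subsets <- unlist(
  lapply(seq_along(Cols), function(k) combn(Cols, k, simplify = FALSE)),
  recursive = FALSE
)

# reformulate() returns a formula object, so no as.formula() is needed
Formulas <- lapply(subsets, reformulate, response = "Class")
names(Formulas) <- vapply(subsets, paste, character(1), collapse = "+")

# Fit one model per formula; the list keeps the formula names
Models <- lapply(Formulas, function(f) lm(f, data = DF))
```

Any fitting function with a formula interface should slot in where lm() is used; for the question's case that would presumably be function(f) NaiveBayes(f, data = train), assuming klaR is loaded and Class is a factor.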

Be warned though: if you have more than a handful of columns, this will quickly become very heavy on memory and result in thousands of models. With n predictor columns, you have 2^n - 1 different models.
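To get a feel for that growth, a small sketch that tabulates 2^n - 1 for a few values of n and checks it against summing the number of subsets of each size. The helper count_by_enumeration is a hypothetical name used only for this illustration.

```r
n <- c(3, 5, 10, 15, 20)

# Closed form: every predictor is either in or out, minus the empty model
closed_form <- 2^n - 1

# Same count obtained by adding up the subsets of size 1, 2, ..., n
count_by_enumeration <- function(n) sum(choose(n, seq_len(n)))
enumerated <- vapply(n, count_by_enumeration, numeric(1))

growth <- data.frame(n = n, models = closed_form, check = enumerated)
print(growth)
```

With 20 predictors the table already shows over a million models, so it is worth computing this number before starting the loop.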

Make very sure that is what you want; in general, this kind of model comparison is strongly advised against. Forget about any kind of inference as well when you do this.

Solution 2:

Here is an excellent blog post by Mark Heckman, detailing how to construct all possible regression models, given a set of explanatory variables and a response variable. However, as pointed out by Joris, I would strictly caution against using such an approach since (a) the number of regressions increases exponentially and (b) statistical experts don't recommend data fishing of this kind, as it is fraught with all kinds of risks.

Solution 3:

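vars <- c('a', 'b', 'c', 'd')
library(gtools)  # provides combinations(); formerly part of the gregmisc bundle
# Combinations with repeats collapse to every non-empty subset once duplicates are dropped
indexes <- unique(apply(combinations(length(vars), length(vars), repeats.allowed = TRUE), 1, unique))
gen.form <- function(x) as.formula(paste('~', paste(vars[x], collapse = '+')))
formulas <- lapply(indexes, gen.form)
formulas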

Generates:

R> formulas

[[1]] ~a

[[2]] ~a + b

[[3]] ~a + c

[[4]] ~a + d

[[5]] ~a + b + c

[[6]] ~a + b + d

[[7]] ~a + c + d

[[8]] ~a + b + c + d

[[9]] ~b

[[10]] ~b + c

[[11]] ~b + d

[[12]] ~b + c + d

[[13]] ~c

[[14]] ~c + d

[[15]] ~d
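The formulas above are one-sided, so they need a response before they can be passed to a classifier. A minimal sketch, using only base R, that generates the same fifteen right-hand sides (in a different order) and attaches a response with update(). The response name Class is taken from the question; rhs and with_response are illustrative names.

```r
vars <- c("a", "b", "c", "d")

# One-sided formulas for every non-empty subset of vars
rhs <- unlist(
  lapply(seq_along(vars), function(k) {
    combn(vars, k, FUN = function(v) reformulate(v), simplify = FALSE)
  }),
  recursive = FALSE
)

# update() with "Class ~ ." keeps the right-hand side and adds the response
with_response <- lapply(rhs, update, new = Class ~ .)
```

Each element of with_response can then be handed to a model-fitting function in a loop or lapply() call.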