R Error in model.frame.default - Variable lengths differ

Very first post here, also, this is only my second time using R, so please be gentle.

I'm working on a project where I in R am trying to make multiple analyses using the lme4 and lmeTest packages via their lmer function. To do so I have list of the names of the variables i want to analyse that I itterate throug using a for loop. Thus it would look something like this :

list <- MyList of IDs
raw <- My Data File

for (i in list) {
  model <- lmer (`i` ~ Time + SecretorStatus + BioRep + TechRep + (1|Random), data = raw)
  .
  .
  .
  Do something..
  .
  .
}

This however generates this error:

Error in model.frame.default(data = raw, drop.unused.levels = TRUE, formula = paste(i) ~  : 
  variable lengths differ (found for 'Time')

I'm quite sure that the issue is related to the lmer statment, maybe to the i as well as everything works perfectly whenever I manually copy past a value into the place of i. However, as I have a lot of values in "list" I need some kind of loop.

I have googled this quite a lot, and found several answers in here attempting to solve the same/similar issues, but for me they don't work. Below is a list of links to some of the better solutions out there, however they didn't solve my problem.

  • Error in model.frame.default: variable lengths differ
  • ARIMAX with grouped data R
  • https://rstudio-pubs-static.s3.amazonaws.com/63556_e35cc7e2dfb54a5bb551f3fa4b3ec4ae.html

Can anyone offer some insights into this phenomenon?

Edit 1 - @r2evans asked for a reproducible example. This is provided below.

#Packages used
library(Matrix) 
library(lme4)
library(lmerTest)
library(stringr)
library(readr)
#Starting to read data
rootDir <- getwd()
raw <- read.csv(str_c(rootDir, "/P035aForR.csv"), na = c("", "NA", "0"))
list <- read.csv(str_c(rootDir, "/P035aHeaddersForR2.csv"), na = c("", "NA", "0"), header = FALSE)
#Generate dir for output and the dataframe to store the main results
dir.create("Results", showWarnings = TRUE, recursive = FALSE, mode = "0777")
dir.create("Results/Data", showWarnings = TRUE, recursive = FALSE, mode = "0777")
dir.create("Results/Pictures", showWarnings = TRUE, recursive = FALSE, mode = "0777")
dir.create("Results/Pictures/QQPlot", showWarnings = TRUE, recursive = FALSE, mode = "0777")
dir.create("Results/Pictures/RawResiduals", showWarnings = TRUE, recursive = FALSE, mode = "0777")
dir.create("Results/Pictures/PersonResiduals", showWarnings = TRUE, recursive = FALSE, mode = "0777")
Result <- data.frame("","","","","","")
names(Result)<-c("ID","SecretorStatus","Time","BioRep","TechRep","Shapiro") ### <-- Update the headders as needed
Result <- Result[-c(1),]
#Load variables from raw as factors for the analysis
raw$SecretorStatus <- as.factor(raw$SecretorStatus)
raw$Time <- as.factor(raw$Time)
raw$TechRep <- as.factor(raw$TechRep)
raw$BioRep <- as.factor(raw$BioRep)
raw$Random <- as.factor(raw$Random)
for (i in list) {
  model <- lmer (`i` ~ Time + SecretorStatus + BioRep + TechRep + (1|Random), data = raw) 
  anova <- as.data.frame(anova(model, type = 2),) 
  residuals <- residuals(model, "response")
  pvalue <- round(shapiro.test(residuals)$p.value, digits=4 ) 
  out <- as.data.frame(anova(model, type = 2))
  write.table(out, "Results/Data/i.txt", col.names=T, row.names=T, quote=F, sep=",")
  png(Results/Pictures/QQPlot/i.png)
  qqnorm(residuals, sub=paste("p-value of a Shapiro-Wilks test =", pvalue )); qqline(residuals) 
  dev.off()
  Pearson.Residuals <- residuals(model, "pearson")
  Raw.Residuals <- residuals(model, "response")
  Fitted <- fitted(model)
  png(Results/Pictures/RAWResiduals/i.png)
  plot(Fitted, Raw.Residuals, main=i)
  dev.off()
  png(Results/Pictures/PersonResiduals/i.png)
  plot(Fitted,Pearson.Residuals, main=i)
  dev.off()
  tmp <- data.frame(i, pvalue, t(anova[,"Pr(>F)"]))
  names(tmp) <- c("ID", "Shapiro", row.names(anova))
  Result <- rbind(Result, tmp)
}

At this point the error is generated where it says:

Error in model.frame.default(data = raw, drop.unused.levels = TRUE, formula = i ~  : 
  variable lengths differ (found for 'Time')

Regarding the setup of the data, I cannot provide the full dataset (obviously), however a sample that demonstrates the structure is provided below. This is the data for the raw variable.

SecretorStatus  Time    TechRep BioRep  Random  ID1 ID2 ID3 ID4 ID5 ID6 ID7 ID8 ID9 ID10    ID11    ID12    ID13    ID14    ID15    ID16
1   1   1   1   1   23342.99    23342.99    0   0   0   0   0   0   0   0   0   0   102829.8    492252.5    0   924436.3
2   1   2   5   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   529782
2   1   1   6   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   506987.7
2   1   2   6   4   0   0   0   0   0   0   0   0   0   0   0   0   0   48786.41    0   618768.5
1   1   2   1   1   0   0   0   0   0   0   0   0   0   0   0   0   0   414852.1    354153.5    850788.9
1   1   1   2   2   0   0   0   0   0   0   0   0   0   99551.51    0   0   322185.6    0   361100.2    819073.6
1   1   2   2   3   0   0   0   0   0   90194.2 0   0   0   73646.15    0   0   0   398369.2    277569.9    613257.3
1   1   1   3   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1   1   2   3   1   0   0   0   0   0   0   0   0   0   0   0   0   0   265760.8    0   0
2   1   1   4   2   0   0   0   0   0   0   0   0   0   0   0   0   61351.9 554385.9    0   656984.3
2   1   2   4   3   0   0   0   0   0   0   0   0   0   0   0   0   0   622428.4    0   769227.8
2   1   1   5   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   388584.9
1   2   1   1   1   31454.26    31454.26    0   0   0   0   0   0   0   0   0   0   0   0   0   729234.2
1   2   2   1   2   0   0   0   0   0   0   0   0   0   0   0   0   0   333620.4    0   933046.3
1   2   1   2   3   0   0   0   0   0   0   0   0   0   0   0   0   0   834145.3    0   0
1   2   2   2   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   157152.7
1   2   1   3   1   0   0   0   0   0   0   0   0   0   0   0   0   0   178179.3    0   812282.9
1   2   2   3   2   0   0   0   0   0   86782.91    0   0   0   0   0   0   0   191167  0   663968.9
2   2   1   4   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   610315.3
2   2   2   4   4   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   339407.1
2   2   1   5   1   0   0   0   0   0   0   0   0   0   213881.1    0   0   0   0   0   298894.5
2   2   2   5   2   0   0   0   0   0   0   0   0   0   81122.63    0   0   0   0   0   170576.6
2   2   1   6   3   0   0   0   0   0   0   0   0   0   53790.86    0   0   0   0   0   205826
2   2   2   6   4   0   0   0   0   0   37900.34    0   0   0   0   0   0   0   0   315754  232529.7

For the data in list, the data structure for the example data above would thus be.

ID1 ID2 ID3 ID4 ID5 ID6 ID7 ID8 ID9 ID10    ID11    ID12    ID13    ID14    ID15    ID16

The two files with data for raw and list are both csv files. Throughout this entire thing only one error is produced (at least in my end). However, if you execute the code twice, it will complain that the folders already exists. However, apart from that it only generates one error. Another detail that I have noted is that the variable i is stuck at ID1 when this error happens, thus it doesn’t browse through the entire list, it fails at the first object. I hope this helps for the clarification. Please let me know if there are any more details that you want/need to reproduce the error.


Solution 1:

It's a little hard to tell from what you've given us, but I think this should do it:

pred_vars <- c("Time", "SecretorStatus", "BioRep", "TechRep", "(1|Random)")
list_of_IDs <- names(raw)[startsWith(names(raw), "ID")]
for (i in list_of_IDs) {
  f <- reformulate(pred_vars, response = i)
  model <- lmer (f, data = raw)
  ## ...
}

you don't really need reformulate, you can also use paste or sprintf or any other string-manipulation machinery to put your formula together as a string and then apply as.formula(), but reformulate is a little bit nicer.

A slightly more efficient solution would use refit():

for (i in list_of_IDs) {
   if (i == list_of_IDs[1]) {
      model <- lmer (ID1 ~ Time + SecretorStatus + BioRep + TechRep + (1|Random), 
                     data = raw)
   } else {
      model <- refit(model, newresp = raw[[i]])
   }
   ## ...

}

As stated in ?refit, "the ‘refit()’ approach should be faster because it bypasses the creation of the model representation and goes directly to the optimization step."