Understanding data.table invalid .selfref warning
Solution 1:
I just ran your code, and I see the problem. data.table over-allocates the vector of column pointers (so that columns can be added efficiently by reference later on), and this warning occurs when an operation (most likely inadvertently) removes that over-allocation.
Let me try to explain over-allocation using slide 45 from Matt's useR 2014 presentation. The (blue and yellow) boxes on the top correspond to the vector of column pointers and the arrow shows the data each pointer is pointing to.
The figure on the left depicts how adding (or cbinding) a column to a data.frame works. cbinding a column results in a (deep or shallow) copy, which gives a new location for the vector of column pointers (shown in yellow) and for the data (which now has one more column).
The figure on the right shows the data.table way, where there are more than 3 blue boxes to begin with, due to over-allocation during data.table creation. And by using :=, not even a shallow copy is made. The existing vector of column pointers stays where it is, and the next unused over-allocated box is used to assign your new column.
That is the difference between the two, and it is what over-allocation means here.
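To make the over-allocation visible, here is a minimal sketch using data.table's truelength() and address() helpers. The table and its values are made up for illustration (only the column names aa, bb and dd come from the question), and the exact number of spare slots depends on the datatable.alloccol option.

```r
library(data.table)

# Hypothetical sample data; only the column names come from the question.
DT <- data.table(aa = 1:3, bb = c(2.5, 3.5, 4.5), dd = c("x", "y", "x"))

length(DT)       # number of columns actually in use
truelength(DT)   # number of allocated column-pointer slots (the blue boxes)

addr_before <- address(DT)
DT[, ee := 3]    # uses the next spare slot; DT itself is not copied
addr_after <- address(DT)

identical(addr_before, addr_after)  # same object before and after
```

If truelength(DT) is larger than length(DT), there are still spare boxes and := can add a column in place.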
Now, the warning tells you that whatever operation you did has removed this over-allocation, meaning the extra blue boxes are gone! So we can't add columns by reference anymore until we over-allocate again (which would normally be unnecessary and should be avoided, but since the over-allocation is already gone, re-allocating is the next best thing).
My guess is that your dplyr syntax somehow removes this over-allocation, which is caught in the next step when you use :=, and data.table over-allocates once again before adding the new column by reference (which will result in a shallow copy).
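If you want to check for this yourself, a rough sketch is to compare truelength() with length() and re-allocate explicitly with setalloccol() (called alloc.col() in older data.table versions). I haven't traced what dplyr does, so the example below loses the over-allocation in a different way, by a round trip through saveRDS()/readRDS(); that is an assumption made purely for illustration and it won't necessarily trigger the same warning.

```r
library(data.table)

# Hypothetical sample data; only the column names come from the question.
DT <- data.table(aa = 1:3, bb = c(2.5, 3.5, 4.5), dd = c("x", "y", "x"))

# Stand-in for "some operation dropped the spare slots": a disk round trip.
tmp <- tempfile(fileext = ".rds")
saveRDS(DT, tmp)
DT2 <- readRDS(tmp)
unlink(tmp)

length(DT2)
truelength(DT2)   # the data.table docs describe this as 0 for a loaded table

# Over-allocate again explicitly, then add a column by reference.
setalloccol(DT2)
truelength(DT2)   # larger than length(DT2) again
DT2[, ee := 3]
```

Normally you don't need to call setalloccol() yourself, since := re-allocates when it finds the spare slots missing; the explicit call just makes that step visible.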
If I do it the data.table way:
DT <- DT[, list(m=mean(bb)), by=list(dd,aa)]
DT[, ee := 3]
it works just fine.
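For anyone who wants to run this end to end, here is a self-contained sketch of the same two steps. The input data is invented (I've only reused the column names aa, bb and dd), and the result is stored in a new variable res instead of overwriting DT.

```r
library(data.table)

# Hypothetical input; the values are made up for illustration.
DT <- data.table(
  aa = c(1, 1, 2, 2),
  dd = c("x", "x", "y", "y"),
  bb = c(1, 3, 5, 7)
)

# Aggregate the data.table way; the result is a fresh, over-allocated data.table.
res <- DT[, list(m = mean(bb)), by = list(dd, aa)]
truelength(res) > length(res)   # spare column slots are available

# Adding a column by reference now goes through without the warning.
res[, ee := 3]
res
```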
I don't have the time to look into dplyr right now to verify or find out what's doing this.