How to delete a row by reference in data.table?
Solution 1:
Good question. data.table
can't delete rows by reference yet.
data.table
can add and delete columns by reference since it over-allocates the vector of column pointers, as you know. The plan is to do something similar for rows and allow fast insert
and delete
. A row delete would use memmove
in C to budge up the items (in each and every column) after the deleted rows. Deleting a row in the middle of the table would still be quite inefficient compared to a row store database such as SQL, which is more suited for fast insert and delete of rows wherever those rows are in the table. But still, it would be a lot faster than copying a new large object without the deleted rows.
On the other hand, since column vectors would be over-allocated, rows could be inserted (and deleted) at the end, instantly; e.g., a growing time series.
It's filed as an issue: Delete rows by reference.
Solution 2:
the approach that i have taken in order to make memory use be similar to in-place deletion is to subset a column at a time and delete. not as fast as a proper C memmove solution, but memory use is all i care about here. something like this:
DT = data.table(col1 = 1:1e6)
cols = paste0('col', 2:100)
for (col in cols){ DT[, (col) := 1:1e6] }
keep.idxs = sample(1e6, 9e5, FALSE) # keep 90% of entries
DT.subset = data.table(col1 = DT[['col1']][keep.idxs]) # this is the subsetted table
for (col in cols){
DT.subset[, (col) := DT[[col]][keep.idxs]]
DT[, (col) := NULL] #delete
}