Using lists inside data.table columns

In data.table it is possible to have columns of type list, and I'm trying to benefit from this feature for the first time. For each row of my table dt, I need to store several comments taken from an rApache web service. Each comment will have a username, datetime, and body item.

Instead of using long strings with some unusual character (like |) to separate each message from the others, and a ; to separate each item in a comment, I thought of using lists like this:

library(data.table)
dt <- data.table(id=1:2,
        comment=list(list(
            list(username="michele", date=Sys.time(), message="hello"),
            list(username="michele", date=Sys.time(), message="world")),
          list(
            list(username="michele", date=Sys.time(), message="hello"),
            list(username="michele", date=Sys.time(), message="world"))))

> dt
   id comment
1:  1  <list>
2:  2  <list>

to store all the comments added for one particular row (also because it will be easier to convert to JSON later on, when I need to send it back to the UI).

However, when I try to simulate how I will actually be filling my table in production (adding a single comment to a particular row), R either crashes, or doesn't assign what I would like and then crashes:

> library(data.table)
> dt <- data.table(id=1:2, comment=vector(mode="list", length=2))
> dt$comment
[[1]]
NULL

[[2]]
NULL

> dt[1L, comment := 1] # this works
> dt$comment
[[1]]
[1] 1

[[2]]
NULL

> set(dt, 1L, "comment", list(1, "a"))  # assign only `1` and when I try to see `dt` R crashes
Warning message:
In set(dt, 1L, "comment", list(1, "a")) :
  Supplied 2 items to be assigned to 1 items of column 'comment' (1 unused)

> dt[1L, comment := list(1, "a")]       # R crashes as soon as I run
> dt[1L, comment := list(list(1, "a"))] # any of these two

I know I'm trying to misuse data.table, e.g. the way the j argument has been designed allows this:

dt[1L, c("id", "comment") := list(1, "a")] # lists in RHS are seen as different columns! not parts of one

Question: So, is there a way to do the assignment I want? Or do I just have to take dt$comment out into a variable, modify it, and then re-assign the whole column every time I need to do an update?


Solution 1:

Using :=:

dt = data.table(id = 1:2, comment = vector("list", 2L))

# assign value 1 to just the first row of 'comment'
dt[1L, comment := 1L]

# assign value of 1 and "a" to rows 1 and 2
dt[, comment := list(1, "a")]

# assign value of "a","b" to row 1, and 1 to row 2 for 'comment'
dt[, comment := list(c("a", "b"), 1)]

# assign list(1, "a") to just 1 row of 'comment'
dt[1L, comment := list(list(list(1, "a")))]

For the last case, you'll need one more list(.) wrapper, because data.table treats the outermost list(.) as the set of values to assign, by reference, to the columns on the left-hand side.
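As a minimal sketch for checking the nesting, using only the calls shown above: after the triple-list assignment, the first cell should hold the two-item list and the second cell should stay empty. The comments on the three list levels are an interpretation of the behaviour described here, not wording taken from the data.table documentation.

```r
library(data.table)

dt <- data.table(id = 1:2, comment = vector("list", 2L))

# outer list():  the columns being assigned (only 'comment' here)
# middle list(): the values for that list column, one item per selected row
# inner list():  the content of the single cell
dt[1L, comment := list(list(list(1, "a")))]

str(dt$comment[[1L]])      # the cell itself is a list with two items
is.null(dt$comment[[2L]])  # row 2 was not touched
```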

Using set:

dt = data.table(id = 1:2, comment = vector("list", 2L))

# assign value 1 to just the first row of 'comment'
set(dt, i=1L, j="comment", value=1L)

# assign value of 1 and "a" to rows 1 and 2
set(dt, j="comment", value=list(1, "a"))

# assign value of "a","b" to row 1, and 1 to row 2 for 'comment'
set(dt, j="comment", value=list(c("a", "b"), 1))

# assign list(1, "a") to just 1 row of 'comment'
set(dt, i=1L, j="comment", value=list(list(list(1, "a"))))
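For the use case in the question (adding one comment at a time to a given row), the last set() call can be wrapped in a small helper. This is only a sketch: add_comment() and its arguments are hypothetical names, not part of data.table, and it assumes the list column is called comment as in the question.

```r
library(data.table)

dt <- data.table(id = 1:2, comment = vector("list", 2L))

# add_comment() is a hypothetical helper, not a data.table function
add_comment <- function(dt, row, username, message, date = Sys.time()) {
  new_comment <- list(username = username, date = date, message = message)
  # comments already stored in this cell (NULL if there are none yet)
  existing <- dt$comment[[row]]
  updated <- c(existing, list(new_comment))
  # same triple-list wrapping as above, so the whole list lands in one cell
  set(dt, i = row, j = "comment", value = list(list(updated)))
  invisible(dt)
}

add_comment(dt, 1L, "michele", "hello")
add_comment(dt, 1L, "michele", "world")

length(dt$comment[[1L]])        # number of comments stored for row 1
dt$comment[[1L]][[2L]]$message  # body of the second comment
```

Because set() assigns by reference, the helper updates dt in place and only the one cell is rewritten, rather than the whole column.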

HTH


I'm using the current development version 1.9.3, but this should work just fine on any other version.

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.3

loaded via a namespace (and not attached):
[1] plyr_1.8.0.99  reshape2_1.2.2 stringr_0.6.2  tools_3.0.3   

Solution 2:

Just to add more info, what list columns are really designed for is when each cell is itself a vector:

> DT = data.table(a=1:2, b=list(1:5,1:10))
> DT
   a            b
1: 1    1,2,3,4,5
2: 2 1,2,3,4,5,6,

> sapply(DT$b, length)
[1]  5 10 

Notice the pretty printing of the vectors in the b column. Those commas are just for display; each cell is actually a vector (as shown by the sapply command above). Note also the trailing comma on the 2nd item of b. That indicates that the vector is longer than displayed (data.table just displays the first 6 items).
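As a small sketch of working with such a column (the column names n and total are made up for illustration), each cell can be pulled out in full with [[ and summarised per row with sapply:

```r
library(data.table)

DT <- data.table(a = 1:2, b = list(1:5, 1:10))

DT$b[[2L]]                     # the full vector for row 2, not just what is printed

DT[, n := sapply(b, length)]   # length of each cell: 5 and 10
DT[, total := sapply(b, sum)]  # sum of each cell: 15 and 55
DT
```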

Or, more like your example:

> DT = data.table(id=1:2, comment=list( c("michele", Sys.time(), "hello"),
                                        c("michele", Sys.time(), "world") ))
> DT
   id                       comment
1:  1 michele,1395330180.9278,hello
2:  2 michele,1395330180.9281,world 

What you're trying to do is not only have a list column, but also put a list into each cell, which is why <list> is being displayed. Additionally, if you place named lists into each cell, beware that all those names will use up space. Where possible, a list column of vectors may be easier.
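To get a rough feel for the space point, the two representations of a single comment can be compared with object.size(). This is only an illustrative sketch: the variable names are made up, the exact byte counts depend on the R build, and format() is used here (an addition, not in the example above) so the timestamp stays readable once it becomes a string.

```r
# one comment as a named list, as in the question
as_list <- list(username = "michele", date = Sys.time(), message = "hello")

# the same comment as a plain character vector
as_vector <- c("michele", format(Sys.time()), "hello")

object.size(as_list)    # includes the names and a separate object per item
object.size(as_vector)  # a single character vector
```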