Split text string in a data.table columns

r data.table

I have a script that reads in data from a CSV file into a data.table and then splits the text in one column into several new columns. I am currently using the lapply and strsplit functions to do this. Here's an example:

library("data.table")
df = data.table(PREFIX = c("A_B","A_C","A_D","B_A","B_C","B_D"),
                VALUE  = 1:6)
dt = as.data.table(df)

# split PREFIX into new columns
dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))

dt 
#    PREFIX VALUE PX PY
# 1:    A_B     1  A  B
# 2:    A_C     2  A  C
# 3:    A_D     3  A  D
# 4:    B_A     4  B  A
# 5:    B_C     5  B  C
# 6:    B_D     6  B  D

In the example above the column PREFIX is split into two new columns PX and PY on the "_" character.

Even though this works just fine, I was wondering if there is a better (more efficient) way to do this using data.table. My real datasets have >=10M+ rows, so time/memory efficiency becomes really important.

UPDATE:

Following @Frank's suggestion I created a larger test case and used the suggested commands, but the stringr::str_split_fixed takes a lot longer than the original method.

library("data.table")
library("stringr")
system.time ({
    df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                    VALUE  = rep(1:6, 1000000))
    dt = data.table(df)
})
#   user  system elapsed 
#  0.682   0.075   0.758 

system.time({ dt[, c("PX","PY") := data.table(str_split_fixed(PREFIX,"_",2))] })
#    user  system elapsed 
# 738.283   3.103 741.674 

rm(dt)
system.time ( {
    df = data.table(PREFIX = rep(c("A_B","A_C","A_D","B_A","B_C","B_D"), 1000000),
                     VALUE = rep(1:6, 1000000) )
    dt = as.data.table(df)
})
#    user  system elapsed 
#   0.123   0.000   0.123 

# split PREFIX into new columns
system.time ({
    dt$PX = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 1))
    dt$PY = as.character(lapply(strsplit(as.character(dt$PREFIX), split="_"), "[", 2))
})
#    user  system elapsed 
#  33.185   0.000  33.191

So the str_split_fixed method takes about 20X times longer.

Solution 1:

Update: From version 1.9.6 (on CRAN as of Sep'15), we can use the function tstrsplit() to get the results directly (and in a much more efficient manner):

require(data.table) ## v1.9.6+
dt[, c("PX", "PY") := tstrsplit(PREFIX, "_", fixed=TRUE)]
#    PREFIX VALUE PX PY
# 1:    A_B     1  A  B
# 2:    A_C     2  A  C
# 3:    A_D     3  A  D
# 4:    B_A     4  B  A
# 5:    B_C     5  B  C
# 6:    B_D     6  B  D

tstrsplit() basically is a wrapper for transpose(strsplit()), where transpose() function, also recently implemented, transposes a list. Please see ?tstrsplit() and ?transpose() for examples.

See history for old answers.

Solution 2:

I add answer for someone who do not use data.table v1.9.5 and also want an one line solution.

dt[, c('PX','PY') := do.call(Map, c(f = c, strsplit(PREFIX, '-'))) ]

Could not transfer artifact org.apache.maven.plugins:maven-surefire-plugin:pom:2.7.1 from/to central (http://repo1.maven.org/maven2)

How to use refs in React with Typescript

How to apply "filters" to AVCaptureVideoPreviewLayer

PreferenceActivity Android 4.0 and earlier

Registry key Error: Java version has value '1.8', but '1.7' is required

How to compare Unicode characters that "look alike"?

Generating an array of letters in the alphabet

How to protect the password field in Mongoose/MongoDB so it won't return in a query when I populate collections?

Android add spacing below last element in recyclerview with gridlayoutmanager

java.lang.IllegalStateException: The specified child already has a parent

Why does Maven have such a bad rep? [closed]

What programming practice that you once liked have you since changed your mind about? [closed]

Split text string in a data.table columns

UPDATE:

Solution 1:

Solution 2:

Related

Recent Posts