Melt using patterns when variable names contain string information - avoid coercion to numeric
I am using the patterns()
argument in data.table::melt()
to melt data that has columns that have several easily-defined patterns. It is working, but I'm not seeing how I can create a character index variable instead of the default numeric breakdown.
For example, in data set 'A', the dog and cat column names have numeric suffixes (e.g. 'dog_1', 'cat_2'), which are handled correctly in melt
(see the resulting 'variable' column):
A = data.table(idcol = c(1:5),
dog_1 = c(1:5), cat_1 = c(101:105),
dog_2 = c(6:10), cat_2 = c(106:110),
dog_3 = c(11:15), cat_3 = c(111:115))
head(melt(A, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat")))
idcol variable dog cat
1: 1 1 1 101
2: 2 1 2 102
3: 3 1 3 103
4: 4 1 4 104
5: 5 1 5 105
6: 1 2 6 106
However, in data set 'B', the suffix of dog and cat columns is a string (e.g. 'dog_one', 'cat_two'). Such suffixes are converted to a numeric representation in melt
, see the "variable" column.
B = data.table(idcol = c(1:5),
dog_one = c(1:5), cat_one = c(101:105),
dog_two = c(6:10), cat_two = c(106:110),
dog_three = c(11:15), cat_three = c(111:115))
head(melt(B, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat")))
idcol variable dog cat
1: 1 1 1 101
2: 2 1 2 102
3: 3 1 3 103
4: 4 1 4 104
5: 5 1 5 105
6: 1 2 6 106
How can I fill the "variable" column with the correct string suffixes one/two/three instead of 1/2/3?
Solution 1:
From data.table 1.14.1
(in development; installation), the new function measure
makes it much easier to melt data with concatenated variable names to a desired format (see ?measure
.
The sep
arator argument is used to create different groups of measure.vars
. In the ...
argument, we further specify the fate of the values corresponding to the groups generated by sep
.
In OP, the variable names are of the form species_number
, e.g. dog_one
. Thus, we need two symbols in ...
to specify how groups before and after the sep
arator should be treated, one for the species (dog or cat) and one for the numbers (one-three).
If a symbol in ...
is set to value.name
, then "melt
returns multiple value columns (with names defined by the unique values in that group)". Thus, because you want multiple columns for each species, the first group defined by the separator, the first symbol in ...
should be value.name
.
The second group, after the separator, are the numbers, so this is specified as the second symbol in ...
. We want in a single value column for the numbers, so here we specify the desired column name of the output variable, e.g. "nr".
melt(B, measure.vars = measure(value.name, nr, sep = "_"))
idcol nr dog cat
# 1: 1 one 1 101
# 2: 2 one 2 102
# 3: 3 one 3 103
# 4: 4 one 4 104
# 5: 5 one 5 105
# 6: 1 two 6 106
# 7: 2 two 7 107
# 8: 3 two 8 108
# 9: 4 two 9 109
# 10: 5 two 10 110
# 11: 1 three 11 111
# 12: 2 three 12 112
# 13: 3 three 13 113
# 14: 4 three 14 114
# 15: 5 three 15 115
Pre data.table 1.14.1
There might be easier ways, but this seems to work:
# grab suffixes of 'variable' names
suff <- unique(sub('^.*_', '', names(B[ , -1])))
# suff <- unique(tstrsplit(names(B[, -1]), "_")[[2]])
# melt
B2 <- melt(B, measure = patterns("^dog", "^cat"), value.name = c("dog", "cat"))
# replace factor levels in 'variable' with the suffixes
setattr(B2$variable, "levels", suff)
B2
# idcol variable dog cat
# 1: 1 one 1 101
# 2: 2 one 2 102
# 3: 3 one 3 103
# 4: 4 one 4 104
# 5: 5 one 5 105
# 6: 1 two 6 106
# 7: 2 two 7 107
# 8: 3 two 8 108
# 9: 4 two 9 109
# 10: 5 two 10 110
# 11: 1 three 11 111
# 12: 2 three 12 112
# 13: 3 three 13 113
# 14: 4 three 14 114
# 15: 5 three 15 115
Two related data.table
issues:
melt.data.table should offer variable
to match on the name, rather than the number
FR: expansion of melt functionality for handling names of output.
This is one of the (rare) instances where I believe good'ol base::reshape
is cleaner. Its sep
argument comes in handy here — both the names of the 'value' column and the levels of the 'variable' columns are generated in one go:
reshape(data = B,
varying = names(B[ , -1]),
sep = "_",
direction = "long")