Aggregating by unique identifier and concatenating related values into a string [duplicate]
I have a need that I imagine could be satisfied by aggregate
or reshape
, but I can't quite figure out.
I have a list of names (brand
), and accompanying ID number (id
). This data is in long form, so names can have multiple ID's. I'd like to de-dupicate by the name (brand
) and concatenate the multiple possible id
's into a string separated by a comment.
For example:
brand id
RadioShack 2308
Rag & Bone 4466
Ragu 1830
Ragu 4518
Ralph Lauren 1638
Ralph Lauren 2719
Ralph Lauren 2720
Ralph Lauren 2721
Ralph Lauren 2722
should become:
RadioShack 2308
Rag & Bone 4466
Ragu 1830,4518
Ralph Lauren 1638,2719,2720,2721,2722
How would I accomplish this?
Solution 1:
Let's call your data.frame DF
> aggregate(id ~ brand, data = DF, c)
brand id
1 RadioShack 2308
2 Rag & Bone 4466
3 Ragu 1830, 4518
4 Ralph Lauren 1638, 2719, 2720, 2721, 2722
Another alternative using aggregate
is:
result <- aggregate(id ~ brand, data = DF, paste, collapse = ",")
This produces the same result and now id
is not a list
anymore. Thanks to @Frank comment. To see the class
of each column try:
> sapply(result, class)
brand id
"factor" "character"
As mentioned by @DavidArenburg in the comments, another alternative is using the toString
function:
aggregate(id ~ brand, data = DF, toString)
Solution 2:
A nice clean one line in data.table
library(data.table)
setDT(DF)
TWO OPTIONS:
results as a list
DF[ , .(id = list(id)), by = brand]
brand id
1: RadioShack 2308
2: Rag & Bone 4466
3: Ragu 1830,4518
4: Ralph Lauren 1638,2719,2720,2721,2722
>
results as a string
DF[ , .(id = paste(id, collapse=",")), by = brand]
brand id
1: RadioShack 2308
2: Rag & Bone 4466
3: Ragu 1830,4518
4: Ralph Lauren 1638,2719,2720,2721,2722
Note
Even though the two results appear the same (that is when you print them, they look identical), they are in fact very different and allow for different functionality.
Namely, using the list option (the first one) allows you to then perform functions on the orignal id
s.
The latter will allow you to display the information more easily (including exporting to CSV
or excel
), but to operate on the id
's will require splicing them back.