Concatenate strings by group with dplyr [duplicate]
I have a data frame that looks like this:
> data <- data.frame(foo=c(1, 1, 2, 3, 3, 3), bar=c('a', 'b', 'a', 'b', 'c', 'd'))
> data
foo bar
1 1 a
2 1 b
3 2 a
4 3 b
5 3 c
6 3 d
I would like to create a new column bars_by_foo which is the concatenation of the values of bar by foo. So the new data should look like this:
foo bar bars_by_foo
1 1 a ab
2 1 b ab
3 2 a a
4 3 b bcd
5 3 c bcd
6 3 d bcd
I was hoping that the following would work:
p <- function(v) {
Reduce(f=paste, x = v)
}
data %>%
group_by(foo) %>%
mutate(bars_by_foo=p(bar))
But that code gives me an error:
Error: incompatible types, expecting a character vector
What am I doing wrong?
Solution 1:
You could simply do:
data %>%
group_by(foo) %>%
mutate(bars_by_foo = paste0(bar, collapse = ""))
No helper functions are needed.
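For reference, a self-contained sketch of the same approach. It assumes dplyr is installed; the stringsAsFactors = FALSE argument and the trailing ungroup() are additions for illustration, not part of the answer above.

```r
library(dplyr)

# Same data as in the question; keep bar as character, not factor
data <- data.frame(
  foo = c(1, 1, 2, 3, 3, 3),
  bar = c("a", "b", "a", "b", "c", "d"),
  stringsAsFactors = FALSE
)

result <- data %>%
  group_by(foo) %>%
  # collapse = "" joins all values of bar within a group into one string,
  # which mutate() then recycles to every row of that group
  mutate(bars_by_foo = paste0(bar, collapse = "")) %>%
  ungroup()  # drop the grouping so later steps work on the whole table

print(result)
```

The key point is that collapse joins the elements of a single vector, whereas sep only controls how separate arguments are joined.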
Solution 2:
It looks like there's a bit of an issue with the mutate function. I've found that it's a better approach to work with summarise when you're grouping data in dplyr (that's by no means a hard and fast rule, though). The paste function also introduces whitespace into the result, so either set sep = "" or just use paste0.
Here is my code:
p <- function(v) {
Reduce(f=paste0, x = v)
}
data %>%
group_by(foo) %>%
summarise(bars_by_foo = p(as.character(bar))) %>%
merge(., data, by = 'foo') %>%
select(foo, bar, bars_by_foo)
Resulting in:
foo bar bars_by_foo
1 1 a ab
2 1 b ab
3 2 a a
4 3 b bcd
5 3 c bcd
6 3 d bcd
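The whitespace point mentioned in this answer can be checked in isolation with base R. This is a small sketch on a hypothetical vector v, not code from the answer:

```r
v <- c("b", "c", "d")  # hypothetical input: one group's values of bar

# paste() joins its arguments with a single space by default
spaced <- Reduce(paste, v)

# paste0() is paste() with an empty separator
joined <- Reduce(paste0, v)

# Equivalent: pass sep = "" to paste() explicitly
no_sep <- Reduce(function(a, b) paste(a, b, sep = ""), v)

# Without Reduce: collapse joins the elements of one vector
collapsed <- paste0(v, collapse = "")

print(c(spaced, joined, no_sep, collapsed))
# "b c d" "bcd" "bcd" "bcd"
```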
Solution 3:
You can try this:
agg <- aggregate(bar~foo, data = data, paste0, collapse="")
df <- merge(data, agg, by = "foo", all = TRUE)
colnames(df) <- c(colnames(data), "bars_by_foo") # optional
# foo bar bars_by_foo
# 1 1 a ab
# 2 1 b ab
# 3 2 a a
# 4 3 b bcd
# 5 3 c bcd
# 6 3 d bcd
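If the goal is only to add the column, a different base R technique, ave(), may avoid the merge step. This is a sketch of an alternative, not the method of the answer above, and it assumes bar is a character column (with a factor column the assignment would produce NAs):

```r
data <- data.frame(
  foo = c(1, 1, 2, 3, 3, 3),
  bar = c("a", "b", "a", "b", "c", "d"),
  stringsAsFactors = FALSE  # ave() writes strings back into bar's type
)

# ave() applies FUN to bar within each level of foo and returns a
# vector of the same length as bar, in the original row order
data$bars_by_foo <- ave(
  data$bar, data$foo,
  FUN = function(v) paste0(v, collapse = "")
)

print(data)
```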
Solution 4:
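Your function works if you ensure that the values of bar are character strings and not levels of a factor. Note that paste separates its arguments with a space by default, so the result below is "a b" rather than "ab"; use paste0 to avoid that.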
data <- data.frame(foo=c(1, 1, 2, 3, 3, 3), bar=c('a', 'b', 'a', 'b', 'c', 'd'),
stringsAsFactors = FALSE)
library("dplyr")
p <- function(v) {
Reduce(f=paste, x = v)
}
data %>%
group_by(foo) %>%
mutate(bars_by_foo=p(bar))
Source: local data frame [6 x 3]
Groups: foo [3]
foo bar bars_by_foo
<dbl> <chr> <chr>
1 1 a a b
2 1 b a b
3 2 a a
4 3 b b c d
5 3 c b c d
6 3 d b c d
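When the data frame already exists with a factor column, converting the column is an alternative to rebuilding the data. The sketch below uses base R only and forces stringsAsFactors = TRUE to reproduce the factor case. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so the conversion matters mainly on older versions or with data that arrives as factors.

```r
# Reproduce the factor case explicitly
data <- data.frame(
  foo = c(1, 1, 2, 3, 3, 3),
  bar = c("a", "b", "a", "b", "c", "d"),
  stringsAsFactors = TRUE
)
class_before <- class(data$bar)  # "factor"

# Convert the factor levels to plain character strings
data$bar <- as.character(data$bar)
class_after <- class(data$bar)   # "character"

# The helper from the question, using paste0 to avoid the spaces
p <- function(v) {
  Reduce(f = paste0, x = v)
}

# Apply it to one group's values to check the result
group3 <- p(data$bar[data$foo == 3])
print(group3)
```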