data.frame Group By column [duplicate]
I have a data frame DF.
Say DF is:
A B
1 1 2
2 1 3
3 2 3
4 3 5
5 3 6
Now I want to combine together the rows by the column A and to have the sum of the column B.
For example:
A B
1 1 5
2 2 3
3 3 11
I am doing this currently using an SQL query with the sqldf function. But for some reason it is very slow. Is there any more convenient way to do that? I could do it manually too using a for loop but it is again slow. My SQL query is " Select A,Count(B) from DF group by A".
In general whenever I don't use vectorized operations and I use for loops the performance is extremely slow even for single procedures.
Solution 1:
This is a common question. In base, the option you're looking for is aggregate
. Assuming your data.frame
is called "mydf", you can use the following.
> aggregate(B ~ A, mydf, sum)
A B
1 1 5
2 2 3
3 3 11
I would also recommend looking into the "data.table" package.
> library(data.table)
> DT <- data.table(mydf)
> DT[, sum(B), by = A]
A V1
1: 1 5
2: 2 3
3: 3 11
Solution 2:
Using dplyr
:
require(dplyr)
df <- data.frame(A = c(1, 1, 2, 3, 3), B = c(2, 3, 3, 5, 6))
df %>% group_by(A) %>% summarise(B = sum(B))
## Source: local data frame [3 x 2]
##
## A B
## 1 1 5
## 2 2 3
## 3 3 11
With sqldf
:
library(sqldf)
sqldf('SELECT A, SUM(B) AS B FROM df GROUP BY A')
Solution 3:
I would recommend having a look at the plyr
package.
It might not be as fast as data.table or other packages, but it is quite instructive, especially when starting with R and having to do some data manipulation.
> DF <- data.frame(A = c("1", "1", "2", "3", "3"), B = c(2, 3, 3, 5, 6))
> library(plyr)
> DF.sum <- ddply(DF, c("A"), summarize, B = sum(B))
> DF.sum
A B
1 1 5
2 2 3
3 3 11