sample rows of subgroups from dataframe with dplyr
Solution 1:
Yes, you can use dplyr elegantly by the function do(). Here is an example:
mtcars %>%
group_by(cyl) %>%
do(sample_n(.,2))
and the results are like this
Source: local data frame [6 x 11]
Groups: cyl
mpg cyl disp hp drat wt qsec vs am gear carb
1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
3 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
4 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
5 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
6 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Update:
The do
function is no longer needed for sample_n
in newer versions of dplyr. Current code for taking a random sample of two rows per group:
mtcars %>%
group_by(cyl) %>%
sample_n(2)
Solution 2:
This is easy to do with data.table, and useful for a big table.
NOTE: As mentioned in the coments by Troy, there is a more effiecient way of doing this using data.table, but i wanted to respect the OP sample function and format in the answer.
require(data.table)
DT <- data.table(x = rnorm(10e6, 100, 50), y = letters)
sampleGroup<-function(df,size) {
df[sample(nrow(df),size=size),]
}
result <- DT[, sampleGroup(.SD, 10), by=y]
print(result)
# y x y
# 1: a 30.11659 m
# 2: a 57.99974 h
# 3: a 58.13634 o
# 4: a 87.28466 x
# 5: a 85.54986 j
# ---
# 256: z 149.85817 d
# 257: z 160.24293 e
# 258: z 26.63071 j
# 259: z 17.00083 t
# 260: z 130.27796 f
system.time(DT[, sampleGroup(.SD, 10), by=y])
# user system elapsed
# 0.66 0.02 0.69
Using the iris dataset:
iris <- data.table(iris)
iris[,sampleGroup(.SD, 10), by=Species]
Solution 3:
That's a good question! Can't see any easy way to do it with the documented syntax for dplyr
but how about this for a workaround?
sampleGroup<-function(df,x=1){
df[
unlist(lapply(attr((df),"indices"),function(r)sample(r,min(length(r),x))))
,]
}
sampleGroup(iris %.% group_by(Species),3)
#Source: local data frame [9 x 5]
#Groups: Species
#
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#39 4.4 3.0 1.3 0.2 setosa
#16 5.7 4.4 1.5 0.4 setosa
#25 4.8 3.4 1.9 0.2 setosa
#51 7.0 3.2 4.7 1.4 versicolor
#62 5.9 3.0 4.2 1.5 versicolor
#59 6.6 2.9 4.6 1.3 versicolor
#148 6.5 3.0 5.2 2.0 virginica
#103 7.1 3.0 5.9 2.1 virginica
#120 6.0 2.2 5.0 1.5 virginica
EDIT - PERFORMANCE COMPARISON
Here's a test against using data.table (both native and with a function call as per the example) for 1m rows, 26 groups.
Native data.table is about 2x as fast as the dplyr workaround and also than data.table call with callout. So probably dplyr / data.table are about the same performance.
Hopefully the dplyr guys will give us some native syntax for sampling soon! (or even better, maybe it's already there)
sampleGroup.dt<-function(df,size) {
df[sample(nrow(df),size=size),]
}
testdata<-data.frame(group=sample(letters,10e5,T),runif(10e5))
dti<-data.table(testdata)
# using the dplyr workaround with external function call
system.time(sampleGroup(testdata %.% group_by(group),10))
#user system elapsed
#0.07 0.00 0.06
#using native data.table
system.time(dti[dti[,list(val=sample(.I,10)),by="group"]$val])
#user system elapsed
#0.04 0.00 0.03
#using data.table with external function call
system.time(dti[, sampleGroup.dt(dti, 10), by=group])
#user system elapsed
#0.06 0.02 0.08