Select the top N values by group

This seems more straightforward with data.table, since it sorts the data when the key is set.

So, to get the first 3 records of each group in sorted (ascending) order,

library(data.table)
d <- data.table(mtcars, key = "cyl")
d[, head(.SD, 3), by = cyl]

does it.

And if you want the last 3 records of each group (the other end of the sort order):

d[, tail(.SD, 3), by=cyl] # Thanks @MatthewDowle
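
Note that the key here is only cyl, so head() and tail() pick rows by their position within each group rather than by the values of any other column. If the goal is the top 3 by a particular column, one option is to order by that column first. A minimal sketch, assuming mpg is the column of interest (the name top3 is only for illustration):

```r
library(data.table)

d <- data.table(mtcars, key = "cyl")

# Order rows by descending mpg, then take the first 3 rows of each cyl group
top3 <- d[order(-mpg), head(.SD, 3), by = cyl]
top3[, .(cyl, mpg)]
```

Ties at the cut-off are dropped arbitrarily here, since head() only counts rows.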

Edit: To keep ties, using the mpg column (all rows whose mpg is among the 3 lowest distinct values in each group):

d <- data.table(mtcars, key="cyl")
d.out <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by=cyl]

#     cyl  mpg  disp  hp drat    wt  qsec vs am gear carb rank
#  1:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1   11
#  2:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2    1
#  3:   4 21.5 120.1  97 3.70 2.465 20.01  1  0    3    1    8
#  4:   4 21.4 121.0 109 4.11 2.780 18.60  1  1    4    2    6
#  5:   6 18.1 225.0 105 2.76 3.460 20.22  1  0    3    1    7
#  6:   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4    1
#  7:   6 17.8 167.6 123 3.92 3.440 18.90  1  0    4    4    2
#  8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4    7
#  9:   8 10.4 472.0 205 2.93 5.250 17.98  0  0    3    4   14
# 10:   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4    5
# 11:   8 13.3 350.0 245 3.73 3.840 15.41  0  0    3    4    3

# The last N values work the same way, using tail() instead of head()
d.out <- d[, .SD[mpg %in% tail(sort(unique(mpg)), 3)], by=cyl]
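
The same selection can probably be written with a dense rank, which may read more directly as "the 3 lowest distinct values". This is only a sketch: it assumes data.table's frank() with ties.method = "dense", and the names by_unique and by_rank are illustrative.

```r
library(data.table)

d <- data.table(mtcars, key = "cyl")

# Original approach: keep rows whose mpg is among the 3 lowest distinct values
by_unique <- d[, .SD[mpg %in% head(sort(unique(mpg)), 3)], by = cyl]

# Dense rank gives tied values the same rank and leaves no gaps,
# so rank <= 3 selects the same rows
by_rank <- d[, .SD[frank(mpg, ties.method = "dense") <= 3], by = cyl]

nrow(by_unique)
nrow(by_rank)
```

For the highest values instead, frank(-mpg, ties.method = "dense") would be the analogous call.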

dplyr does the trick:

library(dplyr)

mtcars %>%
  arrange(desc(mpg)) %>%
  group_by(cyl) %>%
  slice(1:2)


    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  33.9     4  71.1    65  4.22 1.835 19.90     1     1     4     1
2  32.4     4  78.7    66  4.08 2.200 19.47     1     1     4     1
3  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
4  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
5  19.2     8 400.0   175  3.08 3.845 17.05     0     0     3     2
6  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2

Just sort by whichever column you like (mpg, for example; the question is not clear on this):

mt <- mtcars[order(mtcars$mpg), ]

Then use the by() function to get the top n rows in each group:

d <- by(mt, mt["cyl"], head, n=4)

If you want the result to be a data.frame:

Reduce(rbind, d)
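
Put together, the steps above might look like the following sketch. It uses do.call(rbind, ...) as an alternative to Reduce(); the name top4 is just for illustration.

```r
# Sort ascending by mpg, so head() returns the lowest values in each group
mt <- mtcars[order(mtcars$mpg), ]

# by() returns one data.frame per cyl value
d <- by(mt, mt["cyl"], head, n = 4)

# Stack the per-group pieces into a single data.frame
top4 <- do.call(rbind, d)
table(top4$cyl)
```

To get the highest values instead, sort with order(-mtcars$mpg) before calling by().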

Edit: Handling ties is more difficult, but if all ties are desired:

by(mt, mt["cyl"], function(x) x[rank(x$mpg) %in% sort(unique(rank(x$mpg)))[1:4], ])
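
To see what the tie handling changes, the sketch below compares the per-group row counts of the plain head() version and the rank-based version. The names plain and with_ties are illustrative.

```r
mt <- mtcars[order(mtcars$mpg), ]

# Exactly 4 rows per group; ties at the cut-off are dropped arbitrarily
plain <- by(mt, mt["cyl"], head, n = 4)

# Rows whose rank is among the 4 smallest distinct rank values in the group
with_ties <- by(mt, mt["cyl"], function(x) {
  r <- rank(x$mpg)
  x[r %in% sort(unique(r))[1:4], ]
})

sapply(plain, nrow)
sapply(with_ties, nrow)
```

Groups that contain tied mpg values among their lowest entries return more than 4 rows in the second version.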

Another approach is to break ties based on some other information, e.g.,

mt <- mtcars[order(mtcars$mpg, mtcars$hp), ]
by(mt, mt["cyl"], head, n=4)

There are at least 4 ways to do this, and each behaves a little differently. Here we group by u_id and order by the lift value.

1 dplyr traditional way

library(dplyr)
top10_final_subset1 = final_subset %>% arrange(desc(lift)) %>% group_by(u_id) %>% slice(1:10)

If you swap the order of arrange(desc(lift)) and group_by(u_id), the result is essentially the same. If there are ties in lift, slice() still cuts each group to no more than 10 rows, and if a group has only 5 rows, it returns just those 5.
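
A small sketch of that behaviour, using made-up data in place of final_subset (the two users and their lift values are hypothetical):

```r
library(dplyr)

# Hypothetical data: user "a" has 12 rows, user "b" has only 5
final_subset <- data.frame(
  u_id = rep(c("a", "b"), times = c(12, 5)),
  lift = c(12:1, 5:1)
)

top10 <- final_subset %>%
  arrange(desc(lift)) %>%
  group_by(u_id) %>%
  slice(1:10)

# Rows kept per user
counts <- table(top10$u_id)
counts
```

slice() ignores positions beyond the end of a group, which is why the smaller group comes back unpadded.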

2 dplyr topN way

library(dplyr)
top10_final_subset2 = final_subset %>% group_by(u_id) %>% top_n(10,lift)

With this one, if there are ties in lift, say 15 rows with the same lift for the same u_id, you will get all 15 observations.
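
In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), which makes the tie behaviour explicit through its with_ties argument. A minimal sketch on made-up data (the scores data frame is hypothetical):

```r
library(dplyr)

# Hypothetical data: one user, with three rows tied at lift = 2
scores <- data.frame(u_id = "a", lift = c(3, 2, 2, 2, 1))

# Default with_ties = TRUE keeps every row tied with the n-th value
kept_ties <- scores %>% group_by(u_id) %>% slice_max(lift, n = 2)

# with_ties = FALSE returns exactly n rows
exact_n <- scores %>% group_by(u_id) %>% slice_max(lift, n = 2, with_ties = FALSE)

nrow(kept_ties)
nrow(exact_n)
```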

3 data.table tail way

library(data.table)
final_subset = data.table(final_subset, key = "lift")
top10_final_subset3 = final_subset[, tail(.SD, 10), by = c("u_id")]

This returns the same number of rows as the first way, but some of the rows differ. I guess the two approaches handle ties differently.

4 data.table .SD way

library(data.table)
top10_final_subset4 = final_subset[,.SD[order(lift,decreasing = TRUE),][1:10],by = "u_id"]

This way is the most "uniform": if a group has only 5 observations, the [1:10] index pads it with NA rows to make 10, and if there are ties it still slices and keeps only 10 observations.
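
The padding can be checked on a small made-up table. In this sketch the data, and the names padded and trimmed, are hypothetical; head() is shown as one way to avoid the extra rows.

```r
library(data.table)

# Hypothetical data: group "a" has 4 rows, group "b" has only 2
dt <- data.table(
  u_id = rep(c("a", "b"), times = c(4, 2)),
  lift = c(4, 3, 2, 1, 2, 1)
)

# Indexing with 1:3 always yields 3 rows per group, filling with NA if needed
padded <- dt[, .SD[order(lift, decreasing = TRUE)][1:3], by = u_id]

# head() stops at the end of the group instead
trimmed <- dt[, head(.SD[order(lift, decreasing = TRUE)], 3), by = u_id]

padded
trimmed
```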

If there were a tie at the fourth position for mtcars$mpg then this should return all the ties:

top_mpg <- mtcars[ mtcars$mpg >= mtcars$mpg[order(mtcars$mpg, decreasing=TRUE)][4] , ]

> top_mpg
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

Since there is a tie at positions 3 and 4, you can test it by changing the 4 to a 3, and it still returns 4 rows. This is logical indexing, so you might need to add a clause that removes the NAs or wrap which() around the logical expression. It's not much more difficult to do this "by" cyl:

Reduce(rbind, by(mtcars, mtcars$cyl,
       function(d) d[d$mpg >= d$mpg[order(d$mpg, decreasing = TRUE)][4], ]))
#-------------
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Fiat 128          32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic       30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla    33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa      30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Ferrari Dino      19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Merc 450SE        16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL        17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Pontiac Firebird  19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
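
The remark above about NAs matters when a group has fewer rows than the position being asked for. A small sketch of what happens, using the first 3 rows of mtcars as a stand-in for such a group (the names small, with_na and dropped are illustrative):

```r
# A "group" with fewer than 4 rows
small <- head(mtcars, 3)

# The 4th-largest value does not exist, so the threshold is NA
threshold <- small$mpg[order(small$mpg, decreasing = TRUE)][4]

# Comparing against NA gives an all-NA logical index, hence rows filled with NA
with_na <- small[small$mpg >= threshold, ]

# which() drops the NAs, so no NA rows come back
dropped <- small[which(small$mpg >= threshold), ]

nrow(with_na)
nrow(dropped)
```

which() avoids the NA rows but returns nothing for that group. If the whole group should be returned instead, indexing with min(4, nrow(d)) rather than 4 is one possible fix.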

Incorporating my suggestion to @Ista:

Reduce(rbind,  by(mtcars, mtcars$cyl, function(d) d[ d$mpg <= sort( d$mpg )[3] , ]) )
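
If this is needed for several columns or values of n, the one-liner could be wrapped in a small function. The helper below is hypothetical (its name and arguments are not from any package), and it adds a guard for groups with fewer than n rows:

```r
# Hypothetical helper: the n smallest values of `col` within each level of
# `group`, keeping every row tied with the n-th value
bottom_n_with_ties <- function(df, group, col, n) {
  pieces <- by(df, df[[group]], function(d) {
    k <- min(n, nrow(d))  # guard against groups smaller than n
    d[d[[col]] <= sort(d[[col]])[k], ]
  })
  do.call(rbind, pieces)
}

res <- bottom_n_with_ties(mtcars, "cyl", "mpg", 3)
table(res$cyl)
```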
