Add count of unique / distinct values by group to the original data

Here's a solution with the dplyr package - it has n_distinct() as a wrapper for length(unique()).

df %>%
  group_by(color) %>%
  mutate(unique_types = n_distinct(type))

Using ave (since you ask for it specifically):

within(df, { count <- ave(type, color, FUN=function(x) length(unique(x)))})

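Make sure that type is a character vector and not a factor. Note that ave() returns a vector of the same type as its first argument, so the counts come back as character strings; wrap the call in as.integer() if you need numbers.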
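To make the type issue concrete, here is a minimal base R sketch. The sample df is hypothetical (the original post does not show its data), and the name raw_count is introduced only for illustration.

```r
# Hypothetical sample data; only the column names color and type
# come from the answer above.
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# ave() writes the per-group result back into a copy of its first
# argument, so with a character type column the counts are strings.
raw_count <- ave(df$type, df$color, FUN = function(x) length(unique(x)))

# Convert to integer to get numeric counts.
df$count <- as.integer(raw_count)
```

An alternative that avoids the conversion is to pass a numeric first argument, such as seq_along(df$type), and index type inside FUN.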

Since you also say your data is huge and that speed/performance may therefore be a factor, I'd suggest a data.table solution as well.

require(data.table)
setDT(df)[, count := uniqueN(type), by = color] # v1.9.6+
# if you don't want df to be modified by reference
ans = as.data.table(df)[, count := uniqueN(type), by = color]

uniqueN was implemented in v1.9.6 and is a faster equivalent of length(unique(.)). It also works with data.frames/data.tables, where it counts unique rows.
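As a rough illustration of the two ways uniqueN can be called, the sketch below applies it to a vector and to a whole data.frame. It assumes data.table v1.9.6 or later is installed; the sample df and the names n_types and n_rows are hypothetical.

```r
library(data.table)  # assumes v1.9.6 or later

# Hypothetical sample data, not taken from the original post.
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# On a vector: number of distinct values.
n_types <- uniqueN(df$type)

# On a data.frame: number of distinct rows (color/type combinations).
n_rows <- uniqueN(df)
```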

Other solutions:

Using plyr:

require(plyr)
ddply(df, .(color), mutate, count = length(unique(type)))

Using aggregate:

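# count distinct types per color, then merge the counts back onto df
agg <- aggregate(type ~ color, data=df, FUN=function(x) length(unique(x)))
names(agg)[2] <- "count"  # avoid a type.x/type.y name clash in merge()
merge(df, agg, by="color", all=TRUE)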

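This can also be achieved in a vectorized way, without by-group operations, by combining unique with table or tabulate. Note that unique(df) removes duplicate rows across all columns, so this assumes df contains only the color and type columns.

If df$color is a factor, then use either of the following: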

table(unique(df)$color)[as.character(df$color)]
# black black black green green   red   red  blue  blue  blue 
#    2     2     2     1     1     2     2     3     3     3

tabulate(unique(df)$color)[as.integer(df$color)]
# [1] 2 2 2 1 1 2 2 3 3 3

If df$color is a character vector, then just

table(unique(df)$color)[df$color]

If df$color is an integer then just

tabulate(unique(df)$color)[df$color]
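The steps behind the table variant can be sketched one at a time as follows. The sample df and the intermediate names pairs and per_color are hypothetical, chosen only to show what each step produces; color is assumed to be a character vector here.

```r
# Hypothetical sample data, not taken from the original post.
df <- data.frame(
  color = c("black", "black", "black", "green", "green",
            "red", "red", "blue", "blue", "blue"),
  type  = c("chair", "chair", "sofa", "sofa", "sofa",
            "sofa", "plate", "sofa", "plate", "chair"),
  stringsAsFactors = FALSE
)

# Step 1: keep one row per distinct (color, type) combination.
pairs <- unique(df[c("color", "type")])

# Step 2: count the remaining rows per color; each row is one distinct type.
per_color <- table(pairs$color)

# Step 3: look the counts up by color name to expand them back to all rows.
df$count <- as.integer(per_color[df$color])
```

Selecting the two columns explicitly in step 1 keeps the approach working when df has additional columns.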
