Apply the Herfindahl-Hirschman Index function to a group of rows for an individual in R

Summary: You wrote "percentage" where the column is named "Percentage", although that looks like a slip made while copying your code rather than the real problem. The real problem, as you pointed out, is that hhi() is handed the entire column of Percentage values on every pass through the by-loop, because the call names df rather than the current group. The correct way to refer to a by-constructed subset of data is the .SD (Subset-of-Data) construct.

Here's the MCVE

library(hhi)
 
df <- read.table(text="ID  Percentage
1  50
1  50
2  25
2  20
2  45
2  10", header=TRUE)

library(data.table)

setDT(df)
df[,hhi(df, "percentage"), ID]  # misspelled column name
#------------------
Error in `[.data.frame`(x, i, j) : undefined columns selected
Error in `[.data.frame`(x, i, j) : undefined columns selected
In addition: Warning message:
In hhi(df, "percentage") : shares, "s", do not sum to 100
#-----------------
df[,hhi(df, "Percentage"), ID]  # correct spelling
   ID   V1
1:  1 8150
2:  2 8150
Warning messages:
1: In hhi(df, "Percentage") : shares, "s", do not sum to 100
2: In hhi(df, "Percentage") : shares, "s", do not sum to 100

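That is apparently what you are seeing. It happens because df inside the j expression still refers to the whole table, so the [.data.table function never passes the by-group subset to hhi(). Every group therefore gets the index of the full column (8150), and the shares, which total 200 across both IDs, trigger the "do not sum to 100" warning. To refer to the rows of the current group you need to use .SD, the self-referential Subset-of-Data.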

df[,hhi(.SD, "Percentage"), by=ID]

#-----------
   ID   V1
1:  1 5000
2:  2 3150    # no warnings, more sensible indices of concentration
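To see where the 8150 came from, here is a minimal base-R sketch that computes the index by hand. It assumes hhi() returns the sum of squared shares, which is consistent with every number printed above; the names whole_column and per_group are just for illustration.

```r
# Same toy data as in the MCVE
df <- data.frame(ID = c(1, 1, 2, 2, 2, 2),
                 Percentage = c(50, 50, 25, 20, 45, 10))

# Assumption: the index is the sum of squared shares.
# Squaring the whole column reproduces the value returned for every group
# when the call names df instead of .SD.
whole_column <- sum(df$Percentage^2)
whole_column   # 8150

# Squaring within each ID reproduces the .SD result.
per_group <- tapply(df$Percentage, df$ID, function(s) sum(s^2))
per_group      # ID 1: 5000, ID 2: 3150
```

Note that 8150 is just 5000 + 3150, which is another hint that the first attempt was summing over both individuals at once.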

It's interesting to compare a base version of this operation with the data.table version and another poster's dplyr version. As far as elegance goes, I happen to think the winner is base R, although there is definitely a motivation for learning the somewhat idiosyncratic, and sometimes elegant, syntax of the [.data.table function: its speed and efficiency (lower memory footprint) on large datasets.

lapply( split(df, df$ID), hhi, s="Percentage")
$`1`
[1] 5000

$`2`
[1] 3150
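If you would rather have a named vector than a list, vapply() over the same split is one option. This is only a sketch: hhi_standin is a hypothetical stand-in for hhi::hhi() (assumed to be the sum of squared shares) so that the snippet runs without the package installed; with the package loaded you could presumably pass hhi itself.

```r
df <- data.frame(ID = c(1, 1, 2, 2, 2, 2),
                 Percentage = c(50, 50, 25, 20, 45, 10))

# Hypothetical stand-in for hhi::hhi(): sum of squared shares in column s
hhi_standin <- function(x, s) sum(x[[s]]^2)

# One numeric value per ID, returned as a named vector instead of a list
res <- vapply(split(df, df$ID), hhi_standin, numeric(1), s = "Percentage")
res
```

vapply() also checks that each group yields exactly one numeric value, so a malformed group fails loudly instead of silently changing the shape of the result.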