Spark groupByKey alternative
groupByKey is fine when we want a "smallish" collection of values per key, as in the question.
TL;DR
The "do not use" warning on groupByKey applies to two general cases:
1) You want to aggregate over the values:
- DON'T: rdd.groupByKey().mapValues(_.sum)
- DO: rdd.reduceByKey(_ + _)
In this case, groupByKey will waste resources materializing a collection when what we want is a single element as the answer.
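For the aggregation case, a minimal sketch comparing both forms on a tiny local dataset. The object name, method names, sample data and the local master setting are assumptions made for illustration; only groupByKey, mapValues and reduceByKey come from the answer itself.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SumPerKey {
  // DON'T: builds an Iterable[Int] per key, then sums it.
  def sumWithGroupByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.groupByKey().mapValues(_.sum).collect().toMap

  // DO: adds values pairwise, so no per-key collection is ever built.
  def sumWithReduceByKey(rdd: RDD[(String, Int)]): Map[String, Int] =
    rdd.reduceByKey(_ + _).collect().toMap

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("sum-per-key-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      // Hypothetical sample data.
      val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      println(sumWithGroupByKey(rdd))
      println(sumWithReduceByKey(rdd))
    } finally {
      spark.stop()
    }
  }
}
```

Both methods return the same sums; the difference is in how much data each one has to hold per key while computing them.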
2) You want to group very large collections over low cardinality keys:
- DON'T: allFacebookUsersRDD.map(user => (user.likesCats, user)).groupByKey()
- JUST DON'T
In this case, groupByKey will potentially result in an OOM error.
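What to do instead depends on the goal, which the answer does not spell out. As one possible sketch, assuming a hypothetical User record with a boolean likesCats field: keep each subset as its own distributed RDD with filter, or aggregate straight to a per-key summary. The object, method and field names below are illustrative assumptions.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SplitInsteadOfGroup {
  // Hypothetical record type standing in for the users in the example above.
  case class User(name: String, likesCats: Boolean)

  // Returns (cat lovers, everyone else); each RDD stays spread over its partitions.
  def splitByLikesCats(users: RDD[User]): (RDD[User], RDD[User]) = {
    val catLovers = users.filter(_.likesCats)
    val others = users.filter(u => !u.likesCats)
    (catLovers, others)
  }

  // If only a per-key summary is needed, aggregate instead of collecting values.
  def countByLikesCats(users: RDD[User]): Map[Boolean, Long] =
    users.map(u => (u.likesCats, 1L)).reduceByKey(_ + _).collect().toMap

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("split-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      // Hypothetical sample data.
      val users = spark.sparkContext.parallelize(
        Seq(User("ann", true), User("bob", false), User("cy", true)))
      val (catLovers, others) = splitByLikesCats(users)
      println(catLovers.count())
      println(others.count())
      println(countByLikesCats(users))
    } finally {
      spark.stop()
    }
  }
}
```

Because filter scans the parent RDD once per subset, caching the parent may be worthwhile when it is expensive to recompute.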
groupByKey materializes a collection with all values for the same key in one executor. As mentioned, that collection has to fit in the executor's memory, and therefore other options are better depending on the case.
All the grouping functions, like groupByKey, aggregateByKey and reduceByKey, rely on the same base: combineByKey. Therefore no other alternative will be better for the use case in the question, since they all rely on the same common process.
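To make the shared base concrete, here is a rough sketch of how a sum and a grouping can both be written directly with combineByKey. The object and method names and the sample data are assumptions for illustration, and this is not Spark's internal code: in particular, the built-in groupByKey turns off map-side combining, while the three-argument combineByKey used here leaves it on.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CombineByKeySketch {
  // A per-key sum, similar in effect to reduceByKey(_ + _).
  def sumPerKey(rdd: RDD[(String, Int)]): RDD[(String, Int)] =
    rdd.combineByKey[Int](
      (v: Int) => v,                  // createCombiner: first value seen for a key
      (acc: Int, v: Int) => acc + v,  // mergeValue: fold a value into a partial sum
      (a: Int, b: Int) => a + b       // mergeCombiners: merge partial sums
    )

  // A per-key collection, similar in effect to groupByKey.
  // The combiner is the whole list of values, so it grows with the group.
  def groupPerKey(rdd: RDD[(String, Int)]): RDD[(String, List[Int])] =
    rdd.combineByKey[List[Int]](
      (v: Int) => List(v),
      (acc: List[Int], v: Int) => v :: acc,
      (a: List[Int], b: List[Int]) => a ::: b
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .appName("combine-by-key-sketch")
      .master("local[2]")
      .getOrCreate()
    try {
      // Hypothetical sample data.
      val rdd = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
      sumPerKey(rdd).collect().foreach(println)
      groupPerKey(rdd).collect().foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

The only difference between the two is the combiner type: a single Int for the sum, a growing List[Int] for the grouping. That is why grouping costs memory proportional to the number of values per key, whichever function is used to express it.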