Choose between map+inner loop and flatMapValues+reduceByKey

apache-spark

I have data like the following in a pair RDD, and I would like to collect a map with the username as key and the sum of each list as value. The number of users is very large, say 100M+, and each list has fewer than 1k elements. There are two options I can think of: mapToPair with a simple for loop inside it that sums the list, or flatMapValues on the list to create <user, value> pairs followed by reduceByKey. Which way is better?

Seq(
  ("user1",List(8,2,....)),
  ("user2",List(1,12,.....)),
  ...
  ("userN",List(99,5,...))
)

I would guess rdd.mapValues(_.sum) would be faster, because you iterate over the elements once instead of twice (once to flatten, once to reduce). It also avoids the shuffle that reduceByKey triggers.

But the best answer would be to just test it and see.
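To try both options side by side, something like the following minimal sketch could be a starting point. It assumes Spark is on the classpath and runs in local mode; the object name `SumPerUser`, the helper names, and the three-user sample data are made up for illustration. It also assumes each user appears only once as a key; if a user can appear in several records, the `mapValues` version would still need a `reduceByKey(_ + _)` afterwards.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object SumPerUser {
  // Option 1: sum each list where it already lives.
  // mapValues is a narrow transformation, so no shuffle is involved.
  def sumInPlace(rdd: RDD[(String, List[Int])]): RDD[(String, Int)] =
    rdd.mapValues(_.sum)

  // Option 2: explode every list into (user, value) pairs, then reduce.
  // reduceByKey combines map-side first, but still shuffles by key.
  def sumViaShuffle(rdd: RDD[(String, List[Int])]): RDD[(String, Int)] =
    rdd.flatMapValues(xs => xs).reduceByKey(_ + _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for experimentation only
      .appName("sum-per-user")
      .getOrCreate()

    // Hypothetical sample data standing in for the real pair RDD
    val rdd = spark.sparkContext.parallelize(Seq(
      ("user1", List(8, 2)),
      ("user2", List(1, 12)),
      ("user3", List(99, 5))
    ))

    val a = sumInPlace(rdd).collectAsMap()
    val b = sumViaShuffle(rdd).collectAsMap()
    assert(a == b) // both options should agree on the totals
    println(a)

    spark.stop()
  }
}
```

On real data, timing each helper separately (and checking the stages in the Spark UI) should show whether the extra shuffle in the second option matters for your cluster.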

The best tip I can think of, though, is to work with DataFrames or Datasets (Spark SQL) to begin with. If you end up with a flattened DataFrame, you can call df.groupBy($"user").agg(F.sum($"value")); if you have a DataFrame shaped like the RDD you described, you can just use the aggregate SQL function on the list column.
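A rough sketch of both DataFrame variants follows. It assumes Spark 3.0 or later (where `aggregate` is available in `org.apache.spark.sql.functions`), that `F` is an alias for that package, and that the columns are named `user`, `value`, and `values`; the object name, helper names, and sample rows are illustrative only.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.{functions => F}

object SumPerUserDf {
  // Flattened layout: one (user, value) row per list element.
  def sumFlattened(df: DataFrame): DataFrame =
    df.groupBy(F.col("user")).agg(F.sum(F.col("value")).as("total"))

  // Nested layout: one row per user with an array<int> column "values".
  // aggregate folds the array within each row, so no groupBy is needed.
  def sumNested(df: DataFrame): DataFrame =
    df.select(
      F.col("user"),
      F.aggregate(
        F.col("values"),
        F.lit(0), // initial accumulator; must match the element type
        (acc: Column, x: Column) => acc + x
      ).as("total")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // local mode, for experimentation only
      .appName("sum-per-user-df")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data in the nested shape from the question
    val nested = Seq(
      ("user1", Seq(8, 2)),
      ("user2", Seq(1, 12))
    ).toDF("user", "values")

    // Derive the flattened shape from the nested one for comparison
    val flattened = nested.select($"user", F.explode($"values").as("value"))

    sumFlattened(flattened).show()
    sumNested(nested).show()

    spark.stop()
  }
}
```

The nested variant mirrors `mapValues(_.sum)` in that each row is summed independently, while the flattened variant corresponds to the reduce-by-key approach.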
