Spark unionAll multiple dataframes

Solution 1:

For pyspark you can do the following:

from functools import reduce
from pyspark.sql import DataFrame

dfs = [df1,df2,df3]
df = reduce(DataFrame.unionAll, dfs)

It's also worth noting that the columns must be in the same order in every dataframe for this to work. union and unionAll match columns by position, not by name, so a mismatched column order can silently give unexpected results.

If you are using pyspark 2.3 or greater, you can use unionByName so you don't have to reorder the columns.

Solution 2:

The simplest solution is to reduce with union (unionAll in Spark < 2.0):

val dfs = Seq(df1, df2, df3)
dfs.reduce(_ union _)

This is relatively concise and shouldn't move data from off-heap storage, but it requires non-linear time to perform plan analysis, which can be a problem if you try to merge a large number of DataFrames.

You can also convert to RDDs and use SparkContext.union:

dfs match {
  case h :: Nil => Some(h)
  case h :: _   => Some(h.sqlContext.createDataFrame(
                     h.sqlContext.sparkContext.union(dfs.map(_.rdd)),
                     h.schema
                   ))
  case Nil  => None
}

It keeps the analysis cost low, but otherwise it is less efficient than merging DataFrames directly.
