How to zip two (or more) DataFrames in Spark

The DataFrame API does not support this operation directly. It is possible to zip two RDDs, but for that to work both must have the same number of partitions and the same number of elements in each partition. Assuming this is the case:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, LongType}

val a: DataFrame = sc.parallelize(Seq(
  ("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
val b: DataFrame = sc.parallelize(Seq(Tuple1(1), Tuple1(2))).toDF("column_3")

// Merge rows
val rows = a.rdd.zip(b.rdd).map{
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)}

// Merge schemas
val schema = StructType(a.schema.fields ++ b.schema.fields)

// Create new data frame
val ab: DataFrame = sqlContext.createDataFrame(rows, schema)
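
The partition requirement stated above can be illustrated with a small pure-Python model. This is not Spark code: each "RDD" is modeled as a list of partitions, and the helper name `zip_partitioned` is made up for this sketch. It only shows why equal row counts are not enough when the rows are distributed differently.

```python
# Pure-Python model of partition-wise zipping; this is NOT Spark code.
# Each "RDD" is a list of partitions, and each partition is a list of rows.

def zip_partitioned(left, right):
    """Pair up rows partition by partition (hypothetical helper)."""
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for left_part, right_part in zip(left, right):
        if len(left_part) != len(right_part):
            raise ValueError("different number of elements in a partition")
        # Concatenate the two rows, like rowLeft.toSeq ++ rowRight.toSeq
        zipped.append([l + r for l, r in zip(left_part, right_part)])
    return zipped


a = [[("abc", 123)], [("cde", 23)]]   # two partitions, one row each
b = [[(1,)], [(2,)]]                  # same layout, so zipping works
ab = zip_partitioned(a, b)

b_skewed = [[(1,), (2,)], []]         # same row count, different layout
try:
    zip_partitioned(a, b_skewed)
    failed = False
except ValueError:
    failed = True                     # rejected although the totals match
```

In this model `ab` holds the merged rows per partition, while the skewed layout is rejected even though both sides contain two rows in total.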

If the above conditions are not met, the only option that comes to mind is adding an index and joining on it:

def addIndex(df: DataFrame) = sqlContext.createDataFrame(
  // Add index
  df.rdd.zipWithIndex.map{case (r, i) => Row.fromSeq(r.toSeq :+ i)},
  // Create schema
  StructType(df.schema.fields :+ StructField("_index", LongType, false))
)

// Add indices
val aWithIndex = addIndex(a)
val bWithIndex = addIndex(b)

// Join and clean
val ab = aWithIndex
  .join(bWithIndex, Seq("_index"))
  .drop("_index")
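
The index-and-join idea can be modeled in plain Python to show why it does not depend on the partition layout. This is a sketch, not Spark code, and the helper names `zip_with_index` and `join_on_index` are made up for illustration.

```python
# Pure-Python model of the index-and-join idea; this is NOT Spark code.

def zip_with_index(partitions):
    """Number rows consecutively across partitions (hypothetical helper)."""
    indexed, next_index = [], 0
    for part in partitions:
        for row in part:
            indexed.append((next_index, row))
            next_index += 1
    return indexed


def join_on_index(left, right):
    """Inner join on the index, then drop the index from the result."""
    right_by_index = dict(zip_with_index(right))
    return [row + right_by_index[i]
            for i, row in zip_with_index(left) if i in right_by_index]


a = [[("abc", 123)], [("cde", 23)]]   # two partitions
b = [[(1,), (2,)]]                    # one partition, so a plain zip would not apply
ab = join_on_index(a, b)
```

Because the index counts rows across partition boundaries, the two sides line up even though they are partitioned differently. Unlike this model, a real Spark join does not guarantee the order of the output rows.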

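Spark's Scala DataFrame API has no simple way to concatenate two DataFrames column-wise. One workaround is to add an ID to each row of both DataFrames and then do an inner join on that ID. Be aware that monotonically increasing IDs are only guaranteed to be unique and increasing, not consecutive, and they depend on partitioning, so the IDs of the two DataFrames line up only when both have the same number of partitions and the same number of rows in each partition. This is my stub code for this approach: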

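import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonically_increasing_id

val a: DataFrame = sc.parallelize(Seq(("abc", 123), ("cde", 23))).toDF("column_1", "column_2")
// monotonically_increasing_id() replaces the older monotonicallyIncreasingId
val aWithId: DataFrame = a.withColumn("id", monotonically_increasing_id())

val b: DataFrame = sc.parallelize(Seq(1, 2)).toDF("column_3")
val bWithId: DataFrame = b.withColumn("id", monotonically_increasing_id())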

aWithId.join(bWithId, "id")
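
To see when generated IDs from two DataFrames match, here is a pure-Python model of the ID layout that the Spark documentation describes for the current implementation: the partition index goes in the upper bits and the record number within the partition goes in the lower 33 bits. This is not Spark code, and the helper name `monotonic_ids` is made up for this sketch.

```python
# Pure-Python model of the documented ID layout; this is NOT Spark code.

def monotonic_ids(partitions):
    """Partition index in the upper bits, row offset in the lower 33 bits."""
    return [[(p << 33) + offset for offset in range(len(part))]
            for p, part in enumerate(partitions)]


a = [[("abc", 123)], [("cde", 23)]]   # two partitions, one row each
b = [[(1,), (2,)]]                    # one partition with two rows

ids_a = monotonic_ids(a)              # [[0], [8589934592]]
ids_b = monotonic_ids(b)              # [[0, 1]]

# Only IDs present on both sides survive an inner join on "id".
shared = set(sum(ids_a, [])) & set(sum(ids_b, []))
```

In this model only the ID 0 appears on both sides, so an inner join on the ID would silently drop the second row. With identical partition layouts the ID sets coincide and the join pairs the rows as intended.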

A little light reading - Check out how Python does this!

What about pure SQL?

SELECT
    room_name,
    sender_nickname,
    message_id,
    row_number() OVER (PARTITION BY room_name ORDER BY message_id) AS message_index,
    row_number() OVER (PARTITION BY room_name, sender_nickname ORDER BY message_id) AS user_message_index
FROM messages
ORDER BY room_name, message_id
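
To check what the two row_number() columns compute, the query can be run on a tiny table. The sketch below uses Python's built-in sqlite3 module rather than Spark SQL (window functions need SQLite 3.25 or newer), and the table contents are made up purely for illustration.

```python
import sqlite3

# Illustrative data only; the rows below are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (room_name TEXT, sender_nickname TEXT, message_id INTEGER)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [("lobby", "ann", 1), ("lobby", "bob", 2), ("lobby", "ann", 3), ("dev", "bob", 4)],
)

query = """
SELECT
    room_name,
    sender_nickname,
    message_id,
    row_number() OVER (PARTITION BY room_name ORDER BY message_id) AS message_index,
    row_number() OVER (PARTITION BY room_name, sender_nickname ORDER BY message_id) AS user_message_index
FROM messages
ORDER BY room_name, message_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first index restarts for every room, while the second restarts for every sender within a room. A similar row_number() column computed over a well-defined ordering could serve as the join key for lining up two tables row by row.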

I know the OP was using Scala, but if, like me, you need to do this in PySpark, try the Python code below. Like @zero323's first solution, it relies on RDD.zip() and will therefore fail unless both DataFrames have the same number of partitions and the same number of rows in each partition.

from pyspark.sql import Row
from pyspark.sql.types import StructType

def zipDataFrames(left, right):
    CombinedRow = Row(*left.columns + right.columns)

    def flattenRow(row):
        # Each zipped element is a (leftRow, rightRow) pair
        leftRow, rightRow = row
        combinedVals = [leftRow[col] for col in leftRow.__fields__] + [rightRow[col] for col in rightRow.__fields__]
        return CombinedRow(*combinedVals)

    zippedRdd = left.rdd.zip(right.rdd).map(flattenRow)
    combinedSchema = StructType(left.schema.fields + right.schema.fields)
    return zippedRdd.toDF(combinedSchema)

joined = zipDataFrames(a, b)
