Apache Spark how to append new column from list/array to Spark dataframe

I am using the Apache Spark 2.0 DataFrame/Dataset API and want to add a new column to my DataFrame from a list of values. The list has the same number of values as the DataFrame has rows.

val list = List(4,5,10,7,2)
val df   = List("a","b","c","d","e").toDF("row1")

I would like to do something like:

val appendedDF = df.withColumn("row2",somefunc(list))
appendedDF.show()
// +----+----+
// |row1|row2|
// +----+----+
// |a   |4   |
// |b   |5   |
// |c   |10  |
// |d   |7   |
// |e   |2   |
// +----+----+

I would be grateful for any ideas. In reality, my DataFrame contains more columns.

Solution 1:

You could do it like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._    

// create rdd from the list
val rdd = sc.parallelize(List(4,5,10,7,2))
// rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:28

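// zip the data frame's rows with the rdd and append each value to its row
// (zip requires both RDDs to have the same number of partitions and the
// same number of elements in each partition)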
val rdd_new = df.rdd.zip(rdd).map(r => Row.fromSeq(r._1.toSeq ++ Seq(r._2)))
// rdd_new: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[33] at map at <console>:32

// create a new data frame from the rdd_new with modified schema
spark.createDataFrame(rdd_new, df.schema.add("new_col", IntegerType)).show
+----+-------+
|row1|new_col|
+----+-------+
|   a|      4|
|   b|      5|
|   c|     10|
|   d|      7|
|   e|      2|
+----+-------+
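
If the DataFrame's RDD and the list's RDD are not partitioned identically, zip throws an exception. A variant that avoids this dependency keys both sides by position with zipWithIndex and joins on that key. The following is a minimal sketch, not a tested recipe for large data: it assumes the row order of df.rdd is the order you want to match against the list, and it builds its own local SparkSession so it can run outside the shell (the app name and the names indexedRows, indexedValues, and appended are made up for the example).

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.IntegerType

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("append-column-by-index") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val list = List(4, 5, 10, 7, 2)
val df   = List("a", "b", "c", "d", "e").toDF("row1")

// Key every row and every list value by its position.
val indexedRows = df.rdd.zipWithIndex().map { case (row, idx) => (idx, row) }
val indexedValues = spark.sparkContext.parallelize(list)
  .zipWithIndex()
  .map { case (value, idx) => (idx, value) }

// Join on the position, restore the original order, and append the value to each row.
val joined = indexedRows.join(indexedValues)
  .sortByKey()
  .values
  .map { case (row, value) => Row.fromSeq(row.toSeq :+ value) }

// Extend the existing schema with the new integer column.
val appended = spark.createDataFrame(joined, df.schema.add("row2", IntegerType))
appended.show()
```

The join adds a shuffle, so this is slower than a plain zip, but it does not depend on how either RDD happens to be partitioned.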

Solution 2:

Adding for completeness: the input list lives in driver memory and has the same size as the DataFrame, which suggests that the DataFrame is small to begin with. You might therefore consider collect()-ing it, zipping it with the list, and converting the result back into a DataFrame if needed:

df.collect()
  .map(_.getAs[String]("row1"))
  .zip(list).toList
  .toDF("row1", "row2")

That won't be faster, but if the data is really small the difference might be negligible, and the code is (arguably) clearer.
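
Since the question mentions that the real DataFrame has more columns, the same collect-and-zip idea can be written without naming each column, by appending the value to every collected Row and reusing the existing schema. This is only a sketch under the same small-data assumption: the second column "extra" and its values are invented for illustration, and the order returned by collect() is assumed to be the order the list should line up with.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.IntegerType

// In spark-shell, getOrCreate() simply returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("append-column-on-driver") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val list = List(4, 5, 10, 7, 2)
// "extra" is a made-up second column standing in for the real DataFrame's other columns.
val df = List(("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0), ("e", 5.0))
  .toDF("row1", "extra")

// Bring the rows to the driver, pair them with the list, and append each value.
val rows = df.collect().zip(list).map { case (row, value) =>
  Row.fromSeq(row.toSeq :+ value)
}

// Rebuild a DataFrame from the extended rows and the extended schema.
val appendedLocal = spark.createDataFrame(
  spark.sparkContext.parallelize(rows.toSeq),
  df.schema.add("row2", IntegerType)
)
appendedLocal.show()
```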
