Spark 2.0.x: dump a CSV file from a DataFrame containing a column of type Array[String]
I have a DataFrame df that contains one column of type array. df.show() looks like this:
+--+-------------+---+------+
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 |[A,B,D]      |22 |F     |
|2 |[A,Y]        |42 |M     |
|3 |[X]          |60 |F     |
+--+-------------+---+------+
I am trying to dump that df to a CSV file as follows:
val dumpCSV = df.write.csv(path="/home/me/saveDF")
It does not work because of the column ArrayOfString. I get the error:
CSV data source does not support array<string> data type.
The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!
What would be the best way to dump the DataFrame to a CSV file, including the column ArrayOfString? (ArrayOfString should be dumped as a single column in the CSV file.)
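For reference, here is a minimal sketch that reproduces this setup (assuming a SparkSession named spark with its implicits imported; the values mirror the table above):

import spark.implicits._

// Hypothetical reconstruction of the example DataFrame
val df = Seq(
  (1, Seq("A", "B", "D"), 22, "F"),
  (2, Seq("A", "Y"), 42, "M"),
  (3, Seq("X"), 60, "F")
).toDF("ID", "ArrayOfString", "Age", "Gender")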
Solution 1:
The reason you are getting this error is that the CSV file format doesn't support array types; you'll need to express the array as a string to be able to save it. Try the following:
import org.apache.spark.sql.functions._

// Render the array as a single bracketed string; nulls stay null
val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _    => s"""[${vs.mkString(",")}]"""
})

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
or
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{concat, concat_ws, lit}

// Same result without a UDF, using only built-in functions
def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
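The bracketed string is just a serialization. If you later need the array back after reading the CSV, a sketch along these lines works (assuming no header, so the array column comes back as the plain string _c1):

import org.apache.spark.sql.functions.{col, regexp_replace, split}

// Hypothetical read-back: strip the surrounding brackets, then split
// the remaining comma-separated values to rebuild the array
val restored = spark.read.csv("/home/me/saveDF")
  .withColumn("ArrayOfString", split(regexp_replace(col("_c1"), "^\\[|\\]$", ""), ","))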
Solution 2:
PySpark implementation. In this example, we convert the field column_as_array to column_as_str before saving.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Render the array as a single bracketed string; keep nulls as nulls
def array_to_string(my_list):
    if my_list is None:
        return None
    return '[' + ','.join([str(elem) for elem in my_list]) + ']'

array_to_string_udf = udf(array_to_string, StringType())
df = df.withColumn('column_as_str', array_to_string_udf(df["column_as_array"]))
Then you can drop the old column (array type) before saving.
df.drop("column_as_array").write.csv(...)
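As in the Scala answer, the UDF can be avoided with the built-in concat_ws; here is a minimal sketch producing the same bracketed string (assuming the column names used above):

from pyspark.sql.functions import concat, concat_ws, lit

# Join the array elements with commas and wrap the result in brackets
df = df.withColumn(
    'column_as_str',
    concat(lit('['), concat_ws(',', df['column_as_array']), lit(']'))
)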