How to write the resulting RDD to a CSV file in Spark (Python)
I have a resulting RDD labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions). Its output has this format:
[(0.0, 0.08482142857142858), (0.0, 0.11442786069651742),.....]
What I want is to create a CSV file with one column for labels (the first element of each tuple in the output above) and one for predictions (the second element of each tuple). But I don't know how to write to a CSV file in Spark using Python.
How can I create a CSV file with the above output?
Just map the elements of the RDD (labelsAndPredictions) into strings (the lines of the CSV), then use rdd.saveAsTextFile().
def toCSVLine(data):
    return ','.join(str(d) for d in data)
lines = labelsAndPredictions.map(toCSVLine)
lines.saveAsTextFile('hdfs://my-node:9000/tmp/labels-and-predictions.csv')
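One caveat with the plain ','.join(...) approach: a field that itself contains a comma, a quote, or a newline produces a malformed row. Below is a minimal sketch of a quoting-aware variant built on Python's standard csv module. The name to_csv_line is hypothetical, Python 3 is assumed, and it would be passed to labelsAndPredictions.map(...) in the same way as toCSVLine.

```python
import csv
import io

def to_csv_line(data):
    """Format one tuple as a single CSV row, quoting fields where needed."""
    buf = io.StringIO()
    # Empty line terminator: saveAsTextFile writes each element on its own line.
    csv.writer(buf, lineterminator='').writerow(data)
    return buf.getvalue()

# Plain Python check, no Spark required:
print(to_csv_line((0.0, 0.08482142857142858)))
print(to_csv_line((73342, 'cells, cultured')))
```

For purely numeric tuples like the ones in the question, both versions should give the same lines; the difference only shows up with string fields.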
I know this is an old post, but to help someone searching for the same thing, here's how I write a two-column RDD to a single CSV file in PySpark 1.6.2.
The RDD:
>>> rdd.take(5)
[(73342, u'cells'), (62861, u'cell'), (61714, u'studies'), (61377, u'aim'), (60168, u'clinical')]
Now the code:
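# First, convert the RDD to a DataFrame.
# sqlContext is predefined in the pyspark shell; in a standalone script, create it from sc.
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(rdd, ['count', 'word'])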
The DF:
>>> df.show()
+-----+-----------+
|count|       word|
+-----+-----------+
|73342|      cells|
|62861|       cell|
|61714|    studies|
|61377|        aim|
|60168|   clinical|
|59275|          2|
|59221|          1|
|58274|       data|
|58087|development|
|56579|     cancer|
|50243|    disease|
|49817|   provided|
|49216|   specific|
|48857|     health|
|48536|      study|
|47827|    project|
|45573|description|
|45455|  applicant|
|44739|    program|
|44522|   patients|
+-----+-----------+
only showing top 20 rows
Now write to CSV
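# Write CSV (I have HDFS storage; the file:// prefix targets the local filesystem instead)
# In Spark 1.x, 'com.databricks.spark.csv' is provided by the external spark-csv package.
df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('file:///home/username/csv_out')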
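Note that the path passed to save ends up as a directory holding a part-* file (plus marker files such as _SUCCESS), not as a single .csv file. If the output is on the local filesystem, as with the file:// path above, the part file can be moved to a conventional file name afterwards. Here is a minimal sketch using only the standard library. The function name collect_single_csv and the target path are hypothetical, and it assumes exactly one part file, which is what coalesce(1) should produce.

```python
import glob
import os
import shutil

def collect_single_csv(output_dir, target_path):
    """Move the only part-* file in a Spark output directory to target_path."""
    parts = sorted(glob.glob(os.path.join(output_dir, 'part-*')))
    if len(parts) != 1:
        raise ValueError('expected exactly one part file, found %d' % len(parts))
    shutil.move(parts[0], target_path)
    return target_path

# Hypothetical usage, after the write above has finished:
# collect_single_csv('/home/username/csv_out', '/home/username/counts.csv')
```

Raising an error when there is more than one part file is a deliberate choice here: concatenating several parts would repeat the header row in the middle of the file.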
P.S.: I am just a beginner learning from posts here on Stack Overflow, so I don't know whether this is the best way. But it worked for me and I hope it will help someone!