Spark - Obtaining file names in RDDs
I am trying to process four directories of text files that keep growing every day. If somebody searches for an invoice number, I need to return the list of files that contain it.
I was able to map and reduce the values in the text files by loading them as an RDD. But how can I obtain the file name and other file attributes?
Solution 1:
Since Spark 1.6 you can combine the text data source and the input_file_name function as follows:
Scala:
import org.apache.spark.sql.functions.input_file_name
import spark.implicits._  // needed for $"value" and the tuple encoder

val inputPath: String = ???  // placeholder left as in the original

spark.read.text(inputPath)
  .select(input_file_name, $"value")
  .as[(String, String)]  // optionally convert to a Dataset
  .rdd                   // or to an RDD
Python:
(Versions before 2.0 are buggy and may not preserve the file names when converted to an RDD):
from pyspark.sql.functions import input_file_name

(spark.read.text(input_path)
    .select(input_file_name(), "value")
    .rdd)
This can be used with other input formats as well.
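Applied to the original use case, a minimal PySpark sketch along these lines could collect the names of files containing a given string. The invoice number and directory path below are hypothetical placeholders, not values from the question:

from pyspark.sql.functions import input_file_name

# Hypothetical placeholders for the search term and input directories
invoice_no = "INV-12345"
input_path = "/data/invoices/*"

df = (spark.read.text(input_path)
      .select(input_file_name().alias("file"), "value"))

# Keep only lines containing the invoice number,
# then collect the distinct file names
matching_files = (df.filter(df["value"].contains(invoice_no))
                  .select("file")
                  .distinct()
                  .collect())

Because the search stays inside the DataFrame API, the filter runs in parallel across the cluster and only the matching file names are brought back to the driver.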
Solution 2:
If you are using PySpark, you can try this:
test = sc.wholeTextFiles("pathtofile")
You will get a pair RDD whose first element is the file path and whose second element is the file content.
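For example, assuming some text files exist under that placeholder path, you can inspect the structure of the result like this:

rdd = sc.wholeTextFiles("pathtofile")
first = rdd.take(1)[0]   # a (filepath, content) tuple
print(first[0])          # full path of the file
print(first[1][:100])    # first 100 characters of its content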
Solution 3:
If your text files are small enough, you can use SparkContext.wholeTextFiles, which returns an RDD of (filename, content) pairs.
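Applied to the original question, a hedged sketch using wholeTextFiles could look like this; the invoice number and path are again hypothetical placeholders:

# Hypothetical placeholders
invoice_no = "INV-12345"

files_rdd = sc.wholeTextFiles("/data/invoices/*")

# Keep the path of every file whose content mentions the invoice number
matching_files = (files_rdd
                  .filter(lambda pair: invoice_no in pair[1])
                  .map(lambda pair: pair[0])
                  .collect())

Note that wholeTextFiles loads each file as a single record, so every file must fit in an executor's memory; that is why this approach is only recommended when the files are small.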