Reading Parquet files from multiple directories in PySpark

A little late, but I found this while searching and it may help someone else...

You might also try unpacking the argument list to spark.read.parquet()

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)  # unpacks to spark.read.parquet('foo', 'bar')
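
For a fully self-contained sketch of the same idea, assuming local /tmp paths and toy column names chosen only for this illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write two small Parquet directories to read back (illustrative paths).
spark.createDataFrame([(1, "a")], ["id", "val"]).write.mode("overwrite").parquet("/tmp/demo/foo")
spark.createDataFrame([(2, "b")], ["id", "val"]).write.mode("overwrite").parquet("/tmp/demo/bar")

paths = ["/tmp/demo/foo", "/tmp/demo/bar"]
df = spark.read.parquet(*paths)  # same as spark.read.parquet("/tmp/demo/foo", "/tmp/demo/bar")
df.show()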

This is convenient if you want to pass a few globs into the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

This is cool because you don't need to list all the files under the basePath, and you still get partition inference.
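
To see basePath plus partition inference in action without S3, here is a minimal sketch against a local /tmp directory; the path and column names are assumptions made up for this example. It writes a small table partitioned by two columns, then reads only the April and May partitions back through globs while keeping the partition columns in the schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write a small table partitioned by two columns (illustrative local path).
rows = [(1, "x", "2017-04-01"), (2, "y", "2017-05-01"), (3, "x", "2017-06-01")]
spark.createDataFrame(rows, ["id", "partition_value1", "partition_value2"]) \
    .write.mode("overwrite") \
    .partitionBy("partition_value1", "partition_value2") \
    .parquet("/tmp/demo_base")

# Read back only two months; basePath keeps the partition columns in the schema.
basePath = "/tmp/demo_base"
paths = ["/tmp/demo_base/partition_value1=*/partition_value2=2017-04-*",
         "/tmp/demo_base/partition_value1=*/partition_value2=2017-05-*"]
subset = spark.read.option("basePath", basePath).parquet(*paths)
subset.printSchema()  # partition_value1 and partition_value2 appear as columns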


Both the parquetFile method of SQLContext (deprecated since Spark 1.4) and the parquet method of DataFrameReader accept multiple paths, so either of these works:

df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')

or

df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')

If you have a list of files, you can do:

files = ['file1', 'file2',...]
df = spark.read.parquet(*files)
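
If the list itself has to be built, one option (an assumption for illustration, not part of the original answer) is Python's glob module for local paths, reusing the spark session from above:

import glob

# Collect the individual part files under a local directory (illustrative path).
files = sorted(glob.glob("/tmp/demo/*/part-*.parquet"))
df = spark.read.parquet(*files)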

For ORC:

spark.read.orc("/dir1/*","/dir2/*")

Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.

For Parquet:

spark.read.parquet("/dir1/*","/dir2/*")