Reading DataFrame from partitioned parquet file
How to read partitioned parquet with condition as dataframe,
this works fine,
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=25/*")
Partition is there for day=1 to day=30
is it possible to read something like(day = 5 to 6)
or day=5,day=6
,
val dataframe = sqlContext.read.parquet("file:///home/msoproj/dev_data/dev_output/aln/partitions/data=jDD/year=2015/month=10/day=??/*")
If I put *
it gives me all 30 days data and it too big.
sqlContext.read.parquet
can take multiple paths as input. If you want just day=5
and day=6
, you can simply add two paths like:
val dataframe = sqlContext
.read.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
"file:///your/path/data=jDD/year=2015/month=10/day=6/")
If you have folders under day=X
, like say country=XX
, country
will automatically be added as a column in the dataframe
.
EDIT: As of Spark 1.6 one needs to provide a "basepath"-option in order for Spark to generate columns automatically. In Spark 1.6.x the above would have to be re-written like this to create a dataframe with the columns "data", "year", "month" and "day":
val dataframe = sqlContext
.read
.option("basePath", "file:///your/path/")
.parquet("file:///your/path/data=jDD/year=2015/month=10/day=5/",
"file:///your/path/data=jDD/year=2015/month=10/day=6/")
If you want to read for multiple days, for example day = 5
and day = 6
and want to mention the range in the path itself, wildcards can be used:
val dataframe = sqlContext
.read
.parquet("file:///your/path/data=jDD/year=2015/month=10/day={5,6}/*")
Wildcards can also be used to specify a range of days:
val dataframe = sqlContext
.read
.parquet("file:///your/path/data=jDD/year=2015/month=10/day=[5-10]/*")
This matches all days from 5 to 10.
you need to provide mergeSchema = true
option. like mentioned below (this is from 1.6.0):
val dataframe = sqlContext.read.option("mergeSchema", "true").parquet("file:///your/path/data=jDD")
This will read all the parquet files into dataframe and also creates columns year, month and day in the dataframe data.
Ref: https://spark.apache.org/docs/1.6.0/sql-programming-guide.html#schema-merging