More than one hour to execute pyspark.sql.DataFrame.take(4)
Solution 1:
While Spark supports a limited predicate pushdown over JDBC, all other operations, such as limit, group, and aggregations, are performed internally. Unfortunately this means that take(4) will fetch the data first and then apply the limit. In other words your database will execute (assuming no projections and filters) something equivalent to:
SELECT * FROM table
and the rest will be handled by Spark. There are some optimizations involved (in particular Spark evaluates partitions iteratively to obtain the number of records requested by LIMIT), but it is still quite an inefficient process compared to database-side optimizations.
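For illustration, here is a minimal sketch of the pattern that hits this behavior (the JDBC URL, table name, and credentials are placeholders, not taken from the question):

# Plain JDBC read: Spark can only push down projections and filters here,
# so take(4) still pulls data partition by partition before limiting.
df = (sqlContext.read.format('jdbc')
      .options(url='jdbc:postgresql://host/db', dbtable='some_table',
               user='user', password='password')
      .load())

df.take(4)  # slow: the limit is applied on the Spark side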
If you want to push the limit down to the database, you'll have to do it statically, using a subquery as the dbtable parameter:
(sqlContext.read.format('jdbc')
    .options(url='xxxx', dbtable='(SELECT * FROM xxx LIMIT 4) tmp', ....)
    .load())
sqlContext.read.format("jdbc").options(Map(
"url" -> "xxxx",
"dbtable" -> "(SELECT * FROM xxx LIMIT 4) tmp",
))
Please note that the alias in the subquery is mandatory.
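As a rough usage sketch (the connection options here are again placeholders), calling take(4) on the resulting DataFrame now lets the database evaluate the LIMIT:

df = (sqlContext.read.format('jdbc')
      .options(url='jdbc:postgresql://host/db',
               dbtable='(SELECT * FROM some_table LIMIT 4) tmp',
               user='user', password='password')
      .load())

df.take(4)  # the LIMIT 4 is executed by the database, not by Spark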
Note:
This behavior may be improved in the future, once Data Source API v2 is ready:
- SPARK-15689
- SPIP: Data Source API V2