Apache Spark: Get number of records per partition
Solution 1:
I'd use a built-in function. It should be as efficient as it gets:
import org.apache.spark.sql.functions.spark_partition_id

df.groupBy(spark_partition_id()).count()
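For example, a minimal self-contained sketch (the SparkSession, the example DataFrame, and the column alias are all illustrative assumptions) could look like the following; note that empty partitions do not show up in the result, since they contribute no rows to the group-by:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

val spark = SparkSession.builder().appName("records-per-partition").master("local[*]").getOrCreate()

// Hypothetical example data, just to have something to count.
val df = spark.range(0, 1000).repartition(8).toDF("id")

df.groupBy(spark_partition_id().alias("partition_id"))
  .count()
  .orderBy("partition_id") // group-by output is otherwise unordered
  .show()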
Solution 2:
You can get the number of records per partition like this:
// `toDF` on an RDD needs the implicits of your SparkSession (assumed here to be named `spark`)
import spark.implicits._

df
  .rdd
  .mapPartitionsWithIndex { case (i, rows) => Iterator((i, rows.size)) }
  .toDF("partition_number", "number_of_records")
  .show()
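If you'd rather stay in the Dataset API instead of dropping to the RDD, a roughly equivalent sketch (again assuming `spark.implicits._` is in scope so the tuple encoder is available) can use TaskContext.getPartitionId():

import org.apache.spark.TaskContext
import spark.implicits._ // encoder for the (Int, Int) tuples produced below

df
  .mapPartitions { rows => Iterator((TaskContext.getPartitionId(), rows.size)) }
  .toDF("partition_number", "number_of_records")
  .show()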
But either way this will also launch a Spark job by itself (because Spark must read the data to get the number of records).
Spark can also read Hive table statistics, but I don't know how to display that metadata.
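If the data sits in a Hive-backed table, one way to at least surface the table-level statistics Spark keeps is ANALYZE TABLE plus DESCRIBE EXTENDED, or the statistics attached to the optimized plan. The sketch below rests on assumptions: the table name my_db.my_table is hypothetical, Hive support must be enabled, and the plan-level `stats` accessor applies to recent Spark versions. Also note these are table-level row counts and sizes, not per-partition record counts:

// Gather table-level statistics (row count, size in bytes).
spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS")

// The statistics show up in the "Statistics" row of DESCRIBE EXTENDED.
spark.sql("DESCRIBE EXTENDED my_db.my_table").show(100, false)

// Alternatively, read the statistics the optimizer attaches to the plan.
val stats = spark.table("my_db.my_table").queryExecution.optimizedPlan.stats
println(s"sizeInBytes = ${stats.sizeInBytes}, rowCount = ${stats.rowCount}")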