Default Partitioning Scheme in Spark
Solution 1:
You have to distinguish between two different things:

- partitioning as distributing data between partitions depending on the value of the key. This is limited to `PairwiseRDDs` (`RDD[(T, U)]`) and creates a relationship between a partition and the set of keys which can be found on that partition.
- partitioning as splitting input into multiple partitions, where data is simply divided into chunks of consecutive records to enable distributed computation. The exact logic depends on the specific source, but it is based on either the number of records or the size of a chunk.
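To make the distinction concrete, here is a minimal `spark-shell` sketch (assuming an existing `SparkContext` named `sc`): a plain RDD is merely split into chunks and has no partitioner, while a pair RDD can be explicitly partitioned by key.

```scala
import org.apache.spark.HashPartitioner

// A plain RDD: input is just split into chunks of consecutive records.
val nums = sc.parallelize(1 to 8, numSlices = 2)
println(nums.partitioner)  // None -- no key-based partitioning

// A pair RDD (RDD[(K, V)]) can be partitioned by key,
// creating a partition -> set-of-keys relationship.
val pairs = nums.map(i => (i % 2, i)).partitionBy(new HashPartitioner(2))
println(pairs.partitioner) // Some(org.apache.spark.HashPartitioner@2)
```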
In the case of `parallelize`, data is evenly distributed between partitions using indices. In the case of `HadoopInputFormats` (like `textFile`), it depends on properties like `mapreduce.input.fileinputformat.split.minsize` / `mapreduce.input.fileinputformat.split.maxsize`.
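For example, `glom` (which keeps each partition as a single array) makes the index-based slicing of `parallelize` visible, and the Hadoop split-size properties can be set on `sc.hadoopConfiguration` before calling `textFile`. A rough sketch (the file path is hypothetical, and the exact chunk boundaries are an implementation detail):

```scala
// parallelize slices the input into roughly equal, consecutive chunks:
val rdd = sc.parallelize(1 to 10, numSlices = 4)
rdd.glom().collect().foreach(chunk => println(chunk.mkString(",")))
// 1,2
// 3,4,5
// 6,7
// 8,9,10

// For HadoopInputFormats the split size is driven by the Hadoop configuration;
// e.g. cap splits at 64 MB before reading (path below is hypothetical):
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024)
val lines = sc.textFile("hdfs:///data/example.txt")
```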
So the default partitioning scheme is simply none, because partitioning is not applicable to all RDDs. For operations which require partitioning on a `PairwiseRDD` (`aggregateByKey`, `reduceByKey`, etc.) the default method is hash partitioning.
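This can be checked directly: a freshly created pair RDD has no partitioner, while the output of a shuffle operation like `reduceByKey` carries a `HashPartitioner` (a sketch, again assuming `sc` is a live `SparkContext`):

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
println(pairs.partitioner)   // None -- the default scheme is simply "none"

// reduceByKey needs all values for a key in one partition,
// so it falls back to hash partitioning by default:
val reduced = pairs.reduceByKey(_ + _)
println(reduced.partitioner) // Some(org.apache.spark.HashPartitioner@...)
```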