PySpark RDD | partitionBy method
The partitionBy(~) method re-partitions a pair RDD (an RDD of key-value tuples) into the desired number of partitions.
Parameters

1. numPartitions | int

The desired number of partitions of the resulting RDD.

2. partitionFunc | function | optional

The partitioning function - it takes a key as input and must return its hashed value as an integer. By default, a hash partitioner will be used.

Return Value

A PySpark RDD (pyspark.rdd.RDD).
Repartitioning a pair RDD
Consider the following RDD:
To see how this RDD is partitioned, use the glom() method:
We can indeed see that there are 3 partitions:
To re-partition into 2 partitions:
To see the contents of the new partitions:
We can indeed see that there are 2 partitions:
Notice how the tuples with the key A have ended up in the same partition. This is guaranteed to happen because the hash partitioner performs bucketing based on the tuple key.
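The guarantee follows from how the bucket is computed: the target partition is partitionFunc(key) % numPartitions, so equal keys always map to the same bucket. A minimal pure-Python sketch of that logic (assign_partition is a hypothetical helper; PySpark's actual default partitioner is pyspark.rdd.portable_hash, not the built-in hash):

```python
def assign_partition(key, num_partitions, partition_func=hash):
    # Same rule the hash partitioner applies: hash the key, take the modulo
    return partition_func(key) % num_partitions

pairs = [("A", 1), ("B", 1), ("C", 1), ("A", 2)]
buckets = {}
for key, value in pairs:
    buckets.setdefault(assign_partition(key, 2), []).append((key, value))

# Both ('A', ...) tuples are guaranteed to share a bucket,
# because hash("A") % 2 yields the same value on every call
print(buckets)
```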