PySpark RDD | partitionBy method
PySpark RDD's partitionBy(~) method re-partitions a pair RDD (an RDD of key-value tuples) into the desired number of partitions, bucketing elements by key.
Parameters
1. numPartitions | int
The desired number of partitions of the resulting RDD.
2. partitionFunc | function | optional
The partitioning function: it takes a key as input and returns an integer, and the pair is placed in the partition with index equal to that integer modulo numPartitions. By default, a hash partitioner is used.
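To make the role of partitionFunc concrete, here is a plain-Python sketch of how the key-to-partition mapping works. The partition_by helper below is hypothetical, written only for illustration; it is not part of PySpark's API, but it mirrors the described behavior of hashing the key and taking the result modulo numPartitions.

```python
# Hypothetical helper (not PySpark API) modeling how partitionBy
# applies partitionFunc to assign pairs to partitions.
def partition_by(pairs, num_partitions, partition_func=hash):
    # The integer returned by partition_func is taken modulo num_partitions
    # to pick the destination partition.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions

pairs = [('A', 1), ('B', 1), ('C', 1), ('A', 1)]
# A custom partitionFunc: route key 'C' to partition 1, all other keys to 0
print(partition_by(pairs, 2, lambda key: 1 if key == 'C' else 0))
# → [[('A', 1), ('B', 1), ('A', 1)], [('C', 1)]]
```

Passing a lambda like this to the real partitionBy(~) works the same way: Spark calls it on each key and routes the pair accordingly.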
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Repartitioning a pair RDD
Consider the following RDD:
# Create a pair RDD with 3 partitions
rdd = sc.parallelize([('A', 1), ('B', 1), ('C', 1), ('A', 1)], numSlices=3)
rdd.collect()
[('A', 1), ('B', 1), ('C', 1), ('A', 1)]
To see how this RDD is partitioned, use the glom() method:
rdd.glom().collect()
[[('A', 1), ('B', 1)], [('C', 1)], [('A', 1)]]
We can indeed see that there are 3 partitions:
Partition one: ('A', 1) and ('B', 1)
Partition two: ('C', 1)
Partition three: ('A', 1)
To re-partition into 2 partitions:
new_rdd = rdd.partitionBy(numPartitions=2)
new_rdd.collect()
[('C', 1), ('A', 1), ('B', 1), ('A', 1)]
To see the contents of the new partitions:
new_rdd.glom().collect()
[[('C', 1)], [('A', 1), ('B', 1), ('A', 1)]]
We can indeed see that there are 2 partitions:
Partition one: ('C', 1)
Partition two: ('A', 1), ('B', 1), ('A', 1)
Notice how the tuples with the key 'A' have ended up in the same partition. This is guaranteed to happen because the hash partitioner buckets tuples based on the hash of their key, and equal keys always hash to the same value.
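The same-key guarantee can be sketched in plain Python. Here the built-in hash stands in for Spark's default hash partitioner (an assumption for illustration; Python salts string hashes between runs, but a hash is always stable within one run, which is what the guarantee relies on):

```python
# Sketch: a hash partitioner derives the partition index from the key's hash,
# so equal keys are always assigned to the same partition.
# Python's built-in hash models the default partitionFunc here (an assumption).
def partition_index(key, num_partitions):
    return hash(key) % num_partitions

# Both ('A', 1) tuples receive the same partition index
print(partition_index('A', 2) == partition_index('A', 2))  # → True
```

This is why operations like reduceByKey can rely on partitionBy(~): once pairs are hash-partitioned, all values for a given key live in a single partition.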