PySpark RDD | repartition method
The repartition(~) method splits the RDD into the specified number of partitions.
When we first create an RDD, it is already partitioned under the hood. This method is called repartition(~) (emphasis on the re) because we are changing that existing partitioning.
Parameters: numPartitions (int) - the number of partitions in which to split the RDD.
Return value: a PySpark RDD (pyspark.rdd.RDD).
Re-partitioning an RDD into a certain number of partitions
Consider the following RDD:
rdd = sc.parallelize(["A","B","C","A","A","B"], numSlices=3)
rdd.collect()
['A', 'B', 'C', 'A', 'A', 'B']
Here, we are using the parallelize(~) method to create an RDD with 3 partitions.
We can use the glom() method to see the actual content of the partitions:
To repartition our RDD into 2 partitions:
Notice that after repartitioning our RDD:
- the same values do not necessarily end up in the same partition ('A' can be found in both partitions)
- the number of elements in each partition may not be balanced: here we have 4 elements in the first partition but only 2 in the second.
The repartition(~) method involves shuffling, even when reducing the number of partitions. To avoid shuffling when reducing the number of partitions, use the RDD's coalesce(~) method instead.