PySpark RDD | coalesce method
PySpark RDD's coalesce(~) method returns a new RDD with a reduced number of partitions.
numPartitions: The number of partitions to reduce to.
shuffle (optional): Whether or not to shuffle the data so that records may end up in different partitions. By default, shuffle=False.
Return value: a PySpark RDD.
Consider the following RDD with 3 partitions:
Reducing the number of partitions of an RDD
To reduce the number of partitions to 2:
We can see that the 2nd partition merged with the 3rd partition.
Balanced partitioning of an RDD using shuffle
Instead of merging partitions to reduce the number of partitions, we can also shuffle the data:
As you can see, this results in a more balanced partitioning. The downside is that shuffling is costly when the data is large, since records must be transferred from one worker node to another.