PySpark DataFrame | coalesce method
The coalesce(~) method of a PySpark DataFrame reduces the number of partitions of the DataFrame without shuffling.
Parameters
1. numPartitions | int
The number of partitions to split the PySpark DataFrame's data into.
Return Value
A new PySpark DataFrame.
Consider the following PySpark DataFrame:
The default number of partitions is governed by your PySpark configuration. In my case, the default number of partitions is:
We can see the actual content of each partition of the PySpark DataFrame through its underlying RDD.
We can see that we indeed have 8 partitions, 3 of which contain a row.
Reducing the number of partitions of a PySpark DataFrame without shuffling
To reduce the number of partitions of the DataFrame without shuffling, use the coalesce(~) method:
Here, we can see that we now only have 2 partitions!
Both the repartition(~) and coalesce(~) methods are used to change the number of partitions, but there are some notable differences: