PySpark DataFrame | repartition method
repartition(~) method returns a new PySpark DataFrame with the data split into the specified number of partitions. This method also allows you to partition by column values.
Parameters

1. numPartitions — the number of partitions into which to split the DataFrame.

2. cols — the columns by which to partition the DataFrame.

Return value

A new PySpark DataFrame.
Partitioning a PySpark DataFrame
Consider the following PySpark DataFrame:
By default, the number of partitions depends on the parallelism level of your PySpark configuration:
In our case, the PySpark DataFrame is split into 8 partitions by default.
We can see how the rows of our DataFrame are partitioned using the glom() method of the underlying RDD:
Here, we can see that we indeed have 8 partitions, but only 3 of them contain a Row.
Now, let's repartition our DataFrame such that the Rows are divided into only 2 partitions:
The distribution of the rows in our repartitioned DataFrame is now:
As demonstrated here, there is no guarantee that the rows will be evenly distributed in the partitions.
Partitioning a PySpark DataFrame by column values
Consider the following PySpark DataFrame:
To repartition this PySpark DataFrame by the column name into 2 partitions:
Here, notice how the rows with the same value for the name column ('Alex' in this case) end up in the same partition.
We can also repartition by multiple column values:
Here, we are repartitioning by both the name and age columns into 2 partitions.
We can also use the default number of partitions by specifying column labels only: