# PySpark RDD | repartition method

Last updated: Aug 12, 2023

Tags: PySpark
PySpark RDD's `repartition(~)` method splits the RDD into the specified number of partitions.

NOTE

When we first create an RDD, it is already partitioned under the hood. This method is called `repartition(~)` (emphasis on the `re`) because we are changing the existing partitioning.

# Parameters

1. `numPartitions` | `int`

The number of partitions into which to split the RDD.

# Return Value

A PySpark RDD (`pyspark.rdd.RDD`).

# Examples

## Repartitioning an RDD into a certain number of partitions

Consider the following RDD:

```
rdd = sc.parallelize(["A","B","C","A","A","B"], numSlices=3)
rdd.collect()
['A', 'B', 'C', 'A', 'A', 'B']
```

Here, we are using the `parallelize(~)` method to create an RDD with 3 partitions.

We can use the `glom()` method to see the actual content of the partitions:

```
rdd.glom().collect()
[['A', 'B'], ['C', 'A'], ['A', 'B']]
```

To repartition our RDD into 2 partitions:

```
new_rdd = rdd.repartition(2)
new_rdd.glom().collect()
[['A', 'B', 'A', 'B'], ['C', 'A']]
```

Notice that after repartitioning our RDD:

• the same values do not necessarily end up in the same partition (`'A'` can be found in both partitions).

• the number of elements in each partition is not necessarily balanced. Here, we have 4 elements in the first partition but only 2 elements in the second partition.

WARNING

The `repartition(~)` method involves shuffling, even when reducing the number of partitions. To avoid shuffling when reducing the number of partitions, use RDD's `coalesce(~)` method instead.

