rdd = sc.parallelize(["A","B","C","A","A","B"], numSlices=3)
rdd.collect()

['A', 'B', 'C', 'A', 'A', 'B']

Here, we use the parallelize(~) method to create an RDD with 3 partitions.

We can use the glom() method to see the actual content of the partitions:

rdd.glom().collect()

[['A', 'B'], ['C', 'A'], ['A', 'B']]

To repartition our RDD into 2 partitions:

new_rdd = rdd.repartition(2)
new_rdd.glom().collect()

[['A', 'B', 'A', 'B'], ['C', 'A']]

Notice that after repartitioning:

- the same values do not necessarily end up in the same partition ('A' appears in both partitions)
- the partitions are not necessarily balanced: here the first partition holds 4 elements, while the second holds only 2.

WARNING

The repartition(~) method always involves shuffling, even when reducing the number of partitions. To avoid shuffling when reducing the number of partitions, use RDD's coalesce(~) method instead.

Published by Isshin Inada


Official PySpark Documentation

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.repartition.html
