PySpark RDD | partitionBy method
PySpark RDD's partitionBy(~) method re-partitions a pair RDD (an RDD of key-value tuples) into the desired number of partitions, bucketing elements by key.
Parameters
1. numPartitions | int
The desired number of partitions of the resulting RDD.
2. partitionFunc | function | optional
The partitioning function: it takes a key as input and returns an integer, and the pair is placed in the partition with index equal to that integer modulo numPartitions. By default, a hash partitioner is used.
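To make the role of partitionFunc concrete, here is a plain-Python sketch of how the key-to-partition mapping works. The partition_by helper below is hypothetical, written only for illustration; it is not part of PySpark's API, but it mirrors the described behavior of hashing the key and taking the result modulo numPartitions.

```python
# Hypothetical helper (not PySpark API) modeling how partitionBy
# applies partitionFunc to assign pairs to partitions.
def partition_by(pairs, num_partitions, partition_func=hash):
    # The integer returned by partition_func is taken modulo num_partitions
    # to pick the destination partition.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[partition_func(key) % num_partitions].append((key, value))
    return partitions

pairs = [('A', 1), ('B', 1), ('C', 1), ('A', 1)]
# A custom partitionFunc: route key 'C' to partition 1, all other keys to 0
print(partition_by(pairs, 2, lambda key: 1 if key == 'C' else 0))
# → [[('A', 1), ('B', 1), ('A', 1)], [('C', 1)]]
```

Passing a lambda like this to the real partitionBy(~) works the same way: Spark calls it on each key and routes the pair accordingly.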
Return Value
A PySpark RDD (pyspark.rdd.RDD).
Examples
Repartitioning a pair RDD
Consider the following RDD:
# Create a pair RDD with 3 partitions
rdd = sc.parallelize([('A', 1), ('B', 1), ('C', 1), ('A', 1)], numSlices=3)
rdd.collect()
[('A', 1), ('B', 1), ('C', 1), ('A', 1)]
To see how this RDD is partitioned, use the glom() method:
rdd.glom().collect()
[[('A', 1), ('B', 1)], [('C', 1)], [('A', 1)]]
We can indeed see that there are 3 partitions:
Partition one: ('A', 1) and ('B', 1)
Partition two: ('C', 1)
Partition three: ('A', 1)
To re-partition into 2 partitions:
new_rdd = rdd.partitionBy(numPartitions=2)
new_rdd.collect()
[('C', 1), ('A', 1), ('B', 1), ('A', 1)]
To see the contents of the new partitions:
new_rdd.glom().collect()
[[('C', 1)], [('A', 1), ('B', 1), ('A', 1)]]
We can indeed see that there are 2 partitions:
Partition one: ('C', 1)
Partition two: ('A', 1), ('B', 1), ('A', 1)
Notice how the tuples with the key 'A' have ended up in the same partition. This is guaranteed to happen because the hash partitioner buckets tuples based on the hash of their key, and equal keys always hash to the same value.
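The same-key guarantee can be sketched in plain Python. Here the built-in hash stands in for Spark's default hash partitioner (an assumption for illustration; Python salts string hashes between runs, but a hash is always stable within one run, which is what the guarantee relies on):

```python
# Sketch: a hash partitioner derives the partition index from the key's hash,
# so equal keys are always assigned to the same partition.
# Python's built-in hash models the default partitionFunc here (an assumption).
def partition_index(key, num_partitions):
    return hash(key) % num_partitions

# Both ('A', 1) tuples receive the same partition index
print(partition_index('A', 2) == partition_index('A', 2))  # → True
```

This is why operations like reduceByKey can rely on partitionBy(~): once pairs are hash-partitioned, all values for a given key live in a single partition.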