# sc is assumed to be an existing SparkContext
x = sc.parallelize(range(0, 6), 3)
y = sc.parallelize(range(10, 16), 3)

Here, we are using the parallelize(~) method to create two RDDs, each having 3 partitions.

We can see the actual values in each partition using the glom(~) method:

x.glom().collect()

[[0, 1], [2, 3], [4, 5]]

We see that RDD x indeed has 3 partitions, and we have 2 elements in each partition. The same can be said for RDD y:

y.glom().collect()

[[10, 11], [12, 13], [14, 15]]

We can combine the two RDDs x and y into a single RDD of tuples using the zip(~) method:

zipped_rdd = x.zip(y)
zipped_rdd.collect()

[(0, 10), (1, 11), (2, 12), (3, 13), (4, 14), (5, 15)]

WARNING

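To use the zip(~) method, the two RDDs must have exactly the same number of partitions and exactly the same number of elements in each partition. Having the same total number of elements is not sufficient: if the partition layouts differ, Spark raises an error instead of pairing the elements.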
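
The two constraints in the warning can be illustrated with a small plain-Python model. This is only a sketch of the idea, not Spark's implementation: the helpers split_into_partitions and zip_partitions are hypothetical names introduced here, the even contiguous split is an assumption chosen to reproduce the glom() output shown earlier, and the error messages are made up for illustration (Spark's own messages and the point at which it raises them differ).

```python
def split_into_partitions(values, num_partitions):
    """Hypothetical helper: split values into contiguous, near-even chunks."""
    values = list(values)
    n = len(values)
    # Boundary i marks where partition i starts; the last boundary is n.
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [values[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


def zip_partitions(left, right):
    """Hypothetical helper: pair elements partition by partition."""
    # Constraint 1: same number of partitions.
    if len(left) != len(right):
        raise ValueError("partition counts differ")
    zipped = []
    for i, (lp, rp) in enumerate(zip(left, right)):
        # Constraint 2: same number of elements in each partition.
        if len(lp) != len(rp):
            raise ValueError(f"partition {i} sizes differ: {len(lp)} vs {len(rp)}")
        zipped.append(list(zip(lp, rp)))
    return zipped


x_parts = split_into_partitions(range(0, 6), 3)
y_parts = split_into_partitions(range(10, 16), 3)

# Matching layouts: every element of x has a partner in the same position.
zipped_parts = zip_partitions(x_parts, y_parts)
flat = [pair for part in zipped_parts for pair in part]

# Same 6 elements in y, but spread over 2 partitions instead of 3.
y_two_parts = split_into_partitions(range(10, 16), 2)
try:
    zip_partitions(x_parts, y_two_parts)
    count_error = None
except ValueError as exc:
    count_error = str(exc)

# Same number of partitions (3), but the last partition holds 3 elements.
x_uneven = split_into_partitions(range(0, 7), 3)
try:
    zip_partitions(x_uneven, y_parts)
    size_error = None
except ValueError as exc:
    size_error = str(exc)

print(flat)
print(count_error)
print(size_error)
```

In this model, the first call succeeds because both inputs have 3 partitions of 2 elements each, while the other two calls fail on the partition count and on the per-partition size respectively. This mirrors why zip(~) pairs elements by position within matching partitions rather than by position in the RDD as a whole.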

Published by Isshin Inada


Official PySpark Documentation

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.zip.html
