PySpark RDD | zip method
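PySpark RDD's zip(~) method combines the elements of two RDDs into a single RDD of tuples, pairing the first element of one RDD with the first element of the other, the second with the second, and so on.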
Parameters
other: The other RDD to combine with.
Return Value
A new PySpark RDD.
Combining two PySpark RDDs into a single RDD of tuples
Consider two PySpark RDDs, x and y, each created with the parallelize(~) method and split into 3 partitions.
We can see the actual values in each partition using the glom() method.
We see that RDD x indeed has 3 partitions, with 2 elements in each partition. The same is true of RDD y.
We can combine the two RDDs x and y into a single RDD of tuples using the zip(~) method:
zipped_rdd = x.zip(y)
zipped_rdd.collect()
[(0, 10), (1, 11), (2, 12), (3, 13), (4, 14), (5, 15)]
To use the zip(~) method, the two RDDs must have exactly the same number of partitions and exactly the same number of elements in each partition.