PySpark RDD | zipWithIndex method
PySpark RDD's zipWithIndex(~) method returns an RDD of tuples in which the first element is the value and the second element is its index. Indices follow partition order and then the order of items within each partition, so the first value of the first partition is given an index of 0.
This method does not take in any parameters.
It returns a new PySpark RDD.
Consider the following PySpark RDD with 2 partitions:
Inspecting the content of each partition, we see that we indeed have 2 partitions, with the first partition containing the value 'A' and the second containing the values 'B' and 'C'.
We can create a new RDD of tuples containing positional index information using zipWithIndex(~):
new_rdd = rdd.zipWithIndex()
new_rdd.collect()
[('A', 0), ('B', 1), ('C', 2)]
We see that indices are assigned according to partition order: the first element of the first partition is assigned index 0, and numbering continues through the remaining partitions.
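To make the partition-based numbering concrete, here is a small pure-Python model of it. This is a simplified sketch and not Spark's implementation: the function name zip_with_index is hypothetical, and partitions are represented as plain lists.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Model of partition-ordered indexing; `partitions` is a list of lists."""
    sizes = [len(part) for part in partitions]
    # Each partition starts at the total size of all partitions before it.
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [
        [(value, offset + i) for i, value in enumerate(part)]
        for part, offset in zip(partitions, offsets)
    ]


# Same layout as the example: 'A' in the first partition, 'B' and 'C' in the second.
indexed = zip_with_index([['A'], ['B', 'C']])
print(indexed)  # [[('A', 0)], [('B', 1), ('C', 2)]]
```

In this model, each partition's starting index depends on the sizes of the partitions before it, which is why the numbering follows the partition layout rather than any ordering of the values themselves.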
The zip(~) method combines the elements of two RDDs into a single RDD of tuples.