PySpark RDD | zipWithIndex method
PySpark RDD's zipWithIndex(~) method returns an RDD of tuples in which the first element is the value and the second element is its index. Indexing starts at 0 with the first value of the first partition.
Parameters
This method does not take in any parameters.
Return Value
A new PySpark RDD.
Examples
Consider the following PySpark RDD with 2 partitions:
rdd = sc.parallelize(['A', 'B', 'C'], numSlices=2)
rdd.collect()
['A', 'B', 'C']
We can see the content of each partition using the glom() method:
rdd.glom().collect()
[['A'], ['B', 'C']]
We see that we indeed have 2 partitions, with the first partition containing the value 'A' and the second containing the values 'B' and 'C'.
We can create a new RDD of tuples containing positional index information using zipWithIndex(~):
new_rdd = rdd.zipWithIndex()
new_rdd.collect()
[('A', 0), ('B', 1), ('C', 2)]
We see that indices follow partition order: elements are numbered first by partition and then by their position within each partition, so the first element of the first partition is assigned index 0.
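To make the numbering rule concrete, here is a minimal pure-Python sketch that models it without Spark. The helper zip_with_index is a hypothetical name, and partitions are represented as nested lists shaped like the output of glom(). It illustrates the rule only and is not PySpark's implementation.

```python
from itertools import accumulate


def zip_with_index(partitions):
    """Model zipWithIndex on partitions given as plain Python lists (hypothetical helper)."""
    # Number of elements held by each partition
    sizes = [len(part) for part in partitions]
    # Starting index of each partition: the total size of all partitions before it
    starts = [0] + list(accumulate(sizes))[:-1]
    # Number the elements of each partition, beginning at that partition's start
    return [
        [(value, start + pos) for pos, value in enumerate(part)]
        for part, start in zip(partitions, starts)
    ]


partitions = [['A'], ['B', 'C']]  # same layout as the two-partition example
indexed = zip_with_index(partitions)
print(indexed)  # [[('A', 0)], [('B', 1), ('C', 2)]]
```

Because each partition's starting index depends on the sizes of the partitions before it, the sizes must be known before any element can be numbered. In this model an empty partition simply contributes nothing to the running total.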