PySpark RDD | collect method
PySpark RDD's collect(~) method returns a list containing all the items in the RDD.
This method does not take in any parameters.
It returns a standard Python list.
Converting a PySpark RDD into a list of values
Consider the following RDD:
rdd = sc.parallelize([4, 2, 5, 7])
rdd
ParallelCollectionRDD at readRDDFromInputStream at PythonRDD.scala:413
In this example, the RDD is split into 8 partitions. Depending on your configuration, these 8 partitions can reside on multiple machines (worker nodes). The collect(~) method sends all the data in the RDD to the driver node and packs it into a single list:
rdd = sc.parallelize([4, 2, 5, 7])
rdd.collect()
[4, 2, 5, 7]
All the data from the worker nodes will be sent to the driver node, so make sure the driver node has enough memory to hold it. Otherwise, the driver can run out of memory.