PySpark RDD | collect method
Last updated: Mar 5, 2023
The collect(~) method returns a list containing all the items in the RDD.
Parameters
This method does not take in any parameters.
Return value
A Python standard list.
Converting a PySpark RDD into a list of values
Consider the following RDD:
rdd = sc.parallelize([4,2,5,7])
rdd
ParallelCollectionRDD at readRDDFromInputStream at PythonRDD.scala:413
This RDD is partitioned into 8 subsets:
Depending on your configuration, these 8 partitions can reside on multiple machines (worker nodes). The collect(~) method sends all the data of the RDD to the driver node and packs it into a single list:
rdd = sc.parallelize([4,2,5,7])
rdd.collect()
[4, 2, 5, 7]
All the data from the worker nodes will be sent to the driver node, so make sure that the driver node has enough memory - otherwise you'll end up with an out-of-memory error.
Published by Isshin Inada
Official PySpark Documentation: https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.RDD.collect.html