PySpark RDD | filter method
Last updated: Jun 19, 2022
Tags: PySpark
PySpark RDD's filter(~) method returns a new RDD containing only the items for which the given function returns True.
Parameters
1. f | function
A function that takes an item of the RDD's data as input and returns a boolean, where:
True indicates keeping the item.
False indicates ignoring the item.
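As a rough sketch of the parameter described above, f does not have to be a lambda; a named function works as well. The names is_even and keep_even are hypothetical, and sc is assumed to be an existing SparkContext, such as the one provided by the pyspark shell.

```python
def is_even(x):
    """Hypothetical predicate: True keeps the item, False ignores it."""
    return x % 2 == 0


def keep_even(sc, values):
    """Distribute `values` as an RDD, keep only the even items, and collect them.

    `sc` is assumed to be an already-running SparkContext.
    """
    rdd = sc.parallelize(values)      # build an RDD from a local list
    even_rdd = rdd.filter(is_even)    # pass the named function as f
    return even_rdd.collect()         # bring the kept items back as a list
```

With a live SparkContext, keep_even(sc, [1, 2, 3, 4]) would be expected to return [2, 4].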
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
Examples
Consider the following RDD:
rdd
ParallelCollectionRDD[7] at readRDDFromInputStream at PythonRDD.scala:413
Filtering elements of an RDD
To obtain a new RDD where the values are all strictly larger than 3:
new_rdd = rdd.filter(lambda x: x > 3)
new_rdd.collect()
[4, 5, 7]
Here, the collect() method is used to retrieve the content of the RDD as a single list.
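The fixed threshold in the example above could also be parameterized. The following is a minimal sketch under that assumption; larger_than and filter_larger_than are hypothetical helper names, not part of PySpark, and rdd is assumed to be an existing RDD of numbers.

```python
def larger_than(threshold):
    """Build a predicate that keeps values strictly larger than `threshold`."""
    def predicate(x):
        return x > threshold
    return predicate


def filter_larger_than(rdd, threshold):
    """Filter an existing RDD of numbers and return the kept values as a list."""
    new_rdd = rdd.filter(larger_than(threshold))  # transformation: evaluated lazily
    return new_rdd.collect()                      # action: triggers the computation
```

Because filter(~) is a transformation, Spark does not evaluate the predicate until an action such as collect() is called.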
Published by Isshin Inada
Official PySpark Documentation
https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.RDD.filter.html