 PySpark RDD | filter method
Last updated: Aug 12, 2023
Tags: PySpark
PySpark RDD's filter(~) method returns a new RDD containing only the elements for which the given function returns True.
Parameters
1. f | function
A function that takes an item of the RDD as input and returns a boolean where:
- True indicates that the item should be kept
- False indicates that the item should be dropped.
Return Value
A PySpark RDD (pyspark.rdd.PipelinedRDD).
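To see the parameter and return value in action, here is a minimal sketch; the SparkContext sc, the sample elements and the helper is_positive are assumptions for illustration and do not come from this page:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical sample data, not from this page
nums = sc.parallelize([-2, -1, 0, 1, 2])

# The parameter f: takes one item and returns True to keep it, False to drop it
def is_positive(x):
    return x > 0

kept = nums.filter(is_positive)

print(type(kept))      # a PySpark RDD, e.g. <class 'pyspark.rdd.PipelinedRDD'>
print(kept.collect())  # [1, 2]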
Examples
Consider the following RDD:

rdd

ParallelCollectionRDD[7] at readRDDFromInputStream at PythonRDD.scala:413
Filtering elements of an RDD
To obtain a new RDD where the values are all strictly larger than 3:

new_rdd = rdd.filter(lambda x: x > 3)
new_rdd.collect()

[4, 5, 7]
    Here, the collect() method is used to retrieve the content of the RDD as a single list.
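Note that filter(~) itself is a lazy transformation: it only builds a new RDD, and the predicate does not run until an action such as collect() is called. A minimal sketch, reusing the rdd above with its assumed elements and a hypothetical even-number predicate:

# Lazy: this only defines the filtered RDD; the lambda is not executed yet
evens = rdd.filter(lambda x: x % 2 == 0)
print(type(evens))      # a PySpark RDD, e.g. <class 'pyspark.rdd.PipelinedRDD'>

# collect() is an action, so the filtering actually runs here
print(evens.collect())  # [4, 2] with the assumed elements above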
Published by Isshin Inada
Official PySpark Documentation: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.RDD.filter.html