df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+

Getting rows that contain a substring in PySpark DataFrame

To get rows that contain the substring "le":

from pyspark.sql import functions as F
df.filter(F.col("name").contains("le")).show()

+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+

Here, F.col("name").contains("le") returns a Column object holding booleans where True corresponds to strings that contain the substring "le":

df.select(F.col("name").contains("le")).show()

+------------------+
|contains(name, le)|
+------------------+
|              true|
|             false|
|             false|
+------------------+

In our solution, we use the filter(~) method to extract rows that correspond to True.

Published by Isshin Inada

Official PySpark Documentation

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.contains.html
