Removing rows that contain specific substring in PySpark DataFrame

Last updated: Aug 12, 2023
Tags: PySpark

Consider the following PySpark DataFrame:

df = spark.createDataFrame([['A', 'a'], ['B#A', 'b'], ['C##', 'c']], ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+

Using the contains method to remove rows with certain substrings

To remove rows that contain a specific substring (e.g. '#') in a PySpark DataFrame, use the contains(~) method:

from pyspark.sql import functions as F
df.filter(~F.col('col1').contains('#')).show()
+----+----+
|col1|col2|
+----+----+
|   A|   a|
+----+----+

Here, we first obtain a boolean mask using the expression F.col('col1').contains('#'):

df.select(F.col('col1').contains('#')).show()
+-----------------+
|contains(col1, #)|
+-----------------+
|            false|
|             true|
|             true|
+-----------------+

We then invert the boolean mask using the ~ operator:

df.select(~F.col('col1').contains('#')).show()
+-----------------------+
|(NOT contains(col1, #))|
+-----------------------+
|                   true|
|                  false|
|                  false|
+-----------------------+

Finally, we use the filter(~) method to extract rows that correspond to True in this boolean mask:

df.filter(~F.col('col1').contains('#')).show()
+----+----+
|col1|col2|
+----+----+
|   A|   a|
+----+----+

Using the rlike method to remove rows with values that match some regular expression

Once again, consider the same PySpark DataFrame as above:

df = spark.createDataFrame([['A', 'a'], ['B#A', 'b'], ['C##', 'c']], ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+

To remove rows where some string values match a regular expression, use the rlike(~) method:

df.filter(~F.col('col1').rlike('#$')).show()
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
+----+----+

Here, the rlike(~) method takes as its argument a regular expression (regex). The $ in the regex #$ is a special character that matches the end of the string; that is, #$ matches a # that occurs at the end of the string.

Note that just like the contains(~) method, rlike(~) also returns a boolean mask:

df.select(~F.col('col1').rlike('#$')).show()
+---------------------+
|(NOT RLIKE(col1, #$))|
+---------------------+
|                 true|
|                 true|
|                false|
+---------------------+
NOTE

The rlike(~) method is equivalent to SQL's RLIKE clause.

Using the like method to remove rows that contain string values matching some pattern

Again, consider the same PySpark DataFrame as before:

df = spark.createDataFrame([['A', 'a'], ['B#A', 'b'], ['C##', 'c']], ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
| C##|   c|
+----+----+

We can use the like(~) method to remove rows whose string values match some pattern:

df.filter(~F.col('col1').like('%#')).show()
+----+----+
|col1|col2|
+----+----+
|   A|   a|
| B#A|   b|
+----+----+

Here, the special character % is a wildcard that matches any sequence of characters, including the empty sequence. The pattern %# therefore matches all strings that end with #.

NOTE

The like(~) method is equivalent to SQL's LIKE clause.

Published by Isshin Inada