search
Search
Join our weekly DS/ML newsletter layers DS/ML Guides
menu
menu search toc more_vert
Robocat
Guest 0reps
Thanks for the thanks!
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
help Ask a question
Share on Twitter
search
keyboard_voice
close
Searching Tips
Search for a recipe:
"Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to
A
A
brightness_medium
share
arrow_backShare
Twitter
Facebook

Removing rows that contain specific substring in PySpark DataFrame

Machine Learning
chevron_right
PySpark
chevron_right
Cookbooks
schedule Jul 6, 2022
Last updated
local_offer PySpark
Tags

Consider the following PySpark DataFrame:

df = spark.createDataFrame([['A', 'a'], ['B#A', 'b'], ['C##', 'c']], ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| A| a|
| B#A| b|
| C##| c|
+----+----+

Using the contains method to remove rows with certain substrings

To remove rows that contain specific substring (e.g. '#') in PySpark DataFrame, use the contains(~) method:

from pyspark.sql import functions as F
df.filter(~F.col('col1').contains('#')).show()
+----+----+
|col1|col2|
+----+----+
| A| a|
+----+----+

Here, we are first obtaining a boolean mask using the F.col('col1').contains('#') method:

df.select(F.col('col1').contains('#')).show()
+-----------------+
|contains(col1, #)|
+-----------------+
| false|
| true|
| true|
+-----------------+

We then reverse the boolean using the ~ operator:

df.select(~F.col('col1').contains('#')).show()
+-----------------------+
|(NOT contains(col1, #))|
+-----------------------+
| true|
| false|
| false|
+-----------------------+

Finally, we use the filter(~) method to extract rows that correspond to True in this boolean mask:

df.filter(~F.col('col1').contains('#')).show()
+----+----+
|col1|col2|
+----+----+
| A| a|
+----+----+

Using the rlike method to remove rows with values that match some regular expression

Once again, consider the same PySpark DataFrame as above:

df = spark.createDataFrame([['A', 'a'], ['B#A', 'b'], ['C##', 'c']], ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| A| a|
| B#A| b|
| C##| c|
+----+----+

To remove rows where some string values match a regular expression, use the rlike(~) method:

df.filter(~F.col('col1').rlike('#$')).show()
+----+----+
|col1|col2|
+----+----+
| A| a|
| B#A| b|
+----+----+

Here, the rlike(~) method takes in as argument a regular expression (regex). The $ in regex #$ is a special character that matches the end of the string, that is, #$ matches the character # that occurs at the end of the string.

Note that just like the contains(~) method, rlike(~) also returns a boolean mask:

df.select(~F.col('col1').rlike('#$')).show()
+---------------------+
|(NOT RLIKE(col1, #$))|
+---------------------+
| true|
| true|
| false|
+---------------------+
NOTE

The rlike(~) method is equivalent to SQL's RLIKE clause.

Using the like method to remove rows that contain string values matching some pattern

Again, consider the same PySpark DataFrame as before:

df = spark.createDataFrame([['A', 'a'], ['B#A', 'b'], ['C##', 'c']], ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| A| a|
| B#A| b|
| C##| c|
+----+----+

We could use the like(~) method to remove rows that contain string values matching some patterns:

df.filter(~F.col('col1').like('%#')).show()
+----+----+
|col1|col2|
+----+----+
| A| a|
| B#A| b|
+----+----+

Here, the special character % is a wildcard and matches any character. %# therefore matches all strings that end with #.

NOTE

like(~) method is equivalent to SQL's LIKE clause.

mail
Join our newsletter for updates on new DS/ML comprehensive guides (spam-free)
robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
Ask a question or leave a feedback...