search
Search
Join our weekly DS/ML newsletter layers DS/ML Guides
menu
menu search toc more_vert
Robocat
Guest 0reps
Thanks for the thanks!
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
help Ask a question
Share on Twitter
search
keyboard_voice
close
Searching Tips
Search for a recipe:
"Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to
A
A
brightness_medium
share
arrow_backShare
Twitter
Facebook
0
thumb_down
0
chat_bubble_outline
0
auto_stories new
settings

Removing substring in column values of PySpark DataFrame

Machine Learning
chevron_right
PySpark
chevron_right
Cookbooks
chevron_right
DataFrame Cookbooks
chevron_right
String operations
schedule Jul 1, 2022
Last updated
local_offer PySpark
Tags

Consider the following PySpark DataFrame:

rows = [["Alex", 25], ["Bob", 30]]
df = spark.createDataFrame(rows, ["name", "age"])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

Removing substring using the regexp_replace method

To remove the substring "le" from the name column in our PySpark DataFrame, use the regexp_replace(~) method:

from pyspark.sql import functions as F
df_new = df.withColumn("name", F.regexp_replace("name", "le", ""))
df_new.show()
+----+---+
|name|age|
+----+---+
| Ax| 25|
| Bob| 30|
+----+---+

Here, note the following:

  • we are using the PySpark SQL function regexp_replace(~) to replace the substring "le" with an empty string, which is equivalent to removing the substring "le".

  • the second argument of regexp_replace(~) method is a regular expression, which means that certain regex characters such as [ and ( will be treated differently. For instance, the following will throw an error:

    from pyspark.sql import functions as F
    df_new = df.withColumn("name", F.regexp_replace("name", "[le", ""))
    df_new.show()
    java.util.regex.PatternSyntaxException: Unclosed character class near index 2

    To avoid special treatment of regex characters, escape them using backslash \:

    df_new = df.withColumn("name", F.regexp_replace("name", "\[le", ""))
  • Finally, we use the PySpark DataFrame's withColumn(~) method to return a new DataFrame with the updated name column.

Using a regular expression to drop substrings

The fact that the regexp_replace(~) method allows you to match substrings using regular expression gives you a lot of flexibility in which substrings are to be dropped. For instance, consider the following PySpark DataFrame:

df = spark.createDataFrame([['Alex', 10], ['Mile', 30]], ['name', 'age'])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 10|
|Mile| 30|
+----+---+

To drop the substring 'le' that only occurs at the end of the string:

df.select(F.regexp_replace(df.name, 'le$', '').alias('new_name')).show()
+--------+
|new_name|
+--------+
| Alex|
| Mi|
+--------+

Here, the regular expression character $ matches only trailing occurrences of 'le'.

Removing a list of substrings using regexp_replace method

Again, consider the same PySpark DataFrame as above:

df = spark.createDataFrame([["Alex", 25], ["Bob", 30]], ['name', 'age'])
df.show()
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+

To remove a list of substrings, we can again take advantage of the fact that regexp_replace() uses regular expression to match substrings that will be replaced:

from pyspark.sql import functions as F
substr_to_remove = ["le","B"]
regex = "|".join(substr_to_remove)
df_new = df.withColumn("name", F.regexp_replace("name", regex, ""))
df_new.show()
+----+---+
|name|age|
+----+---+
| Ax| 25|
| ob| 30|
+----+---+

Here, we are constructing a regex string using the OR operator (|):

substr_to_remove = ["le","B"]
regex = "|".join(substr_to_remove)
regex
'le|B'

The regexp_replace(~) method will then replace either the substring "le" or "B" with an empty string:

df_new = df.withColumn("name", F.regexp_replace("name", regex, ""))
df_new.show()
+----+---+
|name|age|
+----+---+
| Ax| 25|
| ob| 30|
+----+---+
mail
Join our newsletter for updates on new DS/ML comprehensive guides (spam-free)
robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
Ask a question or leave a feedback...
0
thumb_down
0
chat_bubble_outline
0
settings
Enjoy our search
Hit / to insta-search docs and recipes!