Trimming specific characters in PySpark DataFrame

Last updated: Jul 5, 2022
Tags: PySpark

To trim specific leading and trailing characters in a PySpark DataFrame column, use the regexp_replace(~) function.

As an example, consider the following PySpark DataFrame:

df = spark.createDataFrame([['##A'],['B##'],['#C#']], ['vals'])
df.show()
+----+
|vals|
+----+
| ##A|
| B##|
| #C#|
+----+

Trimming specific leading characters

To remove the leading # characters, use the regexp_replace(~) function:

from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^#+', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|     B##|
|      C#|
+--------+

The arguments of regexp_replace(~) are as follows (in order):

  • the label of the column on which to perform the replacement

  • the regular expression (regex) that matches the substrings to be replaced

  • the string that replaces each matched substring (an empty string '' removes the match)

In this case, the regex we match is ^#+. The ^ is a special character in regex that anchors the match to the beginning of the string, so only leading characters can match. The + is another special character that matches one or more occurrences of the preceding character (#).

Note that we are using the alias(~) method here to assign a label to the column returned by regexp_replace(~).
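To check the pattern without a Spark session, the same substitution can be sketched with Python's built-in re module. This is only a stand-in: regexp_replace(~) relies on Java's regex engine, but for a simple pattern such as ^#+ the two are assumed to behave the same. The vals list below is a hypothetical copy of the column's values.

```python
import re

# Hypothetical local copy of the values in the 'vals' column
vals = ['##A', 'B##', '#C#']

# ^ anchors the match at the start of the string; #+ consumes one or more '#'
leading_trimmed = [re.sub(r'^#+', '', v) for v in vals]

print(leading_trimmed)  # ['A', 'B##', 'C#']
```

Only the leading # characters are removed; the trailing ones in B## and #C# are left alone because ^ cannot match anywhere but the start.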

Trimming specific trailing characters

Similarly, to remove specific trailing characters, use the regexp_replace(~) function with the regex #+$:

# Replace the substrings matched by the regex #+$
# with an empty string '' in the vals column
df.select(F.regexp_replace('vals', '#+$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|     ##A|
|       B|
|      #C|
+--------+

Here, the $ in #+$ matches the end of the string.

Trimming specific leading and trailing characters

Consider the same PySpark DataFrame as before:

df = spark.createDataFrame([['##A'],['B##'],['#C#']], ['vals'])
df.show()
+----+
|vals|
+----+
| ##A|
| B##|
| #C#|
+----+

Again, to remove both the leading and the trailing characters in one pass, use regexp_replace(~):

from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^#+|#+$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|       B|
|       C|
+--------+

Here, the pipe character | in the regex ^#+|#+$ represents an OR. This means that we match either the leading # characters (^#+) or the trailing # characters (#+$), and every match is replaced.
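The leading, trailing, and combined patterns differ only in their anchors, so they can be generated from the characters to trim. The helper below is a minimal sketch, not part of PySpark: trim_regex and its side parameter are hypothetical names, and it uses Python's re.escape on the assumption that the escaped pattern is also valid for the Java regex engine behind regexp_replace(~).

```python
import re

def trim_regex(chars, side='both'):
    """Return a regex matching leading and/or trailing runs of any character in `chars`."""
    if not chars:
        raise ValueError('chars must be a non-empty string')
    # Escape every character so that regex metacharacters (e.g. '*' or '.') are matched literally
    run = '[' + ''.join(re.escape(c) for c in chars) + ']+'
    leading, trailing = '^' + run, run + '$'
    if side == 'leading':
        return leading
    if side == 'trailing':
        return trailing
    if side == 'both':
        return leading + '|' + trailing
    raise ValueError("side must be 'leading', 'trailing' or 'both'")

# Quick local check with Python's re module (a stand-in for Spark's engine)
print(re.sub(trim_regex('#'), '', '#C#'))             # C
print(re.sub(trim_regex('#', 'leading'), '', '#C#'))  # C#
print(re.sub(trim_regex('#*'), '', '*#A#*'))          # A
```

The returned string could then be passed as the second argument, for example F.regexp_replace('vals', trim_regex('#'), '').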

Trimming specific substrings

Consider the following PySpark DataFrame:

df = spark.createDataFrame([['#@A'],['B#@'],['#C#@D']], ['vals'])
df.show()
+-----+
| vals|
+-----+
|  #@A|
|  B#@|
|#C#@D|
+-----+

To trim a specific leading or trailing substring from a PySpark DataFrame column, again use the regexp_replace(~) function:

from pyspark.sql import functions as F
df.select(F.regexp_replace('vals', '^(#@)|(#@)$', '').alias('new_vals')).show()
+--------+
|new_vals|
+--------+
|       A|
|       B|
|   #C#@D|
+--------+

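Here, the parentheses in our regex ^(#@)|(#@)$ group characters together so that they are treated as a single unit. For instance, the regex ^(#@) matches the leading substring #@. The third value, #C#@D, is left unchanged because #@ appears neither at its start nor at its end.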
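Grouping starts to matter once a quantifier is involved: (#@)+ repeats the whole substring, whereas #@+ repeats only the @. The sketch below illustrates the difference with Python's re module as a stand-in for Spark's Java regex engine; the sample values are hypothetical and not taken from the DataFrame above.

```python
import re

# Hypothetical values in which the substring '#@' is repeated
vals = ['#@#@A', 'B#@#@', '#C#@D']

# (#@)+ matches one or more consecutive occurrences of the whole substring '#@'
grouped = [re.sub(r'^(#@)+|(#@)+$', '', v) for v in vals]
print(grouped)  # ['A', 'B', '#C#@D']

# Without parentheses, + applies only to '@', so '#@@' is matched as a unit
print(re.sub(r'^#@+', '', '#@@A'))     # A
print(re.sub(r'^(#@)+', '', '#@@A'))   # @A
```

The same pattern string, '^(#@)+|(#@)+$', could be passed to F.regexp_replace(~) to strip repeated leading or trailing substrings.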

Published by Isshin Inada