Replacing certain substrings in PySpark DataFrame column
To replace certain substrings in column values of a PySpark DataFrame, use either PySpark SQL Functions' translate(~) method or regexp_replace(~) method.
As an example, consider the following PySpark DataFrame:
+------+
|  name|
+------+
|!A@lex|
|  B#ob|
+------+
Replacing certain characters
Suppose we wanted to make the following character replacements:
'!' replaced by '3'
'@' replaced by '4'
'#' replaced by '5'
We can use the translate(~) method like so:
from pyspark.sql import functions as F
+------+
|   new|
+------+
|3A4lex|
|  B5ob|
+------+
The withColumn(~) here is used to replace the name column with our new column.
Replacing certain substrings
Consider the following PySpark DataFrame:
+-----+
| name|
+-----+
|A@@ex|
| @Bob|
+-----+
To replace certain substrings, use the regexp_replace(~) method:
from pyspark.sql import functions as F
+----+
|name|
+----+
|Alex|
|@Bob|
+----+
Here, note the following:
we are replacing the substring "@@" with the letter "l".
the second argument of regexp_replace(~) is a regular expression. This means that certain characters such as $ and [ carry special meaning. To replace literal substrings, escape special regex characters using a backslash \ (e.g. \[).
Replacing certain substrings using Regex
Consider the following PySpark DataFrame:
+----+
|name|
+----+
|A@ex|
|@Bob|
+----+
To replace @ if it's at the beginning of the string with another string, use regexp_replace(~):
from pyspark.sql import functions as F
+----+
|name|
+----+
|A@ex|
|*Bob|
+----+
Here, the regex ^@ represents @ that is at the start of the string.
Replacing certain substrings in multiple columns
The regexp_replace(~) method operates on one column at a time, so to replace substrings in several columns, apply it to each column in turn.
For example, consider the following PySpark DataFrame:
+---+---+
|  A|  B|
+---+---+
| @a| @b|
| @c| @d|
+---+---+
To replace the substring '@' with '#' for columns A and B:
str_before = '@'
str_after = '#'
+---+---+
|  A|  B|
+---+---+
| #a| #b|
| #c| #d|
+---+---+
Related
translate(~) method replaces the specified characters by the desired characters.
regexp_replace(~) method replaces the matched regular expression with the specified string.