Replacing certain substrings in PySpark DataFrame column

Last updated: Aug 12, 2023

Replacing certain characters

Suppose we want to make the following character replacements:

'!' replaced by '3'
'@' replaced by '4'
'#' replaced by '5'

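We can use the translate(~) method like so:

from pyspark.sql import functions as F
# Sample input inferred from the output shown below
df = spark.createDataFrame([["!A@lex"], ["B#ob"]], ["name"])
df_new = df.withColumn("name", F.translate("name", "!@#", "345"))
df_new.show()

+------+
|  name|
+------+
|3A4lex|
|  B5ob|
+------+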

The withColumn(~) here is used to replace the name column with our new column.
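
The character-by-character mapping can be checked locally without a Spark session. The sketch below uses Python's built-in str.translate as a stand-in for F.translate (an analogy, not Spark's implementation), and the sample strings are assumptions chosen for illustration.

```python
# Plain-Python stand-in for F.translate("name", "!@#", "345").
# Each character of the first string maps to the character at the
# same position in the second string: '!'->'3', '@'->'4', '#'->'5'.
table = str.maketrans("!@#", "345")

# Sample inputs assumed for illustration.
names = ["!A@lex", "B#ob"]
translated = [name.translate(table) for name in names]
print(translated)  # ['3A4lex', 'B5ob']
```

Because the mapping is positional, the order of the characters in the two strings matters: using "543" instead of "345" would map '!' to '5'.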

Replacing certain substrings

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["A@@ex"], ["@Bob"]], ["name"])
df.show()

+-----+
| name|
+-----+
|A@@ex|
| @Bob|
+-----+

To replace certain substrings, use the regexp_replace(~) method:

from pyspark.sql import functions as F
df_new = df.withColumn("name", F.regexp_replace("name", "@@", "l"))
df_new.show()

+----+
|name|
+----+
|Alex|
|@Bob|
+----+

Here, we are replacing the substring "@@" with the letter "l".

NOTE

The second argument of regexp_replace(~) is a regular expression. This means that certain characters such as $ and [ carry special meaning. To replace literal substrings, escape special regex characters with a backslash \ (e.g. \[).
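
One way to avoid escaping by hand is to let Python build the escaped pattern. The sketch below uses re.escape from the standard library and checks the result with Python's own re module. It assumes the escaped pattern is also valid for the Java regex engine that Spark uses, which should hold for backslash-escaped punctuation. The helper name escape_literal is hypothetical.

```python
import re

def escape_literal(text):
    """Return a regex pattern that matches `text` literally."""
    return re.escape(text)

pattern = escape_literal("$[")
print(pattern)  # \$\[

# Local check with Python's re module (a stand-in for Spark's regex engine).
cleaned = re.sub(pattern, "-", "price$[10]")
print(cleaned)  # price-10]
```

The resulting string could then be passed as the pattern argument, for example F.regexp_replace("name", escape_literal("$["), "-").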

Replacing certain substrings using Regex

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["A@ex"], ["@Bob"]], ["name"])
df.show()

+----+
|name|
+----+
|A@ex|
|@Bob|
+----+

To replace @ with another string only when it appears at the beginning of the string, use regexp_replace(~):

from pyspark.sql import functions as F
df_new = df.withColumn("name", F.regexp_replace("name", "^@", "*"))
df_new.show()

+----+
|name|
+----+
|A@ex|
|*Bob|
+----+

Here, the regex ^@ represents @ that is at the start of the string.
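
To see what the anchor changes, the sketch below compares an anchored and an unanchored pattern using Python's re module as a local stand-in. For a simple pattern like ^@, Python's regex engine and the Java engine used by Spark are expected to behave the same way, though that equivalence is an assumption here.

```python
import re

names = ["A@ex", "@Bob"]

# Anchored: only an '@' at the very start of the string is replaced.
anchored = [re.sub(r"^@", "*", name) for name in names]

# Unanchored: every '@' is replaced, wherever it appears.
unanchored = [re.sub(r"@", "*", name) for name in names]

print(anchored)    # ['A@ex', '*Bob']
print(unanchored)  # ['A*ex', '*Bob']
```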

Replacing certain substrings in multiple columns

The regexp_replace(~) method operates on one column at a time, so replacing substrings in several columns requires one call per column.

For example, consider the following PySpark DataFrame:

df = spark.createDataFrame([['@a','@b'], ['@c','@d']], ['A', 'B'])
df.show()

+---+---+
|  A|  B|
+---+---+
| @a| @b|
| @c| @d|
+---+---+

To replace the substring '@' with '#' for columns A and B:

from pyspark.sql import functions as F
str_before = '@'
str_after = '#'
df_new = df.withColumn('A', F.regexp_replace('A', str_before, str_after))
df_new = df_new.withColumn('B', F.regexp_replace('B', str_before, str_after))
df_new.show()

+---+---+
|  A|  B|
+---+---+
| #a| #b|
| #c| #d|
+---+---+
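
With more than a couple of columns, chaining withColumn(~) calls by hand becomes repetitive. The sketch below folds the same call over a list of column names. The helper name replace_in_columns and its make_expr parameter are hypothetical; the only DataFrame method it relies on is withColumn(~).

```python
from functools import reduce

def replace_in_columns(df, columns, make_expr):
    """Apply make_expr(column_name) to each listed column of df.

    df is expected to be a PySpark DataFrame; only its withColumn
    method is used. make_expr builds the new column expression
    from a column name.
    """
    return reduce(
        lambda acc, name: acc.withColumn(name, make_expr(name)),
        columns,
        df,
    )

# Intended usage with the DataFrame above (requires a Spark session):
# from pyspark.sql import functions as F
# df_new = replace_in_columns(df, ["A", "B"], lambda c: F.regexp_replace(c, "@", "#"))
# df_new.show()
```

Passing the expression builder as an argument keeps the helper independent of the column function being applied, so the same loop could be reused with translate(~) or other column functions.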