df = spark.createDataFrame([['Alex', 10], ['Mile', 30]], ['name', 'age'])
df.show()

+----+---+
|name|age|
+----+---+
|Alex| 10|
|Mile| 30|
+----+---+

Replacing a specific substring

To replace the substring 'le' with 'LE', use regexp_replace(~):

from pyspark.sql import functions as F
# Use an alias to assign a new name to the returned column
df.select(F.regexp_replace('name', 'le', 'LE').alias('new_name')).show()

+--------+
|new_name|
+--------+
|    ALEx|
|    MiLE|
+--------+

NOTE

The second argument is a regular expression, so characters such as $ and [ carry special meaning. To match these characters literally, escape them with the \ character (e.g. \$).
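As a rough sketch of the difference escaping makes, the snippet below previews both patterns locally with Python's re module rather than Spark. This is only a stand-in: Spark evaluates patterns with Java's regex engine, which should agree with re for these two simple patterns, and the sample strings are made up for illustration.

```python
import re

# Hypothetical sample values (not from the DataFrame above)
values = ['cost: $5', 'cost: $12']

# Unescaped: '$' is an end-of-string anchor, so 'USD' is appended at the end
unescaped = [re.sub('$', 'USD', v) for v in values]

# Escaped: r'\$' matches a literal dollar sign
escaped = [re.sub(r'\$', 'USD ', v) for v in values]

print(unescaped)  # ['cost: $5USD', 'cost: $12USD']
print(escaped)    # ['cost: USD 5', 'cost: USD 12']
```

In PySpark, the same escaped pattern would be passed as a raw string, for example F.regexp_replace('name', r'\$', 'USD '), so that Python does not interpret the backslash itself.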

Passing in a Column object

Instead of referring to the column by its name, we can also pass in a Column object:

df.select(F.regexp_replace(df.name, 'le', 'LE').alias('new_name')).show()

+--------+
|new_name|
+--------+
|    ALEx|
|    MiLE|
+--------+

Getting a new PySpark DataFrame

To get a new PySpark DataFrame in which the column itself is updated, use the DataFrame's withColumn(~) method. No alias is needed because the first argument already names the resulting column:

df.withColumn('name', F.regexp_replace('name', 'le', 'LE')).show()

+----+---+
|name|age|
+----+---+
|ALEx| 10|
|MiLE| 30|
+----+---+

Replacing a specific substring using a regular expression

To replace 'le' with 'LE' only where it occurs at the end of the string, use regexp_replace(~) with the pattern 'le$':

from pyspark.sql import functions as F
df.select(F.regexp_replace('name', 'le$', 'LE').alias('new_name')).show()

+--------+
|new_name|
+--------+
|    Alex|
|    MiLE|
+--------+

Here, the special regular expression character $ anchors the match to the end of the string, which is why the 'le' in Alex is left unchanged.
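To see the effect of the anchor in isolation, the same two names can be run through Python's re module as a local preview. This is only a stand-in for Spark, which uses Java's regex engine, though the two should behave the same for a pattern this simple.

```python
import re

names = ['Alex', 'Mile']

# Without the anchor, 'le' is replaced wherever it appears
anywhere = [re.sub('le', 'LE', n) for n in names]

# With '$', 'le' is replaced only when it ends the string
at_end = [re.sub('le$', 'LE', n) for n in names]

print(anywhere)  # ['ALEx', 'MiLE']
print(at_end)    # ['Alex', 'MiLE']
```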
