df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()

+-----+---+
| name|age|
+-----+---+
| Alex| 25|
|  Bob| 30|
|Cathy| 40|
+-----+---+

Replacing values for a single column

To replace the value "Alex" with "ALEX" in the name column:

df.replace("Alex", "ALEX", "name").show()

+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  Bob| 30|
|Cathy| 40|
+-----+---+

Note that a new PySpark DataFrame is returned, and the original DataFrame is kept intact.

Replacing multiple values for a single column

To replace the value "Alex" with "ALEX" and "Bob" with "BOB" in the name column:

df.replace(["Alex","Bob"], ["ALEX","BOB"], "name").show()

+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  BOB| 30|
|Cathy| 40|
+-----+---+

Replacing multiple values with a single value

To replace the values "Alex" and "Bob" with "SkyTowner" in the name column:

df.replace(["Alex","Bob"], "SkyTowner", "name").show()

+---------+---+
|     name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
|    Cathy| 40|
+---------+---+

Replacing values in the entire DataFrame

To replace the values "Alex" and "Bob" with "SkyTowner" in the entire DataFrame:

df.replace(["Alex","Bob"], "SkyTowner").show()

+---------+---+
|     name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
|    Cathy| 40|
+---------+---+

Here, we did not specify the subset argument, so the replacement is applied to every column of the DataFrame.

Replacing values using a dictionary

To replace "Alex" with "ALEX" and "Bob" with "BOB" in the name column using a dictionary:

df.replace({
    "Alex": "ALEX",
    "Bob": "BOB",
}, subset=["name"]).show()

+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  BOB| 30|
|Cathy| 40|
+-----+---+

WARNING

Mixed-type replacements are not allowed. For instance, the following is not allowed:

df.replace({
    "Alex": "ALEX",
    30: 99,
}, subset=["name","age"]).show()

ValueError: Mixed type replacements are not supported

Here, we are performing one string replacement and one integer replacement. Because this mixes types, PySpark throws an error. To avoid it, perform the two replacements separately.

Replacing multiple values in multiple columns

Consider the following DataFrame:

df = spark.createDataFrame([["aa", "AA"], ["bb", "BB"]], ["col1", "col2"])
df.show()

+----+----+
|col1|col2|
+----+----+
|  aa|  AA|
|  bb|  BB|
+----+----+

To replace certain values in col1 and col2:

df.replace({
    "AA": "@@@",
    "bb": "###",
}, subset=["col1","col2"]).show()

+----+----+
|col1|col2|
+----+----+
|  aa| @@@|
| ###|  BB|
+----+----+

Published by Isshin Inada


Official PySpark Documentation

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.replace.html
