PySpark DataFrame | replace method
PySpark DataFrame's replace(~) method returns a new DataFrame with certain values replaced. We can also specify which columns to perform replacement in.
Parameters
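1. to_replace | boolean, number, string, list or dict
The value or values to be replaced. If a dict is passed, its keys are the values to replace and its values are the replacements, and the value parameter is ignored.
2. value | boolean, number, string, list or None | optional
The new value to replace to_replace. If a list is passed, it must have the same length and type as to_replace. If a single value is passed while to_replace is a list, that value replaces every item in to_replace.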
3. subset | list | optional
The columns in which to perform the replacement. By default, all columns are checked. Columns whose data type does not match the replacement value are ignored.
Return Value
PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Replacing values for a single column
To replace the value "Alex" with "ALEX" in the name column:
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Note that a new PySpark DataFrame is returned, and the original DataFrame is kept intact.
Replacing multiple values for a single column
To replace the value "Alex" with "ALEX" and "Bob" with "BOB" in the name column:
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  BOB| 30|
|Cathy| 40|
+-----+---+
Replacing multiple values with a single value
To replace the values "Alex" and "Bob" with "SkyTowner" in the name column:
+---------+---+
|     name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
|    Cathy| 40|
+---------+---+
Replacing values in the entire DataFrame
To replace the values "Alex" and "Bob" with "SkyTowner" in the entire DataFrame:
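+---------+---+
|     name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
|    Cathy| 40|
+---------+---+
Here, notice how we did not specify the subset parameter. Because the replacement is a string, only string columns are affected; the integer age column is left untouched.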
Replacing values using a dictionary
To replace "Alex" with "ALEX" and "Bob" with "BOB" in the name column using a dictionary:
Mixed-type replacements are not allowed. For instance, the following is not allowed:
df.replace({"Alex": "ALEX", 30: 99}, subset=["name", "age"]).show()
ValueError: Mixed type replacements are not supported
Here, we are performing one string replacement and one integer replacement. Since this is a mixed-type replacement, PySpark raises an error. To avoid this error, perform the two replacements individually.
Replacing multiple values in multiple columns
Consider the following DataFrame:
+----+----+
|col1|col2|
+----+----+
|  aa|  AA|
|  bb|  BB|
+----+----+
To replace certain values in col1 and col2:
+----+----+
|col1|col2|
+----+----+
|  aa| @@@|
| ###|  BB|
+----+----+