PySpark DataFrame | replace method

Last updated: Aug 12, 2023
Tags: PySpark

PySpark DataFrame's replace(~) method returns a new DataFrame with certain values replaced. We can also specify which columns to perform replacement in.

Parameters

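1. to_replace | boolean, number, string, list or dict

The value or values to be replaced. If a dictionary is passed, each key is a value to be replaced and each dictionary value is its replacement.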

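2. value | boolean, number, string, list or None | optional

The new value to replace to_replace. This can be omitted only when to_replace is a dictionary, since the dictionary already supplies the new values.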

3. subset | string, list or tuple | optional

The column or columns to perform the replacement in. By default, all columns will be checked for replacement.

Return Value

PySpark DataFrame.

Examples

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["Alex", 25], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
|  Bob| 30|
|Cathy| 40|
+-----+---+

Replacing values for a single column

To replace the value "Alex" with "ALEX" in the name column:

df.replace("Alex", "ALEX", "name").show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  Bob| 30|
|Cathy| 40|
+-----+---+

Note that a new PySpark DataFrame is returned, and the original DataFrame is kept intact.

Replacing multiple values for a single column

To replace the value "Alex" with "ALEX" and "Bob" with "BOB" in the name column:

df.replace(["Alex","Bob"], ["ALEX","BOB"], "name").show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  BOB| 30|
|Cathy| 40|
+-----+---+

Replacing multiple values with a single value

To replace the values "Alex" and "Bob" with "SkyTowner" in the name column:

df.replace(["Alex","Bob"], "SkyTowner", "name").show()
+---------+---+
|     name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
|    Cathy| 40|
+---------+---+

Replacing values in the entire DataFrame

To replace the values "Alex" and "Bob" with "SkyTowner" in the entire DataFrame:

df.replace(["Alex","Bob"], "SkyTowner").show()
+---------+---+
|     name|age|
+---------+---+
|SkyTowner| 25|
|SkyTowner| 30|
|    Cathy| 40|
+---------+---+

Here, notice how we did not specify the subset option.

Replacing values using a dictionary

To replace "Alex" with "ALEX" and "Bob" with "BOB" in the name column using a dictionary:

df.replace({
    "Alex": "ALEX",
    "Bob": "BOB",
}, subset=["name"]).show()
+-----+---+
| name|age|
+-----+---+
| ALEX| 25|
|  BOB| 30|
|Cathy| 40|
+-----+---+
WARNING

Mixed-type replacements are not allowed. For instance, the following is not allowed:

df.replace({
    "Alex": "ALEX",
    30: 99,
}, subset=["name","age"]).show()
ValueError: Mixed type replacements are not supported

Here, we are performing one string replacement and one integer replacement in the same call. Since this is a mixed-type replacement, PySpark throws an error. To avoid this error, perform the two replacements individually.

Replacing multiple values in multiple columns

Consider the following DataFrame:

df = spark.createDataFrame([["aa", "AA"], ["bb", "BB"]], ["col1", "col2"])
df.show()
+----+----+
|col1|col2|
+----+----+
|  aa|  AA|
|  bb|  BB|
+----+----+

To replace "AA" with "@@@" and "bb" with "###" wherever they appear in col1 or col2:

df.replace({
    "AA": "@@@",
    "bb": "###",
}, subset=["col1","col2"]).show()
+----+----+
|col1|col2|
+----+----+
|  aa| @@@|
| ###|  BB|
+----+----+
Published by Isshin Inada