
PySpark DataFrame | dropna method

Last updated: Jul 1, 2022

PySpark DataFrame's dropna(~) method removes rows with missing values.

Parameters

1. how | string | optional

  • If 'any', then drop rows that contain at least one null value.

  • If 'all', then drop rows only if all of their values are null.

By default, how='any'.

2. thresh | int | optional

Drop rows that have fewer than thresh non-null values. If specified, thresh overrides the how parameter.

3. subset | string or tuple or list | optional

The columns to check for null values. By default, all columns will be checked.
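The way the three parameters interact can be sketched in plain Python. This is only a model of the rules described above, not PySpark code: the function name dropna_model and the list-of-tuples row representation are hypothetical, and the sketch assumes that how='any' behaves like a thresh equal to the number of checked columns while how='all' behaves like thresh=1.

```python
def dropna_model(rows, columns, how="any", thresh=None, subset=None):
    """Plain-Python model of the dropna parameter rules (hypothetical helper)."""
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    # subset defaults to every column; a single name is treated as a one-item list
    if subset is None:
        subset = list(columns)
    elif isinstance(subset, str):
        subset = [subset]
    # thresh, when given, takes precedence over how
    if thresh is None:
        thresh = len(subset) if how == "any" else 1
    idx = [columns.index(c) for c in subset]
    # keep a row only if it has at least thresh non-null values in the checked columns
    return [row for row in rows
            if sum(row[i] is not None for i in idx) >= thresh]

columns = ["name", "age"]
rows = [("Alex", 20), (None, None), ("Cathy", None)]

print(dropna_model(rows, columns))                         # [('Alex', 20)]
print(dropna_model(rows, columns, how="all"))              # [('Alex', 20), ('Cathy', None)]
print(dropna_model(rows, columns, how="all", thresh=2))    # [('Alex', 20)]
print(dropna_model(rows, columns, subset="name"))          # [('Alex', 20), ('Cathy', None)]
```

The third call shows thresh taking precedence: even with how='all', a row needs two non-null values to be kept.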

Return Value

A PySpark DataFrame.

Examples

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["Alex", 20], [None, None], ["Cathy", None]], ["name", "age"])
df.show()
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
| null|null|
|Cathy|null|
+-----+----+

Dropping rows with at least one missing value in PySpark DataFrame

To drop rows with at least one missing value:

df.dropna().show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+

Dropping rows with fewer than n non-missing values in PySpark DataFrame

To drop rows with fewer than 2 non-missing values, that is, to keep only rows with at least 2 non-missing values:

n_non_missing_vals = 2
df.dropna(thresh=n_non_missing_vals).show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+

Dropping rows with at least n missing values in PySpark DataFrame

To drop rows with at least 2 missing values:

n_missing_vals = 2
df.dropna(thresh=len(df.columns)-n_missing_vals+1).show()
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
|Cathy|null|
+-----+----+
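The expression len(df.columns)-n_missing_vals+1 works because a row with at least n missing values has at most len(df.columns)-n non-missing values, which is the same as having fewer than len(df.columns)-n+1. The arithmetic can be checked with a small plain-Python sketch; the helper name thresh_for_min_missing is hypothetical and is not part of PySpark.

```python
def thresh_for_min_missing(n_columns: int, n_missing: int) -> int:
    """thresh value that drops rows with at least n_missing nulls (hypothetical helper)."""
    return n_columns - n_missing + 1

n_columns = 2   # the example DataFrame has two columns: name and age
n_missing = 2
thresh = thresh_for_min_missing(n_columns, n_missing)
print(thresh)   # 1

# Check every possible non-null count for a row with n_columns columns
for non_null in range(n_columns + 1):
    missing = n_columns - non_null
    dropped_by_thresh = non_null < thresh        # the rule applied by thresh
    assert dropped_by_thresh == (missing >= n_missing)
```

With thresh=1, only the row where both name and age are null is dropped, which matches the output above.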

Dropping rows with all missing values in PySpark DataFrame

To drop rows with all missing values:

df.dropna(how='all').show()
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
|Cathy|null|
+-----+----+

Dropping rows where a certain value is missing in PySpark DataFrame

To drop rows where the value for age is missing:

df.dropna(subset='age').show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+

Dropping rows where certain values are missing (either) in PySpark DataFrame

To drop rows where either the name or age column value is missing:

df.dropna(subset=['name','age'], how='any').show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+

Dropping rows where certain values are missing (all) in PySpark DataFrame

To drop rows where the name and age column values are both missing:

df.dropna(subset=['name','age'], how='all').show()
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
|Cathy|null|
+-----+----+
Published by Isshin Inada