PySpark DataFrame | dropna method
PySpark DataFrame's dropna(~) method removes rows with missing values.
Parameters
1. how | string | optional
If 'any', then drop rows that contain at least one null value.
If 'all', then drop rows in which every value is null.
By default, how='any'.
2. thresh | int | optional
Drop rows that have fewer than thresh non-null values. Note that this overrides the how parameter.
3. subset | string or tuple or list | optional
The columns to check for null values. By default, all columns will be checked.
Return Value
A PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
| null|null|
|Cathy|null|
+-----+----+
Dropping rows with at least one missing value in PySpark DataFrame
To drop rows with at least one missing value:
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
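Dropping rows with fewer than n non-missing values in PySpark DataFrame
To drop rows with fewer than 2 non-missing values, pass the count as thresh (df is assumed to refer to the DataFrame above):
n_non_missing_vals = 2
df.dropna(thresh=n_non_missing_vals).show()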
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows with at least n missing values in PySpark DataFrame
To drop rows with at least 2 missing values:
Dropping rows with all missing values in PySpark DataFrame
To drop rows with all missing values:
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
|Cathy|null|
+-----+----+
Dropping rows where a certain value is missing in PySpark DataFrame
To drop rows where the value for age is missing:
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows where certain values are missing (either) in PySpark DataFrame
To drop rows where either the name or age column value is missing:
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Dropping rows where certain values are missing (all) in PySpark DataFrame
To drop rows where the name and age column values are both missing:
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
|Cathy|null|
+-----+----+