PySpark DataFrame | where method
PySpark DataFrame's where(~) method returns the rows of the DataFrame that satisfy the given condition.
The where(~) method is an alias for the filter(~) method.
Parameters
1. condition | Column or string
A boolean mask (Column) or a SQL string expression.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
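Here is a minimal sketch to construct this DataFrame (assuming an active SparkSession named spark):

df = spark.createDataFrame([["Alex", 20], ["Bob", 30], ["Cathy", 40]], ["name", "age"])
df.show()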
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Basic usage
To get rows where age is greater than 25:
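One way is to pass a SQL string expression (using the df defined above):

df.where("age > 25").show()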
+-----+---+
| name|age|
+-----+---+
|  Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can pass a Column object that represents a boolean mask:
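For example:

df.where(df.age > 25).show()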
+-----+---+
| name|age|
+-----+---+
|  Bob| 30|
|Cathy| 40|
+-----+---+
Equivalently, we can use the col(~) function from pyspark.sql.functions to refer to the column:
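from pyspark.sql.functions import col

df.where(col("age") > 25).show()

This returns the same two rows as above.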
Compound queries
The where(~) method supports compound conditions via the AND (&) and OR (|) operators like so:
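For instance (a sketch; the upper bound of 40 is chosen here only to produce the result below):

# parentheses around each comparison are required because & binds more tightly
df.where((df.age > 25) & (df.age < 40)).show()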
+----+---+
|name|age|
+----+---+
| Bob| 30|
+----+---+
Dealing with null values
Consider the following PySpark DataFrame:
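Again, a minimal sketch to construct this DataFrame (assuming the same spark session):

df = spark.createDataFrame([["Alex", 20], [None, None], ["Cathy", None]], ["name", "age"])
df.show()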
+-----+----+
| name| age|
+-----+----+
| Alex|  20|
| null|null|
|Cathy|null|
+-----+----+
Let's query for rows where age != 10 like so:
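# null ages do not satisfy the condition and are dropped
df.where(df.age != 10).show()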
+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+
Notice how only Alex's row is returned even though the other two rows technically do not have age equal to 10. This happens because PySpark's where(~) method filters out null values by default: comparing null against 10 evaluates to null rather than True, so those rows are dropped.
To prevent rows with null values from getting filtered out, we can perform the query like so:
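One option is a sketch using col(~) and isNull():

from pyspark.sql.functions import col

# keep rows where age != 10 OR age is null
df.where((col("age") != 10) | col("age").isNull()).show()

This should return all three rows, since rows with a null age are now retained explicitly.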
Note that PySpark's treatment of null values differs from Pandas, which retains rows with missing values, as demonstrated below:
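A sketch of the equivalent Pandas query (the DataFrame contents here are assumed from the output that follows):

import pandas as pd

df = pd.DataFrame({"col": ["a", "b", None]})
df[df["col"] != "a"]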
    col
1     b
2  None
Notice how the row with col=None is not left out!