Checking if value exists in PySpark DataFrame column
Start your free 7-days trial now!
Consider the following PySpark DataFrame:
+----+|vals|+----+| A|| A|| B|+----+
Checking if value exists using selectExpr method
To check if value exists in PySpark DataFrame column, use the selectExpr(~) method like so:
from pyspark.sql import functions as F
+---------------+|any((vals = A))|+---------------+| true|+---------------+
The selectExpr(~) takes in as argument a SQL expression, and returns a PySpark DataFrame. Here, the SQL expression uses the any(~) method which returns a True when the specified condition (vals == "A" in this case) is satisfied for at least one row and False otherwise.
The logic is similar to Pandas' any(~) method - you can think of vals == "A" returning a boolean mask, and the method any(~) returning True if there exists at least one True in the mask.
The PySpark DataFrame's selectExpr(~) can be rewritten using PySpark SQL Functions' expr(~) method:
We recommend using selectExpr(~) whenever possible because this saves you from having to import the pyspark.sql.functions library, and the syntax is shorter.
Getting a boolean instead of PySpark DataFrame
In the above solution, the output was a PySpark DataFrame. To extract the result as a boolean indicating whether a value exists or not:
True
Here, selectExpr(~) returns a PySpark DataFrame. We then call the collect(~) method which converts the rows of the DataFrame into a list of Row objects in the driver node:
[Row(any((vals = A))=True)]
We then access the Row object in the list using [0], and then access the value of the Row using another [0] to obtain the boolean value.
Checking if values exist given a list
To check if values exist in a PySpark Column given a list:
from pyspark.sql import functions as Flist_vals = ['A','D']
+---------------------+|any((vals IN (A, D)))|+---------------------+| true|+---------------------+
Note the following:
we are checking whether any value in the
valscolumn is equal to'A'or'D'- we have the value'A'in the column and so the result is aTrue.we convert the list into a string tuple (
"('A', 'B')") to align with the SQL syntax usingstr(tuple(~))
Checking if values exist using a OR query
The fact that selectExpr(~) accepts a SQL expression means that we can check for the existence of values flexibly. To check if values exist using an OR operator:
+-----------+|bool_exists|+-----------+| true|+-----------+
Here, note the following:
we are checking whether the value
BorCexists in thevalscolumn.we assign the label to the column returned by the SQL expression using the alias clause
AS.
Checking if values exist using a AND query
To check if all the given values exist in a PySpark Column:
+-----------+|bool_exists|+-----------+| true|+-----------+
Here, we are checking whether both the values A and B exist in the PySpark column.
Related
selectExpr(~) method returns a new DataFrame based on the specified SQL expression.expr(~) method parses the given SQL expression and returns a PySpark Column.