Checking if value exists in PySpark DataFrame column

Last updated: Jul 2, 2022
Tags: PySpark

Consider the following PySpark DataFrame:

df = spark.createDataFrame([['A'],['A'],['B']], ['vals'])
df.show()
+----+
|vals|
+----+
| A|
| A|
| B|
+----+

Checking if a value exists using the selectExpr method

To check if a value exists in a PySpark DataFrame column, use the selectExpr(~) method like so:

df.selectExpr('any(vals == "A")').show()
+---------------+
|any((vals = A))|
+---------------+
| true|
+---------------+

The selectExpr(~) method takes a SQL expression as its argument and returns a PySpark DataFrame. Here, the SQL expression uses the any(~) function, which returns True when the specified condition (vals == "A" in this case) is satisfied for at least one row, and False otherwise.

The logic is similar to Pandas' any(~) method - you can think of vals == "A" returning a boolean mask, and the method any(~) returning True if there exists at least one True in the mask.
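For comparison, here is a minimal sketch of that same mask-then-any logic in plain Pandas, using a Series with the same values as the vals column:

```python
import pandas as pd

s = pd.Series(['A', 'A', 'B'])
mask = (s == 'A')    # boolean mask: [True, True, False]
print(mask.any())    # True - at least one True in the mask
```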

NOTE

The PySpark DataFrame's selectExpr(~) can be rewritten using PySpark SQL Functions' expr(~) method:

from pyspark.sql import functions as F
df.select(F.expr('any(vals == "A")')).show()
+---------------+
|any((vals = A))|
+---------------+
| true|
+---------------+

We recommend using selectExpr(~) whenever possible because it saves you from having to import the pyspark.sql.functions library, and the syntax is shorter.

Getting a boolean instead of PySpark DataFrame

In the above solution, the output was a PySpark DataFrame. To extract the result as a boolean indicating whether a value exists or not:

df.selectExpr('any(vals == "A")').collect()[0][0]
True

Here, selectExpr(~) returns a PySpark DataFrame. We then call the collect(~) method, which sends the rows of the DataFrame to the driver node as a list of Row objects:

df.selectExpr('any(vals == "A")').collect()
[Row(any((vals = A))=True)]

We then access the Row object in the list using [0], and then access the value of the Row using another [0] to obtain the boolean value.

Checking if values exist given a list

To check if values exist in a PySpark Column given a list:

from pyspark.sql import functions as F
list_vals = ['A','D']
df.selectExpr(f'any(vals IN {str(tuple(list_vals))})').show()
+---------------------+
|any((vals IN (A, D)))|
+---------------------+
| true|
+---------------------+

Note the following:

  • we are checking whether any value in the vals column is equal to 'A' or 'D' - since the value 'A' appears in the column, the result is True.

  • we convert the list into a string tuple ("('A', 'D')") to align with the SQL syntax using str(tuple(~)).
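One caveat with the str(tuple(~)) trick: for a single-element list, Python's tuple syntax adds a trailing comma, which is not valid SQL. Building the IN list with a join avoids this; here is a sketch in plain Python (the quoting assumes string values):

```python
list_vals = ['A']
print(str(tuple(list_vals)))  # ('A',) - the trailing comma breaks the SQL IN clause

# Joining the quoted values manually works for any list length:
in_list = ', '.join(f'"{v}"' for v in list_vals)
print(f'any(vals IN ({in_list}))')  # any(vals IN ("A"))
```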

Checking if values exist using an OR query

The fact that selectExpr(~) accepts a SQL expression means that we can check for the existence of values flexibly. To check if values exist using an OR operator:

df.selectExpr('any(vals == "B" OR vals == "C") AS bool_exists').show()
+-----------+
|bool_exists|
+-----------+
| true|
+-----------+

Here, note the following:

  • we are checking whether the value B or C exists in the vals column.

  • we assign a label to the column returned by the SQL expression using the alias clause AS.

Checking if values exist using an AND query

To check if all the given values exist in a PySpark Column:

df.selectExpr('any(vals == "A") AND any(vals == "B") AS bool_exists').show()
+-----------+
|bool_exists|
+-----------+
| true|
+-----------+

Here, we are checking whether both the values A and B exist in the PySpark column.

Published by Isshin Inada