PySpark DataFrame | selectExpr method
PySpark DataFrame's selectExpr(~) method returns a new DataFrame based on the specified SQL expression.
Parameters
1. *expr | string
The SQL expression(s). Multiple expressions can be passed as separate arguments.
Return Value
A new PySpark DataFrame.
Examples
Consider the following PySpark DataFrame:
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+
Selecting data using SQL expressions in PySpark DataFrame
To get a new DataFrame where the values of the name column are uppercased and the values of the age column are doubled:
+----------+---------+
|upper_name|(age * 2)|
+----------+---------+
|      ALEX|       40|
|       BOB|       60|
|     CATHY|       80|
+----------+---------+
We should use selectExpr(~) rather than select(~) when extracting columns while performing simple transformations on them, just as we have done here.
There exists a similar method expr(~) in the pyspark.sql.functions library. expr(~) also takes a SQL expression as argument, but the difference is that it returns a PySpark Column rather than a DataFrame. The following usages of selectExpr(~) and expr(~) are equivalent:
In general, you should use selectExpr(~) rather than expr(~) because:
- you won't have to import the pyspark.sql.functions library.
- the syntax is shorter and clearer.
Parsing more complex SQL expressions
Consider the following PySpark DataFrame:
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 60|
+----+---+
We can use classic SQL clauses like AND and LIKE to formulate more complicated expressions:
+------+
|result|
+------+
|  true|
| false|
+------+
Here, we are checking for rows where age is less than 30 and the name starts with the letter A.
Note that we can implement the same logic like so:
+------+
|result|
+------+
|  true|
| false|
+------+
I personally prefer using selectExpr(~) because the syntax is cleaner and the meaning is intuitive for those who are familiar with SQL.
Checking for the existence of values in PySpark column
Another application of selectExpr(~) is to check for the existence of values in a PySpark column.