PySpark SQL Functions | when method

Last updated: Jul 1, 2022
Tags: PySpark

PySpark SQL Functions' when(~) method updates the values of a PySpark DataFrame column to other values based on the given conditions.

NOTE

The when(~) method is often used in conjunction with the otherwise(~) method to implement an if-else logic. See examples below for clarification.

Parameters

1. condition | Column | required

A boolean Column expression. See examples below for clarification.

2. value | any | required

The value to map to if the condition is true.

Return Value

A PySpark Column (pyspark.sql.column.Column).

Examples

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+

Implementing if-else logic using when and otherwise

To replace the name Alex with Doge, and all other names with Eric:

import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric")).show()
+-----------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge ELSE Eric END|
+-----------------------------------------------+
| Doge|
| Eric|
| Eric|
+-----------------------------------------------+

Notice how we used the method otherwise(~) to set values for cases when the conditions are not met.

Case when the otherwise method is not used

Note that if you do not include the otherwise(~) method, then rows that do not satisfy the condition will be assigned null:

df.select(F.when(df.name == "Alex", "Doge")).show()
+-------------------------------------+
|CASE WHEN (name = Alex) THEN Doge END|
+-------------------------------------+
| Doge|
| null|
| null|
+-------------------------------------+

Chaining the when method

The when(~) method can be chained like so:

df.select(F.when(df.name == "Alex", "Doge")
.when(df.name == "Bob", "Zebra")
.otherwise("Eric")).show()
+----------------------------------------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge WHEN (name = Bob) THEN Zebra ELSE Eric END|
+----------------------------------------------------------------------------+
| Doge|
| Zebra|
| Eric|
+----------------------------------------------------------------------------+

Setting a new value based on original value

To set a new value based on the original value:

import pyspark.sql.functions as F
df.select(F.when(df.age > 15, df.age + 30)).show()
+----------------------------------------+
|CASE WHEN (age > 15) THEN (age + 30) END|
+----------------------------------------+
| 50|
| 54|
| 52|
+----------------------------------------+

Using an alias

By default, the new column label is convoluted:

import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric")).show()
+-----------------------------------------------+
|CASE WHEN (name = Alex) THEN Doge ELSE Eric END|
+-----------------------------------------------+
| Doge|
| Eric|
| Eric|
+-----------------------------------------------+

To assign a new column label, use the alias(~) method:

import pyspark.sql.functions as F
df.select(F.when(df.name == "Alex", "Doge").otherwise("Eric").alias("new_name")).show()
+--------+
|new_name|
+--------+
| Doge|
| Eric|
| Eric|
+--------+
Published by Isshin Inada