Getting earliest and latest date in PySpark DataFrame
Getting earliest and latest date for date columns
Consider the following PySpark DataFrame:
import datetime

df = spark.createDataFrame([['Alex', datetime.date(1998,12,16)], ['Bob', datetime.date(1995,5,9)]], ['name', 'birthday'])
df.show()

+----+----------+
|name|  birthday|
+----+----------+
|Alex|1998-12-16|
| Bob|1995-05-09|
+----+----------+
Here, birthday is of type date, which we can confirm with the DataFrame's printSchema() method:
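df.printSchema()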
root
 |-- name: string (nullable = true)
 |-- birthday: date (nullable = true)
Use the F.min(~) method to get the earliest date, and use the F.max(~) method to get the latest date:
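A minimal version of this query, assuming pyspark.sql.functions is imported as F:

import pyspark.sql.functions as F

df.select(F.min('birthday').alias('earliest'), F.max('birthday').alias('latest')).show()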
+----------+----------+
|  earliest|    latest|
+----------+----------+
|1995-05-09|1998-12-16|
+----------+----------+
Here, we are using the alias(~) method to assign labels to the PySpark columns returned by F.min(~) and F.max(~).
To extract the earliest and latest dates as variables instead of a PySpark DataFrame:
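One way is to run the same query but call collect() instead of show() (a sketch, reusing F from above):

list_rows = df.select(F.min('birthday').alias('earliest'), F.max('birthday').alias('latest')).collect()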
print(f'Earliest date: {list_rows[0][0]}') # type is datetime.date
print(f'Latest date: {list_rows[0][1]}')
Earliest date: 1995-05-09
Latest date: 1998-12-16
Here, we are using the PySpark DataFrame's collect() method to convert the rows into a list of Row objects on the driver node:
list_rows
[Row(earliest=datetime.date(1995, 5, 9), latest=datetime.date(1998, 12, 16))]
Getting earliest and latest date for date string columns
The above solution works when the column is of type date. If you have date strings instead, you must first convert them into native dates using the to_date(~) method.
For example, consider the following PySpark DataFrame with some date strings:
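Such a DataFrame can be built by passing plain strings rather than datetime.date objects, for example:

df = spark.createDataFrame([['Alex', '1998-12-16'], ['Bob', '1995-5-9']], ['name', 'birthday'])
df.show()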
+----+----------+
|name|  birthday|
+----+----------+
|Alex|1998-12-16|
| Bob|  1995-5-9|
+----+----------+
We can convert the date strings to native dates using to_date(~):
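A sketch of the conversion, where the format 'yyyy-M-d' is an assumption chosen because the single-letter M and d accept both one- and two-digit months and days:

import pyspark.sql.functions as F

# parse the string column into a native date column
df_new = df.withColumn('birthday', F.to_date(df['birthday'], 'yyyy-M-d'))
df_new.show()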
Here, the second argument of to_date(~) specifies the format of the date string.