Getting earliest and latest date in PySpark DataFrame
Getting earliest and latest date for date columns
Consider the following PySpark DataFrame:
import datetime

df = spark.createDataFrame([['Alex', datetime.date(1998,12,16)], ['Bob', datetime.date(1995,5,9)]], ['name', 'birthday'])
df.show()

+----+----------+
|name|  birthday|
+----+----------+
|Alex|1998-12-16|
| Bob|1995-05-09|
+----+----------+
Here, birthday is of type date, which we can confirm with the DataFrame's printSchema() method:
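df.printSchema()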
root
 |-- name: string (nullable = true)
 |-- birthday: date (nullable = true)
Use the F.min(~) method to get the earliest date, and use the F.max(~) method to get the latest date:
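A minimal version of this query, assuming pyspark.sql.functions is imported as F:

import pyspark.sql.functions as F

df.select(F.min('birthday').alias('earliest'), F.max('birthday').alias('latest')).show()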
+----------+----------+
|  earliest|    latest|
+----------+----------+
|1995-05-09|1998-12-16|
+----------+----------+
Here, we are using the alias(~) method to assign labels to the PySpark columns returned by F.min(~) and F.max(~).
To extract the earliest and latest dates as variables instead of a PySpark DataFrame:
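One way is to run the same query but call collect() instead of show() (a sketch, reusing F from above):

list_rows = df.select(F.min('birthday').alias('earliest'), F.max('birthday').alias('latest')).collect()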
print(f'Earliest date: {list_rows[0][0]}') # type is datetime.date
print(f'Latest date: {list_rows[0][1]}')
Earliest date: 1995-05-09
Latest date: 1998-12-16
Here, we are using the PySpark DataFrame's collect() method to convert the rows into a list of Row objects on the driver node:
list_rows
[Row(earliest=datetime.date(1995, 5, 9), latest=datetime.date(1998, 12, 16))]
Getting earliest and latest date for date string columns
The above solution works when the column is of type date. If you have date strings instead, you must first convert them into native dates using the to_date(~) method.
For example, consider the following PySpark DataFrame with some date strings:
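Such a DataFrame can be built by passing plain strings rather than datetime.date objects, for example:

df = spark.createDataFrame([['Alex', '1998-12-16'], ['Bob', '1995-5-9']], ['name', 'birthday'])
df.show()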
+----+----------+
|name|  birthday|
+----+----------+
|Alex|1998-12-16|
| Bob|  1995-5-9|
+----+----------+
We can convert the date strings to native dates using to_date(~):
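A sketch of the conversion, where the format 'yyyy-M-d' is an assumption chosen because the single-letter M and d accept both one- and two-digit months and days:

import pyspark.sql.functions as F

# parse the string column into a native date column
df_new = df.withColumn('birthday', F.to_date(df['birthday'], 'yyyy-M-d'))
df_new.show()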
Here, the second argument of to_date(~) specifies the format of the date string.