PySpark SQL Functions | collect_list method
PySpark SQL functions' collect_list(~) method returns a list of values in a column. Unlike collect_set(~), the returned list can contain duplicate values. Null values are ignored.
Parameters
1. col | string or Column object
The column label or a Column object.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Do not rely on the order of the returned list: it is non-deterministic because it depends on the order of the rows, which can change after shuffle operations.
Examples
Consider the following PySpark DataFrame:
data = [("Alex", "A"), ("Alex", "B"), ("Bob", "A"), ("Cathy", "C"), ("Dave", None)]
df = spark.createDataFrame(data, ["name", "group"])
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
Getting a list of column values in PySpark
To get a list of values in the group column:
Notice the following:
we have duplicate values (A).
null values are ignored.
Equivalently, you can pass in a Column object to collect_list(~) as well:
Obtaining a standard list
To obtain a standard list instead:
Here, the collect() method returns the content of the PySpark DataFrame returned by select(~) as a list of Row objects. This list is guaranteed to be of length one because collect_list(~) collects the values into a single list. Finally, we access the content of the Row object using [0].
Getting a list of column values for each group in PySpark
The method collect_list(~) is often used in the context of aggregation. Consider the same PySpark DataFrame as above:
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
To flatten the group column into a single list for each name: