PySpark SQL Functions | collect_list method
PySpark SQL functions' collect_list(~) method returns a list of the values in a column. Unlike collect_set(~), the returned list can contain duplicate values. Null values are ignored.
Parameters

1. col | string or Column

The column label or a Column object.

Return value

A PySpark SQL Column object (holding a list of the column's values).
Note that the order of the values in the returned list may be random, since the ordering is affected by shuffle operations.
Consider the following PySpark DataFrame:
```python
data = [("Alex", "A"), ("Alex", "B"), ("Bob", "A"), ("Cathy", "C"), ("Dave", None)]
df = spark.createDataFrame(data, ["name", "group"])
df.show()
```

```
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
```
Getting a list of column values in PySpark
To get a list of the values in the group column, use the collect_list(~) method:
Notice the following:

- we have duplicate values ("A" appears twice) in the returned list.
- null values are ignored (Dave's null group does not appear).
Equivalently, you can pass in a Column object to collect_list(~) as well:
Obtaining a standard list
To obtain a standard list instead:
Here, the collect() method returns the content of the PySpark DataFrame returned by select(~) as a list of Row objects. This list is guaranteed to be of length one because collect_list(~) collects all the values into a single list. Finally, we access the content of the Row object using its index ([0]).
Getting a list of column values for each group in PySpark
collect_list(~) is often used in the context of aggregation. Consider the same PySpark DataFrame as above:
```
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
```
To flatten the group column into a single list for each name, use collect_list(~) together with the groupby(~) method: