PySpark SQL Functions | collect_list method
PySpark SQL functions' collect_list(~) method returns a list of values in a column. Unlike collect_set(~), the returned list can contain duplicate values. Null values are ignored.
collect_list(~) takes a single argument: the column label or a Column object.
It returns a PySpark SQL Column object.
Do not rely on the order of the returned list. The order depends on the order of the rows, which can change after a shuffle operation, so the result is non-deterministic.
Consider the following PySpark DataFrame:
Getting a list of column values in PySpark
To get a list of the values in the group column:
Notice the following:
we have duplicate values in the returned list.
null values are ignored.
Equivalently, you can pass a Column object to collect_list(~):
Obtaining a standard list
To obtain a standard list instead:
The collect() method returns the content of the PySpark DataFrame returned by select(~) as a list of Row objects. This list is guaranteed to be of length one because collect_list(~) collects the values into a single list. Finally, we access the content of the Row object using its positional index.
Getting a list of column values for each group in PySpark
collect_list(~) is often used in the context of aggregation. Consider the same PySpark DataFrame as above:
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
To flatten the group column into a single list for each name: