PySpark SQL Functions | collect_set method
PySpark SQL Functions' collect_set(~) method returns a unique set of values in a column. Null values are ignored.
Use collect_list(~) instead to obtain a list of values that allows for duplicates.
Parameters: the column label or a Column object.
Return value: a PySpark SQL Column object.
Do not rely on the order of the returned set: it may vary between runs because it is affected by shuffle operations.
Consider the following PySpark DataFrame:
Getting a set of column values in PySpark
To get the unique set of values in the group column:
Equivalently, you can pass in a Column object to collect_set(~) as well:
Notice how the null value does not appear in the resulting set.
Getting the set as a standard list
To get the set as a standard list:
Here, the PySpark DataFrame's collect() method returns a list of Row objects. This list is guaranteed to have length one because collect_set(~) aggregates the whole column into a single row. The Row object contains the list, so we need to index into it once more to extract the list itself.
Getting a set of column values of each group in PySpark
collect_set(~) is often used in the context of aggregation. Consider the same PySpark DataFrame as before:
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
To flatten the group column into a single set for each name: