PySpark SQL Functions | collect_set method
PySpark SQL Functions' collect_set(~) method returns the set of unique values in a column. Null values are ignored.
Use collect_list(~) instead to obtain a list of values that allows for duplicates.
Parameters
1. col | string or Column object
The column label or a Column object.
Return Value
A PySpark SQL Column object (pyspark.sql.column.Column).
Note that the order of the returned set is non-deterministic since it is affected by shuffle operations.
Examples
Consider the following PySpark DataFrame:
data = [("Alex", "A"), ("Alex", "B"), ("Bob", "A"), ("Cathy", "C"), ("Dave", None)]
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
Getting a set of column values in PySpark
To get the unique set of values in the group column:
Equivalently, you can pass in a Column object to collect_set(~) as well:
Notice how the null value does not appear in the resulting set.
Getting the set as a standard list
To get the set as a standard list:
Here, the PySpark DataFrame's collect() method returns a list of Row objects. Since collect_set(~) aggregates the entire column into a single value, this list is guaranteed to have length one. The Row object itself holds the list of unique values, so we need a second [0] to extract it.
Getting a set of column values of each group in PySpark
The method collect_set(~) is often used in the context of aggregation. Consider the same PySpark DataFrame as before:
+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+
To flatten the group column into a single set for each name: