data = [("Alex", "A"), ("Alex", "B"), ("Bob", "A"), ("Cathy", "C"), ("Dave", None)]
df = spark.createDataFrame(data, ["name", "group"])
df.show()

+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+

Getting a list of column values in PySpark

To get a list of values in the group column:

import pyspark.sql.functions as F
df.select(F.collect_list("group")).show()

+-------------------+
|collect_list(group)|
+-------------------+
|       [A, B, A, C]|
+-------------------+

Notice the following:

duplicate values are kept (A appears twice).
null values are ignored.

Equivalently, you can pass a Column object to collect_list(~):

import pyspark.sql.functions as F
df.select(F.collect_list(df.group)).show()

+-------------------+
|collect_list(group)|
+-------------------+
|       [A, B, A, C]|
+-------------------+

Obtaining a standard list

To obtain a standard Python list instead:

list_rows = df.select(F.collect_list(df.group)).collect()
list_rows[0][0]

['A', 'B', 'A', 'C']

Here, the collect() method returns the content of the PySpark DataFrame returned by select(~) as a list of Row objects. This list is guaranteed to be of length one because collect_list(~) aggregates all the values into a single row. The first [0] selects that Row object, and the second [0] selects its only column, which holds the list.

Getting a list of column values for each group in PySpark

The method collect_list(~) is often used in the context of aggregation. Consider the same PySpark DataFrame as above:

df.show()

+-----+-----+
| name|group|
+-----+-----+
| Alex|    A|
| Alex|    B|
|  Bob|    A|
|Cathy|    C|
| Dave| null|
+-----+-----+

To collect the values of the group column into a single list for each name:

import pyspark.sql.functions as F
df.groupby("name").agg(F.collect_list("group")).show()

+-----+-------------------+
| name|collect_list(group)|
+-----+-------------------+
| Alex|             [A, B]|
|  Bob|                [A]|
|Cathy|                [C]|
| Dave|                 []|
+-----+-------------------+

Published by Isshin Inada


Official PySpark Documentation

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.collect_list.html#pyspark.sql.functions.collect_list
