PySpark DataFrame | groupBy method

Last updated: Aug 12, 2023
Tags: PySpark

PySpark DataFrame's groupBy(~) method groups rows based on the specified columns. We can then compute statistics, such as the mean, for each of these groups.

Parameters

1. cols | list or string or Column | optional

The columns to group by. If no columns are specified, all rows are grouped together.

Return Value

The GroupedData object (pyspark.sql.group.GroupedData).
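
Note that GroupedData is not itself a DataFrame; it only becomes one once an aggregation method is applied. A minimal sketch, assuming a standard SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["Alex", "IT", 20]], ["name", "department", "age"])

grouped = df.groupBy("department")  # a GroupedData object, not a DataFrame
print(type(grouped))               # <class 'pyspark.sql.group.GroupedData'>
print(type(grouped.max("age")))    # <class 'pyspark.sql.dataframe.DataFrame'>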

Examples

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["Alex", "IT", 20, 500],\
["Bob", "IT", 24, 400],\
["Cathy", "HR", 22, 600]],\
["name", "department", "age", "salary"])
df.show()
+-----+----------+---+------+
| name|department|age|salary|
+-----+----------+---+------+
| Alex| IT| 20| 500|
| Bob| IT| 24| 400|
|Cathy| HR| 22| 600|
+-----+----------+---+------+

Basic usage

When called without any arguments, groupBy() groups all rows into a single group, and aggregation methods such as max() then compute statistics over each numeric column. Note that groupby() is simply an alias of groupBy():

df.groupby().max().show()
+--------+-----------+
|max(age)|max(salary)|
+--------+-----------+
| 24| 600|
+--------+-----------+
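
The no-argument form can also be combined with agg(~) to pick a specific statistic and name the resulting column. A minimal sketch (the alias avg_salary is our own choice):

from pyspark.sql import functions as F
df.groupby().agg(F.avg("salary").alias("avg_salary")).show()
+----------+
|avg_salary|
+----------+
|     500.0|
+----------+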

Grouping by a single column and computing statistics for all columns of each group

To get the highest age and salary in each department:

df.groupBy("department").max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
| IT| 24| 500|
| HR| 22| 600|
+----------+--------+-----------+

Instead of referring to the column by its label (string), we can also use pyspark.sql.functions.col(~):

from pyspark.sql import functions as F
df.groupby(F.col("department")).max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
| IT| 24| 500|
| HR| 22| 600|
+----------+--------+-----------+
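
Because groupBy(~) accepts Column objects, we can also group by a derived expression rather than an existing column. A minimal sketch that groups rows by whether age exceeds 21 (the label over_21 is our own choice):

df.groupby((F.col("age") > 21).alias("over_21")).count().show()
+-------+-----+
|over_21|count|
+-------+-----+
|  false|    1|
|   true|    2|
+-------+-----+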

Grouping by a single column and computing statistics for specific columns of each group

To get the highest age only instead of all numeric columns:

df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+

Equivalently, we can use the agg(~) method with one of the aggregate functions in pyspark.sql.functions:

df.groupby("department").agg(F.max("age")).show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+
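
As a quick sketch, agg(~) also accepts a dictionary mapping a column name to the name of an aggregate function, which is convenient when aggregations are built dynamically:

df.groupby("department").agg({"age": "max"}).show()
+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      24|
|        HR|      22|
+----------+--------+
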
NOTE

The GroupedData object returned by groupBy(~) supports the following aggregation methods:

agg, avg, count, max, mean, min, pivot, sum
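
For instance, count() returns the number of rows in each group:

df.groupby("department").count().show()
+----------+-----+
|department|count|
+----------+-----+
|        IT|    2|
|        HR|    1|
+----------+-----+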

Grouping by with an alias for the aggregated column

By default, computing the max age of each group will result in the column label max(age):

df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+

To use an alias, we need to use the function agg(~) instead:

import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max_age")).show()
+----------+-------+
|department|max_age|
+----------+-------+
| IT| 24|
| HR| 22|
+----------+-------+

Grouping by and computing multiple statistics

To compute multiple statistics at once:

import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max"), F.min("age"), F.avg("salary")).show()
+----------+---+--------+-----------+
|department|max|min(age)|avg(salary)|
+----------+---+--------+-----------+
|        IT| 24|      20|      450.0|
|        HR| 22|      22|      600.0|
+----------+---+--------+-----------+
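
We can also alias every statistic in the same agg(~) call to obtain cleaner column labels. A minimal sketch that additionally rounds the average salary (the alias names and the use of F.round(~) are our own choices):

df.groupby("department").agg(
    F.max("age").alias("max_age"),
    F.min("age").alias("min_age"),
    F.round(F.avg("salary"), 2).alias("avg_salary"),
).show()
+----------+-------+-------+----------+
|department|max_age|min_age|avg_salary|
+----------+-------+-------+----------+
|        IT|     24|     20|     450.0|
|        HR|     22|     22|     600.0|
+----------+-------+-------+----------+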

Grouping by multiple columns and computing statistic

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["Alex", "junior", "IT", 20, 500],\
["Bob", "junior", "IT", 24, 400],\
["Cathy", "junior", "HR", 22, 600],\
["Doge", "senior", "IT", 26, 800]],\
["name", "position", "department", "age", "salary"])
df.show()
+-----+--------+----------+---+------+
| name|position|department|age|salary|
+-----+--------+----------+---+------+
| Alex| junior| IT| 20| 500|
| Bob| junior| IT| 24| 400|
|Cathy| junior| HR| 22| 600|
| Doge| senior| IT| 26| 800|
+-----+--------+----------+---+------+

To group by position and department, and then compute the max age within each of these groups:

df.groupby(["position", "department"]).max("age").show()
+--------+----------+--------+
|position|department|max(age)|
+--------+----------+--------+
| junior| IT| 24|
| junior| HR| 22|
| senior| IT| 26|
+--------+----------+--------+
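
Note that the columns can also be passed as separate arguments rather than as a list. Moreover, since the aggregated result is an ordinary DataFrame, we can filter it afterwards, similar to SQL's HAVING clause. A minimal sketch (the alias max_age is our own choice):

result = df.groupby("position", "department").agg(F.max("age").alias("max_age"))
result.filter(F.col("max_age") > 22).show()
+--------+----------+-------+
|position|department|max_age|
+--------+----------+-------+
|  junior|        IT|     24|
|  senior|        IT|     26|
+--------+----------+-------+
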
Published by Isshin Inada