PySpark DataFrame | groupBy method

Last updated: Aug 12, 2023
Tags: PySpark

PySpark DataFrame's groupBy(~) method groups rows based on the specified columns. We can then compute statistics, such as the mean, for each of these groups.

Parameters

1. cols | list or string or Column | optional

The columns to group by. If no columns are specified, all rows are grouped together.

Return Value

The GroupedData object (pyspark.sql.group.GroupedData).
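
Note that GroupedData is not itself a DataFrame; it only becomes one once an aggregation method is applied. A minimal sketch, assuming a standard SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["Alex", "IT", 20]], ["name", "department", "age"])

grouped = df.groupBy("department")  # a GroupedData object, not a DataFrame
print(type(grouped))               # <class 'pyspark.sql.group.GroupedData'>
print(type(grouped.max("age")))    # <class 'pyspark.sql.dataframe.DataFrame'>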

Examples

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["Alex", "IT", 20, 500],\
["Bob", "IT", 24, 400],\
["Cathy", "HR", 22, 600]],\
["name", "department", "age", "salary"])
df.show()
+-----+----------+---+------+
| name|department|age|salary|
+-----+----------+---+------+
| Alex| IT| 20| 500|
| Bob| IT| 24| 400|
|Cathy| HR| 22| 600|
+-----+----------+---+------+

Basic usage

When called without any arguments, groupBy() groups all rows into a single group, and aggregation methods such as max() then compute statistics over each numeric column. Note that groupby() is simply an alias of groupBy():

df.groupby().max().show()
+--------+-----------+
|max(age)|max(salary)|
+--------+-----------+
| 24| 600|
+--------+-----------+
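
The no-argument form can also be combined with agg(~) to pick a specific statistic and name the resulting column. A minimal sketch (the alias avg_salary is our own choice):

from pyspark.sql import functions as F
df.groupby().agg(F.avg("salary").alias("avg_salary")).show()
+----------+
|avg_salary|
+----------+
|     500.0|
+----------+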

Grouping by a single column and computing statistics for all columns of each group

To get the highest age and salary in each department:

df.groupBy("department").max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
| IT| 24| 500|
| HR| 22| 600|
+----------+--------+-----------+

Instead of referring to the column by its label (string), we can also use pyspark.sql.functions.col(~):

from pyspark.sql import functions as F
df.groupby(F.col("department")).max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
| IT| 24| 500|
| HR| 22| 600|
+----------+--------+-----------+
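
Because groupBy(~) accepts Column objects, we can also group by a derived expression rather than an existing column. A minimal sketch that groups rows by whether age exceeds 21 (the label over_21 is our own choice):

df.groupby((F.col("age") > 21).alias("over_21")).count().show()
+-------+-----+
|over_21|count|
+-------+-----+
|  false|    1|
|   true|    2|
+-------+-----+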

Grouping by a single column and computing statistics for specific columns of each group

To get the highest age only instead of all numeric columns:

df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+

Equivalently, we can use the agg(~) method with one of the aggregate functions in pyspark.sql.functions:

df.groupby("department").agg(F.max("age")).show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+
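
As a quick sketch, agg(~) also accepts a dictionary mapping a column name to the name of an aggregate function, which is convenient when aggregations are built dynamically:

df.groupby("department").agg({"age": "max"}).show()
+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      24|
|        HR|      22|
+----------+--------+
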
NOTE

The GroupedData object returned by groupBy(~) supports the following aggregation methods:

agg, avg, count, max, mean, min, pivot, sum
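
For instance, count() returns the number of rows in each group:

df.groupby("department").count().show()
+----------+-----+
|department|count|
+----------+-----+
|        IT|    2|
|        HR|    1|
+----------+-----+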

Grouping by with an alias for the aggregated column

By default, computing the max age of each group will result in the column label max(age):

df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
| IT| 24|
| HR| 22|
+----------+--------+

To use an alias, we need to use the function agg(~) instead:

import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max_age")).show()
+----------+-------+
|department|max_age|
+----------+-------+
| IT| 24|
| HR| 22|
+----------+-------+

Grouping by and computing multiple statistics

To compute multiple statistics at once:

import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max"), F.min("age"), F.avg("salary")).show()
+----------+---+--------+-----------+
|department|max|min(age)|avg(salary)|
+----------+---+--------+-----------+
|        IT| 24|      20|      450.0|
|        HR| 22|      22|      600.0|
+----------+---+--------+-----------+
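
We can also alias every statistic in the same agg(~) call to obtain cleaner column labels. A minimal sketch that additionally rounds the average salary (the alias names and the use of F.round(~) are our own choices):

df.groupby("department").agg(
    F.max("age").alias("max_age"),
    F.min("age").alias("min_age"),
    F.round(F.avg("salary"), 2).alias("avg_salary"),
).show()
+----------+-------+-------+----------+
|department|max_age|min_age|avg_salary|
+----------+-------+-------+----------+
|        IT|     24|     20|     450.0|
|        HR|     22|     22|     600.0|
+----------+-------+-------+----------+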

Grouping by multiple columns and computing statistic

Consider the following PySpark DataFrame:

df = spark.createDataFrame([["Alex", "junior", "IT", 20, 500],\
["Bob", "junior", "IT", 24, 400],\
["Cathy", "junior", "HR", 22, 600],\
["Doge", "senior", "IT", 26, 800]],\
["name", "position", "department", "age", "salary"])
df.show()
+-----+--------+----------+---+------+
| name|position|department|age|salary|
+-----+--------+----------+---+------+
| Alex| junior| IT| 20| 500|
| Bob| junior| IT| 24| 400|
|Cathy| junior| HR| 22| 600|
| Doge| senior| IT| 26| 800|
+-----+--------+----------+---+------+

To group by position and department, and then compute the max age within each of these groups:

df.groupby(["position", "department"]).max("age").show()
+--------+----------+--------+
|position|department|max(age)|
+--------+----------+--------+
| junior| IT| 24|
| junior| HR| 22|
| senior| IT| 26|
+--------+----------+--------+
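
Note that the columns can also be passed as separate arguments rather than as a list. Moreover, since the aggregated result is an ordinary DataFrame, we can filter it afterwards, similar to SQL's HAVING clause. A minimal sketch (the alias max_age is our own choice):

result = df.groupby("position", "department").agg(F.max("age").alias("max_age"))
result.filter(F.col("max_age") > 22).show()
+--------+----------+-------+
|position|department|max_age|
+--------+----------+-------+
|  junior|        IT|     24|
|  senior|        IT|     26|
+--------+----------+-------+
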
Published by Isshin Inada