PySpark DataFrame | groupBy method

Last updated: Jul 1, 2022
Tags: PySpark

PySpark DataFrame's groupBy(~) method aggregates rows based on the specified columns. We can then compute statistics such as the mean for each of these groups.

Parameters

1. cols | list or string or Column | optional

The columns to group by. If no columns are specified, all rows are grouped together.

Return Value

The GroupedData object (pyspark.sql.group.GroupedData).

Examples

Consider the following PySpark DataFrame:

df = spark.createDataFrame(
    [["Alex", "IT", 20, 500],
     ["Bob", "IT", 24, 400],
     ["Cathy", "HR", 22, 600]],
    ["name", "department", "age", "salary"])
df.show()
+-----+----------+---+------+
| name|department|age|salary|
+-----+----------+---+------+
| Alex|        IT| 20|   500|
|  Bob|        IT| 24|   400|
|Cathy|        HR| 22|   600|
+-----+----------+---+------+

Basic usage

Calling groupBy() without any arguments groups all rows together; the aggregate is then computed over each numeric column:

df.groupby().max().show()
+--------+-----------+
|max(age)|max(salary)|
+--------+-----------+
|      24|        600|
+--------+-----------+

Grouping by a single column and computing statistic of all columns of each group

To get the highest age and salary in each department:

df.groupBy("department").max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
|        IT|      24|        500|
|        HR|      22|        600|
+----------+--------+-----------+

Instead of referring to the column by its label (string), we can also use pyspark.sql.functions.col(~):

from pyspark.sql import functions as F
df.groupby(F.col("department")).max().show()
+----------+--------+-----------+
|department|max(age)|max(salary)|
+----------+--------+-----------+
|        IT|      24|        500|
|        HR|      22|        600|
+----------+--------+-----------+

Grouping by a single column and computing statistic of specific columns of each group

To get the highest age only instead of all numeric columns:

df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      24|
|        HR|      22|
+----------+--------+

Equivalently, we can use the agg(~) method with one of the aggregate functions from pyspark.sql.functions:

df.groupby("department").agg(F.max("age")).show()
+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      24|
|        HR|      22|
+----------+--------+

Grouping by with aliases for the aggregated column

By default, computing the max age of each group will result in the column label max(age):

df.groupby("department").max("age").show()
+----------+--------+
|department|max(age)|
+----------+--------+
|        IT|      24|
|        HR|      22|
+----------+--------+

To use an alias, we need to use the function agg(~) instead:

import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max_age")).show()
+----------+-------+
|department|max_age|
+----------+-------+
|        IT|     24|
|        HR|     22|
+----------+-------+

Grouping by and computing multiple statistics

To compute multiple statistics at once:

import pyspark.sql.functions as F
df.groupby("department").agg(F.max("age").alias("max"), F.min("age"), F.avg("salary")).show()
+----------+---+--------+-----------+
|department|max|min(age)|avg(salary)|
+----------+---+--------+-----------+
|        IT| 24|      20|      450.0|
|        HR| 22|      22|      600.0|
+----------+---+--------+-----------+

Grouping by multiple columns and computing statistic

Consider the following PySpark DataFrame:

df = spark.createDataFrame(
    [["Alex", "junior", "IT", 20, 500],
     ["Bob", "junior", "IT", 24, 400],
     ["Cathy", "junior", "HR", 22, 600],
     ["Doge", "senior", "IT", 26, 800]],
    ["name", "position", "department", "age", "salary"])
df.show()
+-----+--------+----------+---+------+
| name|position|department|age|salary|
+-----+--------+----------+---+------+
| Alex|  junior|        IT| 20|   500|
|  Bob|  junior|        IT| 24|   400|
|Cathy|  junior|        HR| 22|   600|
| Doge|  senior|        IT| 26|   800|
+-----+--------+----------+---+------+

To group by position and department, and then compute the max age of each of these groups:

df.groupby(["position", "department"]).max("age").show()
+--------+----------+--------+
|position|department|max(age)|
+--------+----------+--------+
|  junior|        IT|      24|
|  junior|        HR|      22|
|  senior|        IT|      26|
+--------+----------+--------+
Published by Isshin Inada