search
Search
Publish
menu
menu search toc more_vert
Robocat
Guest 0reps
Thanks for the thanks!
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
help Ask a question
Share on Twitter
search
keyboard_voice
close
Searching Tips
Search for a recipe: "Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to
A
A
share
thumb_up_alt
bookmark
arrow_backShare
Twitter
Facebook
thumb_up
0
thumb_down
0
chat_bubble_outline
0
auto_stories new
settings

Pandas DataFrame | groupby method

Programming
chevron_right
Python
chevron_right
Pandas
chevron_right
Documentation
chevron_right
DataFrame
chevron_right
Functions and Aggregations
schedule Jul 1, 2022
Last updated
local_offer PythonPandas
Tags

Pandas's DataFrame.groupby(~) divides up your DataFrame into groups based on the specified criteria. The returned value is useful because it allows you to compute statistics (e.g. mean and min) and apply transformations group-wise.

Parameters

1. by | scalar or array-like or dict

The criteria by which to divide up the DataFrame.

2. axis | int or string | optional

Whether to divide the DataFrame into columns or rows:

Axis

Description

DataFrame will be divided into columns.

0 or "index"

DataFrame will be divided into rows.

1 or "columns"

By default, axis=0.

3. level | int or string | optional

The level to target. This is only relevant if the source DataFrame is multi-index. By default, level=None.

4. as_indexlink | boolean | optional

Whether or not the group labels is to be used as the index of the resulting DataFrame. By default, as_index=True.

5. sortlink | boolean | optional

Whether or not to sort the groups by group labels. By default, sort=True. For performance gains, consider passing False when this behaviour is not needed.

6. group_keyslink | boolean | optional

Whether or not to include the group labels in the index if the aggregation function we apply changes the index. Check examples below for clarification. By default, group_keys=True.

7. squeezelink | boolean | optional

Whether or not to return a simplified type if possible. Consult examples below for clarification, By default, squeeze=False.

Return Value

A DataFrameGroupBy object.

Examples

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand": ["apple","google","apple","google"], "device":["phone","phone","computer","phone"]})
df
   price  brand   device
0  200    apple   phone
1  300    google  phone
2  700    apple   computer
3  900    google  phone

Grouping by a single column

To divide up the DataFrame using the categories of the brand column:

groups_brand = df.groupby("brand")      # Returns a groupby object

We can look at the partitions using the property groups:

df.groupby("brand").groups
{'apple': Int64Index([0, 2], dtype='int64'),
 'google': Int64Index([1, 3], dtype='int64')}

As we can see, the DataFrame was divided up into two groups: apple brand and google brand. The rows corresponding to apple brand are index 0 and index 2, whereas those corresponding to google brand are index 1 and 3.

There are many things you can do with this returned groupby object, such as computing the mean price of each brand:

groups_brand.mean()   # Returns a Series
        price
brand
apple   450
google  600

Notice how the mean of the price column was computed, while that of the device column was not, even though we did not explicitly specify the columns. This is because the aggregate functions only apply to numeric columns only.

Grouping by multiple columns

To determine the mean price of apple phones, apple computers, google phones and google computers:

df.groupby(["brand","device"]).mean()
                    price
brand    device
apple    computer   700
         phone      200
google   phone      600

Note that google computers do not exist in df, so that's why you don't see them in the output.

Just for your reference, we show df here again:

df
   price  brand   device
0   200   apple   phone
1   300   google  phone
2   700   apple   computer
3   900   google  phone

Iterating through the groupby object

To iterate through all groups of a groupby object:

for group_name, group in df.groupby("brand"):
print("group_name:", group_name)
print(group) # DataFrame
group_name: apple
   price  brand  device
0   200   apple  phone
2   700   apple  computer
group_name: google
   price  brand   device
1   300   google  phone
3   900   google  phone

Using aggregate functions

To compute the mean price of apple and google devices:

df.groupby("brand").agg("mean")   # Returns a DataFrame
        price
brand
apple   450
google  600

Note that the groupby object has instance methods for common aggregates:

df.groupby("brand").mean()
        price
brand
apple   450
google  600

You can also pass in a list of functions for agg to compute multiple aggregates:

df.groupby("brand").agg(["mean", np.max])
price
mean amax
brand
apple 450 700
google 600 900

Click here for more information about aggregate functions.

Applying transformation group-wise

Here is df again for your reference:

df
   price  brand   device
0   200   apple   phone
1   300   google  phone
2   700   apple   computer
3   900   google  phone

To apply a transformation group-wise, use transform(~):

df.groupby("brand").transform(lambda col: col - col.max())
   price
0  -500
1  -600
2   0
3   0

Note the following:

  • this is a completely arbitrary example where we are shifting each value by the maximum of the group it belongs in.

  • we obtain -500 in index 0 because the maximum price of the group brand=apple is 700, and so 200-700=-500.

  • the argument passed into our function (col) is of type Series, and it represents a single column of a group.

  • the return type of the entire code snippet is a DataFrame.

On a more practical note, we often apply transformations group-wise for standardisation.

Including only a subset of columns in the returned result

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "rating":[3,4,5,3], "brand":["apple","google","apple","google"]})
df
   pricing rating  brand
0   200    3     apple
1   300    4     google
2   700    5     apple
3   900    3     google

We have two numerical columns here: price and rating. By default, calling aggregates like mean() results in computing the aggregates for all numeric columns. For instance:

df.groupby("brand").mean()     # Returns a DataFrame
         price   rating
brand
apple    450.0   4.0
google   600.0   3.5

To compute aggregates for only the price column, you can use [] notation directly on the groupby object:

df.groupby("brand")["price"].mean()   # Returns a Series
brand
apple 450
google 600
Name: price, dtype: int64

Using keyword arguments to name columns

When using the agg method, you can assign column labels to the resulting DataFrame by providing a keyword argument:

df.groupby("brand")["price"].agg(mean_price="mean")
mean_price
brand
apple 450
google 600

For this to work, you must specify which column to aggregate ("price" in this case).

Specifying as_index

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand": ["C","B","A","B"]}, index=["a","b","c","d"])
df
price brand
a 200 C
b 300 B
c 700 A
d 900 B

By default, as_index=True, which means that the group labels will be used as the index of the resulting DataFrame:

df.groupby("brand", as_index=True).mean()
price
brand
A 700
B 600
C 200

Setting as_index=False would set the group labels as a column instead:

df.groupby("brand", as_index=False).mean()
brand price
0 A 700
1 B 600
2 C 200

Notice how the index is the default integer index ([0,1,2]).

Specifying sort

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700,900], "brand":["C","B","A","B"]}, index=["a","b","c","d"])
df
price brand
a 200 C
b 300 B
c 700 A
d 900 B

By default, sort=True, which means that the group labels will be sorted:

df.groupby("brand", sort=True).mean()
price
brand
A 700
B 600
C 200

Notice how the index of the resulting DataFrame is sorted in ascending order.

If we don't need this behaviour, then, set sort=False like so:

df.groupby("brand", sort=False).mean()
price
brand
C 200
B 600
A 700

Specifying group_keys

Consider the following DataFrame:

df = pd.DataFrame({"price":[200,300,700, 900], "brand":["apple","google","apple","google"]}, index=["a","b","c","d"])
df
price brand
a 200 apple
b 300 google
c 700 apple
d 900 google

By default, group_keys=True, which means that group names will be in the index of the resulting DataFrame:

df.groupby("brand", group_keys=True).apply(lambda x: x.reset_index())
index price brand
brand
apple 0 a 200 apple
1 c 700 apple
google 0 b 300 google
1 d 900 google

Note that group_keys takes effect only when the function we apply changes the index of the resulting DataFrame, just as with reset_index(~).

Setting group_keys=False would remove the group names from the resulting DataFrame:

df.groupby("brand", group_keys=False).apply(lambda x: x.reset_index())
index price brand
0 a 200 apple
1 c 700 apple
0 b 300 google
1 d 900 google

Notice how the brands are no longer present in the index.

Specifying squeeze

By default, squeeze=False, which means that the return type will not be simplified even when possible.

Consider the following DataFrame:

df = pd.DataFrame({"A":[2,3]})
df
A
0 2
1 3

This is a completely arbitrary example, but suppose we group by column A, and then for each group, apply a function that literally just returns a Series containing "b":

df.groupby("A").apply(lambda x: pd.Series(["b"]))
0
A
2 b
3 b

Here, the return type is DataFrame, which could be simplified down to a Series.

To simplify our return type to Series, set squeeze=True in groupby(~) like so:

df.groupby("a", squeeze=True).apply(lambda x: pd.Series(["b"]))
0 b
0 b
dtype: object

Here, the return type is Series.

robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
thumb_up
thumb_down
Ask a question or leave a feedback...