Pandas DataFrame | groupby method
Start your free 7-days trial now!
Pandas's DataFrame.groupby(~) divides up your DataFrame into groups based on the specified criteria. The returned value is useful because it allows you to compute statistics (e.g. mean and min) and apply transformations group-wise.
Parameters
1. by | scalar or array-like or dict
The criteria by which to divide up the DataFrame.
2. axis | int or string | optional
Whether to divide the DataFrame into columns or rows:
Axis | Description |
|---|---|
| DataFrame will be divided into columns. |
| DataFrame will be divided into rows. |
By default, axis=0.
3. level | int or string | optional
The level to target. This is only relevant if the source DataFrame is multi-index. By default, level=None.
4. as_indexlink | boolean | optional
Whether or not the group labels is to be used as the index of the resulting DataFrame. By default, as_index=True.
5. sortlink | boolean | optional
Whether or not to sort the groups by group labels. By default, sort=True. For performance gains, consider passing False when this behaviour is not needed.
6. group_keyslink | boolean | optional
Whether or not to include the group labels in the index if the aggregation function we apply changes the index. Check examples below for clarification. By default, group_keys=True.
7. squeezelink | boolean | optional
Whether or not to return a simplified type if possible. Consult examples below for clarification, By default, squeeze=False.
Return Value
A DataFrameGroupBy object.
Examples
Consider the following DataFrame:
df = pd.DataFrame({"price":[200,300,700,900], "brand": ["apple","google","apple","google"], "device":["phone","phone","computer","phone"]})df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
Grouping by a single column
To divide up the DataFrame using the categories of the brand column:
groups_brand = df.groupby("brand") # Returns a groupby object
We can look at the partitions using the property groups:
df.groupby("brand").groups
{'apple': Int64Index([0, 2], dtype='int64'), 'google': Int64Index([1, 3], dtype='int64')}
As we can see, the DataFrame was divided up into two groups: apple brand and google brand. The rows corresponding to apple brand are index 0 and index 2, whereas those corresponding to google brand are index 1 and 3.
There are many things you can do with this returned groupby object, such as computing the mean price of each brand:
pricebrandapple 450google 600
Notice how the mean of the price column was computed, while that of the device column was not, even though we did not explicitly specify the columns. This is because the aggregate functions only apply to numeric columns only.
Grouping by multiple columns
To determine the mean price of apple phones, apple computers, google phones and google computers:
pricebrand device apple computer 700 phone 200google phone 600
Note that google computers do not exist in df, so that's why you don't see them in the output.
Just for your reference, we show df here again:
df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
Iterating through the groupby object
To iterate through all groups of a groupby object:
Using aggregate functions
To compute the mean price of apple and google devices:
pricebrandapple 450google 600
Note that the groupby object has instance methods for common aggregates:
pricebrand apple 450google 600
You can also pass in a list of functions for agg to compute multiple aggregates:
Click here for more information about aggregate functions.
Applying transformation group-wise
Here is df again for your reference:
df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
To apply a transformation group-wise, use transform(~):
Note the following:
this is a completely arbitrary example where we are shifting each value by the maximum of the group it belongs in.
we obtain
-500in index0because the maximum price of the groupbrand=appleis700, and so200-700=-500.the argument passed into our function (
col) is of typeSeries, and it represents a single column of a group.the return type of the entire code snippet is a
DataFrame.
On a more practical note, we often apply transformations group-wise for standardisation.
Including only a subset of columns in the returned result
Consider the following DataFrame:
df = pd.DataFrame({"price":[200,300,700,900], "rating":[3,4,5,3], "brand":["apple","google","apple","google"]})df
pricing rating brand0 200 3 apple1 300 4 google2 700 5 apple3 900 3 google
We have two numerical columns here: price and rating. By default, calling aggregates like mean() results in computing the aggregates for all numeric columns. For instance:
price ratingbrand apple 450.0 4.0google 600.0 3.5
To compute aggregates for only the price column, you can use [] notation directly on the groupby object:
brandapple 450google 600Name: price, dtype: int64
Using keyword arguments to name columns
When using the agg method, you can assign column labels to the resulting DataFrame by providing a keyword argument:
mean_pricebrand apple 450google 600
For this to work, you must specify which column to aggregate ("price" in this case).
Specifying as_index
Consider the following DataFrame:
df
price branda 200 Cb 300 Bc 700 Ad 900 B
By default, as_index=True, which means that the group labels will be used as the index of the resulting DataFrame:
pricebrand A 700B 600C 200
Setting as_index=False would set the group labels as a column instead:
brand price0 A 7001 B 6002 C 200
Notice how the index is the default integer index ([0,1,2]).
Specifying sort
Consider the following DataFrame:
df
price branda 200 Cb 300 Bc 700 Ad 900 B
By default, sort=True, which means that the group labels will be sorted:
pricebrand A 700B 600C 200
Notice how the index of the resulting DataFrame is sorted in ascending order.
If we don't need this behaviour, then, set sort=False like so:
pricebrand C 200B 600A 700
Specifying group_keys
Consider the following DataFrame:
df = pd.DataFrame({"price":[200,300,700, 900], "brand":["apple","google","apple","google"]}, index=["a","b","c","d"])df
price branda 200 appleb 300 googlec 700 appled 900 google
By default, group_keys=True, which means that group names will be in the index of the resulting DataFrame:
index price brandbrand apple 0 a 200 apple 1 c 700 applegoogle 0 b 300 google 1 d 900 google
Note that group_keys takes effect only when the function we apply changes the index of the resulting DataFrame, just as with reset_index(~).
Setting group_keys=False would remove the group names from the resulting DataFrame:
index price brand0 a 200 apple1 c 700 apple0 b 300 google1 d 900 google
Notice how the brands are no longer present in the index.
Specifying squeeze
By default, squeeze=False, which means that the return type will not be simplified even when possible.
Consider the following DataFrame:
df
A0 21 3
This is a completely arbitrary example, but suppose we group by column A, and then for each group, apply a function that literally just returns a Series containing "b":
Here, the return type is DataFrame, which could be simplified down to a Series.
To simplify our return type to Series, set squeeze=True in groupby(~) like so:
Here, the return type is Series.