**Pandas**

*chevron_left*

**Functions and Aggregations**

# Pandas DataFrame | groupby method

*schedule*Aug 10, 2023

*toc*Table of Contents

*expand_more*

**mathematics behind data science**with 100+ top-tier guides

Start your free 7-days trial now!

Pandas's `DataFrame.groupby(~)`

divides up your DataFrame into groups based on the specified criteria. The returned value is useful because it allows you to compute statistics (e.g. mean and min) and apply transformations group-wise.

# Parameters

1. `by`

| `scalar`

or `array-like`

or `dict`

The criteria by which to divide up the DataFrame.

2. `axis`

| `int`

or `string`

| `optional`

Whether to divide the DataFrame into columns or rows:

Axis | Description |
---|---|

| DataFrame will be divided into columns. |

| DataFrame will be divided into rows. |

By default, `axis=0`

.

3. `level`

| `int`

or `string`

| `optional`

The level to target. This is only relevant if the source DataFrame is multi-index. By default, `level=None`

.

4. `as_index`

link | `boolean`

| `optional`

Whether or not the group labels is to be used as the index of the resulting DataFrame. By default, `as_index=True`

.

5. `sort`

link | `boolean`

| `optional`

Whether or not to sort the groups by group labels. By default, `sort=True`

. For performance gains, consider passing `False`

when this behaviour is not needed.

6. `group_keys`

link | `boolean`

| `optional`

Whether or not to include the group labels in the index if the aggregation function we apply changes the index. Check examples below for clarification. By default, `group_keys=True`

.

7. `squeeze`

link | `boolean`

| `optional`

Whether or not to return a simplified type if possible. Consult examples below for clarification, By default, `squeeze=False`

.

# Return Value

A `DataFrameGroupBy`

object.

# Examples

Consider the following DataFrame:

```
df = pd.DataFrame({"price":[200,300,700,900], "brand": ["apple","google","apple","google"], "device":["phone","phone","computer","phone"]})df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
```

## Grouping by a single column

To divide up the DataFrame using the categories of the `brand`

column:

```
groups_brand = df.groupby("brand") # Returns a groupby object
```

We can look at the partitions using the property `groups`

:

```
df.groupby("brand").groups
{'apple': Int64Index([0, 2], dtype='int64'), 'google': Int64Index([1, 3], dtype='int64')}
```

As we can see, the DataFrame was divided up into two groups: `apple`

brand and `google`

brand. The rows corresponding to `apple`

brand are index `0`

and index `2`

, whereas those corresponding to `google`

brand are index `1`

and `3`

.

There are many things you can do with this returned `groupby`

object, such as computing the mean price of each brand:

```
pricebrandapple 450google 600
```

Notice how the `mean`

of the price column was computed, while that of the `device`

column was not, even though we did not explicitly specify the columns. This is because the aggregate functions only apply to numeric columns only.

## Grouping by multiple columns

To determine the mean price of apple phones, apple computers, google phones and google computers:

```
pricebrand device apple computer 700 phone 200google phone 600
```

Note that `google`

computers do not exist in `df`

, so that's why you don't see them in the output.

Just for your reference, we show `df`

here again:

```
df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
```

## Iterating through the groupby object

To iterate through all groups of a `groupby`

object:

```
```

## Using aggregate functions

To compute the mean price of `apple`

and `google`

devices:

```
pricebrandapple 450google 600
```

Note that the `groupby`

object has instance methods for common aggregates:

```
pricebrand apple 450google 600
```

You can also pass in a list of functions for `agg`

to compute multiple aggregates:

```
```

Click here for more information about aggregate functions.

## Applying transformation group-wise

Here is `df`

again for your reference:

```
df
price brand device0 200 apple phone1 300 google phone2 700 apple computer3 900 google phone
```

To apply a transformation group-wise, use `transform(~)`

:

```
```

Note the following:

this is a completely arbitrary example where we are shifting each value by the maximum of the group it belongs in.

we obtain

`-500`

in index`0`

because the maximum price of the group`brand=apple`

is`700`

, and so`200-700=-500`

.the argument passed into our function (

`col`

) is of type`Series`

, and it represents a single column of a group.the return type of the entire code snippet is a

`DataFrame`

.

On a more practical note, we often apply transformations group-wise for standardisation.

## Including only a subset of columns in the returned result

Consider the following DataFrame:

```
df = pd.DataFrame({"price":[200,300,700,900], "rating":[3,4,5,3], "brand":["apple","google","apple","google"]})df
pricing rating brand0 200 3 apple1 300 4 google2 700 5 apple3 900 3 google
```

We have two numerical columns here: `price`

and `rating`

. By default, calling aggregates like `mean()`

results in computing the aggregates for all numeric columns. For instance:

```
price ratingbrand apple 450.0 4.0google 600.0 3.5
```

To compute aggregates for only the `price`

column, you can use `[]`

notation directly on the `groupby`

object:

```
brandapple 450google 600Name: price, dtype: int64
```

### Using keyword arguments to name columns

When using the `agg`

method, you can assign column labels to the resulting DataFrame by providing a keyword argument:

```
mean_pricebrand apple 450google 600
```

For this to work, you must specify which column to aggregate (`"price"`

in this case).

## Specifying as_index

Consider the following DataFrame:

```
df
price branda 200 Cb 300 Bc 700 Ad 900 B
```

By default, `as_index=True`

, which means that the group labels will be used as the index of the resulting DataFrame:

```
pricebrand A 700B 600C 200
```

Setting `as_index=False`

would set the group labels as a column instead:

```
brand price0 A 7001 B 6002 C 200
```

Notice how the index is the default integer index (`[0,1,2]`

).

## Specifying sort

Consider the following DataFrame:

```
df
price branda 200 Cb 300 Bc 700 Ad 900 B
```

By default, `sort=True`

, which means that the group labels will be sorted:

```
pricebrand A 700B 600C 200
```

Notice how the index of the resulting DataFrame is sorted in ascending order.

If we don't need this behaviour, then, set `sort=False`

like so:

```
pricebrand C 200B 600A 700
```

## Specifying group_keys

Consider the following DataFrame:

```
df = pd.DataFrame({"price":[200,300,700, 900], "brand":["apple","google","apple","google"]}, index=["a","b","c","d"])df
price branda 200 appleb 300 googlec 700 appled 900 google
```

By default, `group_keys=True`

, which means that group names will be in the index of the resulting DataFrame:

```
index price brandbrand apple 0 a 200 apple 1 c 700 applegoogle 0 b 300 google 1 d 900 google
```

Note that `group_keys`

takes effect only when the function we apply changes the index of the resulting DataFrame, just as with `reset_index(~)`

.

Setting `group_keys=False`

would remove the group names from the resulting DataFrame:

```
index price brand0 a 200 apple1 c 700 apple0 b 300 google1 d 900 google
```

Notice how the brands are no longer present in the index.

## Specifying squeeze

By default, `squeeze=False`

, which means that the return type will not be simplified even when possible.

Consider the following DataFrame:

```
df
A0 21 3
```

This is a completely arbitrary example, but suppose we group by column `A`

, and then for each group, apply a function that literally just returns a Series containing `"b"`

:

```
```

Here, the return type is `DataFrame`

, which could be simplified down to a `Series`

.

To simplify our return type to `Series`

, set `squeeze=True`

in `groupby(~)`

like so:

```
```

Here, the return type is `Series`

.