# Difference between methods apply and transform for groupby in Pandas

Mar 9, 2022

The main differences are the input and output of the argument function:

Method | Input | Output
---|---|---
`apply(~)` | A DataFrame | A scalar, a sequence or a DataFrame
`transform(~)` | A Series | A sequence that has the same length as the input

What this means is that `apply(~)` allows you to perform operations on columns, rows and the entire DataFrame of each group, whereas `transform(~)` is restricted to operations on individual columns of each group.
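As a minimal sketch of what "operations on the entire DataFrame of each group" can look like, the function below combines two columns of each group at once, which a per-column function could not do. The helper name `weighted_total` is hypothetical and chosen only for illustration; the DataFrame is the same one used in the examples of this article.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 5, 4], "B": [10, 100, 8], "group": ["a", "a", "b"]})

# Hypothetical helper: my_df is the sub-DataFrame of one group
def weighted_total(my_df):
    # Uses columns A and B together, so it needs the whole group, not a single column
    return (my_df["A"] * my_df["B"]).sum()

# Selecting ["A", "B"] keeps the string column "group" out of the data passed to the function
result = df.groupby("group")[["A", "B"]].apply(weighted_total)
print(result)
```

Since the function returns one scalar per group, the result is a Series indexed by the group labels.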

# Examples

## Difference in input

Consider the following DataFrame:

```
import pandas as pd

df = pd.DataFrame({"A":[2,5,4],"B":[10,100,8],"group":["a","a","b"]})
df
   A    B group
0  2   10     a
1  5  100     a
2  4    8     b
```

To compute the cumulative sum of rows of each group, you must use `apply()`:

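```
# my_df is a DataFrame representing each group
def f(my_df):
    # returns a DataFrame
    return my_df.cumsum(axis=1)

# Select the numeric columns so that the string column "group" is not part of the row-wise sum;
# group_keys=False keeps the original row index in the result
df.groupby("group", group_keys=False)[["A", "B"]].apply(f)
   A    B
0  2   12
1  5  105
2  4   12
```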

Here, our function `f` is called twice, once for each group. `transform(f)` would not work here because it only allows operations on individual columns, so row-wise operations are not possible.
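The following sketch shows what happens when a row-wise function is passed to `transform` anyway. The function name `row_cumsum` is hypothetical, and the exact exception type and message are an assumption that may differ between pandas versions, so the example catches a generic `Exception` and prints whatever was raised.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 5, 4], "B": [10, 100, 8], "group": ["a", "a", "b"]})

# Hypothetical helper: valid for a DataFrame, but a Series has no axis=1
def row_cumsum(obj):
    return obj.cumsum(axis=1)

error = None
try:
    df.groupby("group").transform(row_cumsum)
except Exception as err:  # exception type may vary between pandas versions
    error = err

print(type(error).__name__, error)
```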

To compute the cumulative sum of columns of each group, you can use `transform(f)`:

```
# my_col is a Series representing a single column of each group
def f(my_col):
    # returns a Series
    return my_col.cumsum()

df.groupby("group").transform(f)
   A    B
0  2   10
1  7  110
2  4    8
```

Here, our function `f` is conceptually called four times, since we have two groups and each group has two columns.

In many cases, using `apply(f)` instead of `transform(f)` produces the same values, since many DataFrame operations, including `cumsum(~)`, are performed column by column by default.
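A small sketch of this equivalence, using the same DataFrame: the same column-wise function is passed to both methods and the results are compared. The column selection `[["A", "B"]]` and `group_keys=False` are assumptions added here so that both results contain only the numeric columns and share the original row index.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 5, 4], "B": [10, 100, 8], "group": ["a", "a", "b"]})

# Works on a DataFrame (column by column) as well as on a single Series
def f(obj):
    return obj.cumsum()

grouped = df.groupby("group", group_keys=False)[["A", "B"]]

via_apply = grouped.apply(f)          # f receives each group's DataFrame
via_transform = grouped.transform(f)  # f is applied to the columns of each group

# Compare the computed values of both results
print(via_apply.values.tolist() == via_transform.values.tolist())
```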

## Difference in output

Consider the same DataFrame as before:

```
df = pd.DataFrame({"A":[2,5,4],"B":[10,100,8],"group":["a","a","b"]})
df
   A    B group
0  2   10     a
1  5  100     a
2  4    8     b
```

Returning a scalar for `apply(~)` yields:

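```
def f(my_df):
    # return the maximum value (scalar) in the entire my_df for each group
    return my_df.max().max()

# Select the numeric columns so that the string column "group" is not compared with numbers
df.groupby("group")[["A", "B"]].apply(f)   # returns a Series
group
a    100
b      8
dtype: int64
```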

Returning a scalar for `transform(~)` yields:

```
# my_col is a Series representing a single column of each group
def f(my_col):
    # maximum value (scalar) in column gets broadcasted to become a Series of the same length as my_col
    return my_col.max()

df.groupby("group").transform(f)   # returns a DataFrame
   A    B
0  5  100
1  5  100
2  4    8
```
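One practical consequence of the difference in output size can be sketched as follows: the `apply` result has one entry per group, while the `transform` result has one row per input row and can therefore be assigned back to the original DataFrame. The new column name `A_group_max` is hypothetical and used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 5, 4], "B": [10, 100, 8], "group": ["a", "a", "b"]})

grouped = df.groupby("group")[["A", "B"]]

# One scalar per group: the result has as many entries as there are groups
per_group = grouped.apply(lambda my_df: my_df.max().max())

# Scalar broadcast per column: the result has as many rows as df
per_row = grouped.transform(lambda my_col: my_col.max())

print(len(per_group), len(per_row))

# The transform result shares df's index, so it can be stored as a new column
df["A_group_max"] = per_row["A"]
print(df)
```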