Creating a new column based on other columns in Pandas DataFrame

Last updated: Aug 10, 2023
Tags: Python, Pandas
To create a new column based on other columns, either:

• use column arithmetic for the fastest performance.

• use NumPy's `where(~)` method for creating binary columns

• use the `apply(~)` method, which is the slowest but offers the most flexibility

• use the Series' `replace(~)` method for mapping new values from existing columns.

Creating new columns using arithmetic

Consider the following DataFrame:

```
import pandas as pd

df = pd.DataFrame({"A":[3,4],"B":[5,6]}, index=["a","b"])
df
   A  B
a  3  5
b  4  6
```

The fastest and simplest way of creating a new column is to use column arithmetic:

```
df["C"] = df["A"] + df["B"]
df
   A  B   C
a  3  5   8
b  4  6  10
```

For slightly more complicated operations, use the DataFrame's native methods:

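```
# Select A and B explicitly so that an existing column C is not included in the maximum
df["C"] = df[["A","B"]].max(axis=1)
df
   A  B  C
a  3  5  5
b  4  6  6
```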

Note the following:

• we are populating the new column `C` with the maximum of each row (`axis=1`).

• the return type of `df.max(axis=1)` is `Series`.
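As a further illustration of the same idea, here is a minimal sketch that combines columns with a scalar and restricts a row-wise reduction to chosen columns. The column names `diff`, `A_doubled` and `row_mean` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4], "B": [5, 6]}, index=["a", "b"])

# Arithmetic between two columns is applied element-wise, row by row
df["diff"] = df["B"] - df["A"]

# Arithmetic between a column and a scalar broadcasts the scalar to every row
df["A_doubled"] = df["A"] * 2

# Selecting columns first keeps the reduction from picking up the new columns
df["row_mean"] = df[["A", "B"]].mean(axis=1)

print(df)
```

Each right-hand side is a `Series` aligned on the DataFrame's index, which is why it can be assigned directly as a new column.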

Creating binary column values

Consider the following Pandas DataFrame:

```
df = pd.DataFrame({'name':['Alex','Bob','Cathy'],'age':[20,30,40]})
df.head()
    name  age
0   Alex   20
1    Bob   30
2  Cathy   40
```

To create a new column of binary values that are based on the `age` column, use NumPy's `where(~)` method:

```
import numpy as np

df['status'] = np.where(df['age'] < 25, 'JUNIOR', 'SENIOR')
df.head()
    name  age  status
0   Alex   20  JUNIOR
1    Bob   30  SENIOR
2  Cathy   40  SENIOR
```

Here, the first argument of the `where(~)` method is a boolean mask. Where the mask is `True`, the resulting value is `'JUNIOR'`; otherwise, the value is `'SENIOR'`.
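To make the role of the mask more concrete, the sketch below builds the mask as a separate variable, then builds a second mask from two conditions. The `is_mid` column and the 25 to 35 age range are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alex", "Bob", "Cathy"], "age": [20, 30, 40]})

# The comparison yields a boolean Series with one entry per row
mask = df["age"] < 25
df["status"] = np.where(mask, "JUNIOR", "SENIOR")

# Conditions are combined with & (and) or | (or); each needs its own parentheses
in_range = (df["age"] >= 25) & (df["age"] < 35)
df["is_mid"] = np.where(in_range, 1, 0)

print(df)
```

Python's `and` and `or` keywords do not work element-wise on a Series, which is why the `&` and `|` operators are used instead.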

Creating column with multiple values

Once again, consider the following Pandas DataFrame:

```
df = pd.DataFrame({'name':['Alex','Bob','Cathy'],'age':[20,30,40]})
df.head()
    name  age
0   Alex   20
1    Bob   30
2  Cathy   40
```

To create a new column with multiple values based on the `age` column, use the DataFrame's `apply(~)` method:

```
def my_func(row):
    if row['age'] < 25:
        val = 'JUNIOR'
    elif row['age'] < 35:
        val = 'MID-LEVEL'
    else:
        val = 'SENIOR'
    return val

df['status'] = df.apply(my_func, axis=1)
df.head()
    name  age     status
0   Alex   20     JUNIOR
1    Bob   30  MID-LEVEL
2  Cathy   40     SENIOR
```

Here, because `axis=1`, the `apply(~)` method calls `my_func` once for each row, passing in a `Series` that represents that row.
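Since `apply(~)` runs a Python function for every row, it can be slow on large DataFrames. One possible vectorised alternative, not used in the example above, is NumPy's `select` function. The sketch below assumes the same three age bands.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Alex", "Bob", "Cathy"], "age": [20, 30, 40]})

# Conditions are checked in order; the first one that is True decides the value
conditions = [df["age"] < 25, df["age"] < 35]
choices = ["JUNIOR", "MID-LEVEL"]

# Rows that match none of the conditions receive the default
df["status"] = np.select(conditions, choices, default="SENIOR")

print(df)
```

The ordering of `conditions` plays the same role as the `if`/`elif` chain: a 20-year-old satisfies both conditions, but only the first match is used.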

Creating column via mapping

Consider the same Pandas DataFrame as before:

```
df = pd.DataFrame({'name':['Alex','Bob','Cathy'],'age':[20,30,40]})
df.head()
    name  age
0   Alex   20
1    Bob   30
2  Cathy   40
```

To create a new column that is based on some mapping of an existing column, pass a dictionary to the Series' `replace(~)` method:

```
mapping = {
    'Alex': 'ALEX',
    'Bob': 'BOB',
    'Cathy': 'CATHY'
}

df['upper_name'] = df['name'].replace(mapping)
df.head()
    name  age upper_name
0   Alex   20       ALEX
1    Bob   30        BOB
2  Cathy   40      CATHY
```
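When the mapping does not cover every value, the choice of method matters. The sketch below contrasts `replace(~)` with the Series' `map(~)` method, which is a related method not covered above. The `partial` dictionary and the two column names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alex", "Bob", "Cathy"], "age": [20, 30, 40]})

# A mapping that deliberately has no entry for "Cathy"
partial = {"Alex": "ALEX", "Bob": "BOB"}

# replace(~) leaves values without a match unchanged
df["replaced"] = df["name"].replace(partial)

# map(~) turns values without a match into NaN
df["mapped"] = df["name"].map(partial)

print(df)
```

`replace(~)` is the safer choice when only some values should change, while `map(~)` makes missing mapping entries easy to spot.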

Creating column using the assign method

Consider the following Pandas DataFrame:

```
df = pd.DataFrame({"A":[3,4],"B":[5,6]}, index=["a","b"])
df
   A  B
a  3  5
b  4  6
```

We can also use the DataFrame's `assign(~)` method. Each keyword argument names a new column, and its value can be a function that takes the DataFrame as input and returns the new column values:

```
def foo(df):
    if df["A"].sum() > df["B"].sum():
        return [-1,-1]
    else:
        return [0,0]

df.assign(C=foo)
   A  B  C
a  3  5  0
b  4  6  0
```

Note the following:

• if the sum of column `A` is larger than that of column `B`, then `[-1,-1]` will be used as the new column, otherwise `[0,0]` will be used.

• the keyword argument (`C`) became the new column label.
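A common way to use `assign(~)` is with lambda functions, as in the minimal sketch below. The variable `new_df` and the column `D` are made up for this example. It also shows that `assign(~)` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 4], "B": [5, 6]}, index=["a", "b"])

# Each lambda receives the DataFrame built so far,
# so D can refer to the column C created just before it
new_df = df.assign(
    C=lambda d: d["A"] + d["B"],
    D=lambda d: d["C"] * 2,
)

print(new_df)
print(df.columns.tolist())  # the original DataFrame still has only A and B
```

Because the result is a new DataFrame, it must be assigned to a variable (or chained into further method calls) for the new columns to be kept.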

