search
Search
Login
Unlock 100+ guides
menu
menu
web
search toc
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
What does this mean?
Why is this true?
Give me some examples!
search
keyboard_voice
close
Searching Tips
Search for a recipe:
"Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to

Pandas DataFrame | join method

schedule Aug 12, 2023
Last updated
local_offer
PythonPandas
Tags
mode_heat
Master the mathematics behind data science with 100+ top-tier guides
Start your free 7-days trial now!

Pandas DataFrame.join(~) merges the source DataFrame with another Series or DataFrames.

NOTE

The join(~) method is a wrapper around the merge(~) method, so if you want more control over the join process, consider using merge(~) instead.

Parameters

1. otherlink | Series or DataFrame or list of DataFrames

The other object to join with.

2. onlink | string or list | optional

The column or index level name of the source DataFrame to perform the join on. The index of the other will be used for the join. If you want to specify a non-index column to join on for other, use merge(~) instead, which has a right_on parameter.

3. howlink | string | optional

The type of join to perform:

Value

Description

"left"link

All rows from the source DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a left-join.

"right"link

All rows from the right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a right-join.

"outer"link

All rows from the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of an outer-join.

"inner"link

All rows that have matching values in both the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent to inner-join.

By default, how="left".

Here's the classic Venn Diagram illustrating the differences:

4. lsuffixlink | string | optional

The suffix to append to the overlapping label of the source DataFrame. This is only relevant if there are duplicate column labels in the result. By default, lsuffix="".

5. rsuffixlink | string | optional

The suffix to append to the overlapping label of other. This is only relevant if there are duplicate column labels in the result. By default, rsuffix="".

6. sortlink | boolean | optional

Whether or not to sort the rows based on the join key. By default, sort=False.

Return Value

A merged DataFrame.

Examples

Basic usage

Consider the following DataFrame about some products of a shop:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
"bought_by": ["bob", "alex", "bob"]},
index=["A","B","C"])
df_products
product bought_by
A computer bob
B smartphone alex
C headphones bob

Here's a DataFrame about some customers of the shop:

df_customers = pd.DataFrame({"age": [10, 20, 30]},
index=["alex","bob","cathy"])
df_customers
age
alex 10
bob 20
cathy 30

To perform a left-join on the bought_by column of df_products:

df_products.join(df_customers, on="bought_by") # how="left"
product bought_by age
A computer bob 20
B smartphone alex 10
C headphones bob 20

By default, the index of other will be used as the join key. If you want more flexibility as to which columns are used for the join, use the merge(~) method instead.

Comparison of different joins

Consider the following DataFrames about products and customers:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
"bought_by": ["bob", "alex", "david"]},
index=["A","B","C"])
df_customers = pd.DataFrame({"age": [10, 20, 30]}, index=["alex","bob","cathy"])
[df_products] | [df_customers]
product bought_by | age
A computer bob | alex 10
B smartphone alex | bob 20
C headphones david | cathy 30

Left join

df_products.join(df_customers, on="bought_by", how="left")
product bought_by age
A computer bob 20.0
B smartphone alex 10.0
C headphones david NaN

Right join

df_products.join(df_customers, on="bought_by", how="right")
product bought_by age
B smartphone alex 10
A computer bob 20
NaN NaN cathy 30

Inner join

df_products.join(df_customers, on="bought_by", how="inner")
product bought_by age
A computer bob 20
C headphones bob 20
B smartphone alex 10

Outer join

df_products.join(df_customers, on="bought_by", how="outer")
product bought_by age
A computer bob 20
C headphones bob 20
B smartphone alex 10
NaN NaN cathy 30

Specifying lsuffix and rsuffix

Suppose we wanted to join the following DataFrames:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
"age": [5,6,7],
"bought_by": ["bob", "alex", "bob"]},
index=["A","B","C"])
df_customers = pd.DataFrame({"age": [10, 20, 30]},
index=["alex","bob","cathy"])
product age bought_by | age
A computer 5 bob | alex 10
B smartphone 6 alex | bob 20
C headphones 7 bob | cathy 30

Notice how both DataFrames have the age column.

Due to this overlap in the naming, performing the join results in a ValueError:

df_products.join(df_customers, on="bought_by")
ValueError: columns overlap but no suffix specified: Index(['age'], dtype='object')

This error can be resolved by specifying lsuffix or rsuffix:

df_products.join(df_customers, on="bought_by", lsuffix="_product")
product age_product bought_by age
A computer 5 bob 10
B smartphone 6 alex 10
C headphones 7 bob 20

Note that lsuffix and rsuffix only take effect when there are duplicate column labels.

Specifying sort

Consider the same example as before:

[df_products] | [df_customers]
product bought_by | age
A computer bob | alex 10
B smartphone alex | bob 20
C headphones bob | cathy 30

By default, sort=False, which means that the resulting DataFrame's rows are not sorted by the join key (bought_by):

df_products.join(df_customers, on="bought_by") # sort=False
product bought_by age
A computer bob 20
B smartphone alex 10
C headphones bob 20

Setting sort=True yields:

df_products.join(df_customers, on="bought_by", sort=True)
product bought_by age
B smartphone alex 10
A computer bob 20
C headphones bob 20
robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
thumb_up
thumb_down
Comment
Citation
Ask a question or leave a feedback...
thumb_up
1
thumb_down
0
chat_bubble_outline
0
settings
Enjoy our search
Hit / to insta-search docs and recipes!