search
Search
Publish
menu
menu search toc more_vert
Robocat
Guest 0reps
Thanks for the thanks!
close
Comments
Log in or sign up
Cancel
Post
account_circle
Profile
exit_to_app
Sign out
help Ask a question
Share on Twitter
search
keyboard_voice
close
Searching Tips
Search for a recipe: "Creating a table in MySQL"
Search for an API documentation: "@append"
Search for code: "!dataframe"
Apply a tag filter: "#python"
Useful Shortcuts
/ to open search panel
Esc to close search panel
to navigate between search results
d to clear all current filters
Enter to expand content preview
icon_star
Doc Search
icon_star
Code Search Beta
SORRY NOTHING FOUND!
mic
Start speaking...
Voice search is only supported in Safari and Chrome.
Navigate to
A
A
share
thumb_up_alt
bookmark
arrow_backShare
Twitter
Facebook
thumb_up
1
thumb_down
0
chat_bubble_outline
0
auto_stories new
settings

Pandas DataFrame | merge method

Programming
chevron_right
Python
chevron_right
Pandas
chevron_right
Documentation
chevron_right
DataFrame
chevron_right
Combining DataFrames
schedule May 25, 2022
Last updated
local_offer PythonPandas
Tags

Pandas DataFrame.merge(~) method merges the source DataFrame with another DataFrame or a named Series.

Parameters

1. right | DataFrame or named Series

The DataFrame or Series to merge the source DataFrame with.

2. how | string | optional

The type of merge to perform:

Value

Description

"left"link

All rows from the source DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a left-join.

"right"link

All rows from the right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a right-join.

"outer"link

All rows from the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of an outer-join.

"inner"link

All rows that have matching values in both the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent to inner-join.

By default, how="inner".

Here's the classic Venn Diagram illustrating the differences:

3. on | string or list | optional

The label of the columns or index-levels to perform the join on. This only works if both left and right DataFrames have this same label. By default, on=None, which means that an inner join will be performed.

NOTE

The on parameter is there only for convenience. If the columns to join on have different labels, then you must use left_on, right_on, left_index and right_index instead.

4. left_on | string or array-like | optional

The label(s) of the column or index level of the left DataFrame to perform the join on.

5. right_on | string or array-like | optional

The label(s) of the column or index level of the right DataFrame to perform the join on.

6. left_index | boolean | optional

Whether or not to perform the join on the index of the left DataFrame. By default, left_index=False.

7. right_index | boolean | optional

Whether or not to perform the join on the index of the right DataFrame. By default, right_index=False.

NOTE

Many textbooks and documentation use the words merge keys or join keys to denote the columns by which a join is performed.

8. sortlink | boolean | optional

Whether or not to sort the rows based on the join key. By default, sort=False.

9. suffixeslink | tuple of (string, string) | optional

The suffix names to append to the duplicate column labels in the resulting DataFrame. By default, suffixes=("_x", "_y").

10. copy | boolean | optional

  • If True, then return a new copy of the DataFrame.

  • If False, then avoid creating a new copy if possible.

By default, copy=True.

11. indicatorlink | boolean or string | optional

Whether or not to add append an extra column called _merge, which tells us which DataFrame the row was constructed from. By default, indicator=False.

12. validatelink | string | optional

The validation logic to run:

Value

Description

Checks whether the merge keys in the left and right are unique.

"one_to_one" or "1:1"

Checks whether the merge keys in the left are unique.

"one_to_many" or "1:m"

Checks whether the merge keys in the right are unique.

"many_to_one" or "m:1"

No check is performed.

"many_to_many" or "m:m"

By default, validate=None.

Return value

A merged DataFrame.

Examples

Performing an inner-join

Suppose a shopkeeper has the following data:

id

product

bought_by

A

computer

1

B

smartphone

3

C

headphones

NaN

id

name

age

1

alex

10

2

bob

20

3

cathy

30

Here, the top table is about the products that the shop has, while the bottom table is about the profile of registered customers.

Suppose we wanted to see the profile of the customers who have purchased a product. To do this, we need to perform an inner join on the bought_by column of the products table and the index labels of the customers table.

First off, let's create the two DataFrames. Here's the products DataFrame:

df_products = pd.DataFrame({"product": ["computer", "smartphone", "headphones"],
"bought_by": [1, 3, pd.np.NaN]},
index=["A","B","C"])
df_products
product bought_by
A computer 1.0
B smartphone 3.0
C headphones NaN

Here's the customers DataFrame:

df_customers = pd.DataFrame({"name": ["alex", "bob", "cathy"],
"age": [10, 20, 30]},
index=[1,2,3])
df_customers
name age
1 alex 10
2 bob 20
3 cathy 30

Now, to perform the inner-join:

df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True)
product bought_by name age
A computer 1.0 alex 10
B smartphone 3.0 cathy 30

Note the following:

  • left_on="bought_by" indicates that we want to align using the bought_by column of the left DataFrame.

  • right_index=True indicates that we want to align using the index of the right DataFrame. We can also merge based on columns instead by specifying the right_on parameter.

  • You can omit the how parameter here since the default value is "inner".

  • Notice the ordering of the columns - the columns of the source DataFrame (i.e. df_products) come first, followed by those of the right DataFrame.

Performing a left-join

To demonstrate a left-join, we shall use the same example of products and customers as before:

[df_products] | [df_customers]
product bought_by | name age
A computer 1.0 | 1 alex 10
B smartphone 3.0 | 2 bob 20
C headphones NaN | 3 cathy 30

To see all products as well as the profile of the customers who have purchased the products, we need to perform a left-join like so:

df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True)
product bought_by name age
A computer 1.0 alex 10.0
B smartphone 3.0 cathy 30.0
C headphones NaN NaN NaN
  • left_on="bought_by" indicates that we want to align using the bought_by column of the left DataFrame.

  • right_index=True indicates that we want to align using the index of the right DataFrame.

  • Notice the ordering of the columns - the columns of the source DataFrame (i.e. df_products) come first, followed by those of the right DataFrame.

Performing a right-join

To demonstrate a right-join, we shall use the same example of products and customers as before:

[df_products] | [df_customers]
product bought_by | name age
A computer 1.0 | 1 alex 10
B smartphone 3.0 | 2 bob 20
C headphones NaN | 3 cathy 30

To see the profile of all customers as well as the products they purchased, we need to perform a right-join like so:

df_products.merge(df_customers, how="right", left_on="bought_by", right_index=True)
product bought_by name age
A computer 1.0 alex 10
B smartphone 3.0 cathy 30
NaN NaN 2.0 bob 20

Performing a full-join

To demonstrate a full-join, we shall use the same example of products and customers as before:

[df_products] | [df_customers]
product bought_by | name age
A computer 1.0 | 1 alex 10
B smartphone 3.0 | 2 bob 20
C headphones NaN | 3 cathy 30

To see the profile of all customers as well as all products, we need to perform a full-join like so:

df_products.merge(df_customers, how="outer", left_on="bought_by", right_index=True)
product bought_by name age
A computer 1.0 alex 10.0
B smartphone 3.0 cathy 30.0
C headphones NaN NaN NaN
NaN NaN 2.0 bob 20.0

Specifying the sort parameter

To demonstrate what the sort parameter does, we shall use the same example of products and customers as before:

[df_products] | [df_customers]
product bought_by | name age
A computer 3.0 | 1 alex 10
B smartphone 1.0 | 2 bob 20
C headphones NaN | 3 cathy 30

Just as a side note, the ordering of the values in the bought_by column was swapped as this will allow us to see the behaviour of sort=True.

Suppose we performed a left-join, like so:

df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True, sort=True)
product bought_by name age
B smartphone 1.0 alex 10.0
A computer 3.0 cathy 30.0
C headphones NaN NaN NaN

Notice how the resulting DataFrame is sorted based on the join key - the bought_by column. Without sort=True, the column would not be sorted, that is, the original order will be kept.

Specifying the suffixes parameter

To demonstrate what the suffix parameter does, we shall use the same example of products and customers as before:

[df_products] | [df_customers]
name bought_by | name age
A computer 1.0 | 1 alex 10
B smartphone 3.0 | 2 bob 20
C headphones NaN | 3 cathy 30

Note that previously we used the column label product for the product names, but for this example, we will be using the column label name instead. We now have an overlap in the column names.

To see how Pandas deal with this by default, let's perform an inner-join:

df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True)
name_x bought_by name_y age
A computer 1.0 alex 10
B smartphone 3.0 cathy 30

We see that Pandas has added the suffixes _x and _y to differentiate the columns.

We can override this behaviour by specifying the suffixes parameter, which simply takes in a tuple of suffixes:

df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True, suffixes=("_product", "_customer"))
name_product bought_by name_customer age
A computer 1.0 alex 10
B smartphone 3.0 cathy 30

Observe how the column names now reflect our specified suffixes.

Specifying the indicator parameter

To demonstrate what the indicator parameter does, we shall use the same example of products and customers as before:

[df_products] | [df_customers]
product bought_by | name age
A computer 1.0 | 1 alex 10
B smartphone 3.0 | 2 bob 20
C headphones NaN | 3 cathy 30

Suppose we performed a left join with indicator=True:

df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True, indicator=True)
product bought_by name age _merge
A computer 1.0 alex 10.0 both
B smartphone 3.0 cathy 30.0 both
C headphones NaN NaN NaN left_only

Note the following:

  • Notice how we end up with an additional column at the end called _merge. This column tells us which DataFrame the row was constructed from.

  • The first two rows have the values "both", which means that a match was found in both the source DataFrame and the right DataFrame. You can easily confirm that this is True by looking back at our two DataFrames.

  • The last row has the value "left_only", which means that the row comes from the source DataFrame.

Specifying the validate parameter

We'll use the same example about products and customers again:

[df_products] | [df_customers]
product bought_by | name age
A computer 1.0 | 1 alex 10
B smartphone 3.0 | 2 bob 20
C headphones 3.0 | 3 cathy 30

For this example, we changed the "bought_by" value of headphones from NaN to 3.

Let's perform an inner-join on the customer id:

df_products.merge(df_customers, how="inner", right_index=True, left_on="bought_by")
product bought_by name age
A computer 3 cathy 30
C headphones 3 cathy 30
B smartphone 1 alex 10

Here, one customer (Cathy) has bought 2 products, so this is a 1:2 mapping, which is often written as 1:m where m just represents a number greater than 1.

Let's now call the exact same function, but with validate="1:1":

df_products.merge(df_customers, how="inner", right_index=True, left_on="bought_by", validate="1:1")
MergeError: Merge keys are not unique in left dataset; not a one-to-one merge

Since this is not a 1:1 mapping, an error is thrown. On a more general note, if the left DataFrame has duplicate values, then validation rules "1:1" and "1:m" would throw an error.

robocat
Published by Isshin Inada
Edited by 0 others
Did you find this page useful?
thumb_up
thumb_down
Ask a question or leave a feedback...