Pandas DataFrame | merge method
Start your free 7-days trial now!
Pandas DataFrame.merge(~) method merges the source DataFrame with another DataFrame or a named Series.
Parameters
1. right | DataFrame or named Series
The DataFrame or Series to merge the source DataFrame with.
2. how | string | optional
The type of merge to perform:
Value | Description |
|---|---|
All rows from the source DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a left-join. | |
All rows from the right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of a right-join. | |
All rows from the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent of an outer-join. | |
All rows that have matching values in both the source and right DataFrame will be present in the resulting DataFrame. This is the SQL equivalent to inner-join. |
By default, how="inner".
Here's the classic Venn Diagram illustrating the differences:
3. on | string or list | optional
The label of the columns or index-levels to perform the join on. This only works if both left and right DataFrames have this same label. By default, on=None, which means that an inner join will be performed.
The on parameter is there only for convenience. If the columns to join on have different labels, then you must use left_on, right_on, left_index and right_index instead.
4. left_on | string or array-like | optional
The label(s) of the column or index level of the left DataFrame to perform the join on.
5. right_on | string or array-like | optional
The label(s) of the column or index level of the right DataFrame to perform the join on.
6. left_index | boolean | optional
Whether or not to perform the join on the index of the left DataFrame. By default, left_index=False.
7. right_index | boolean | optional
Whether or not to perform the join on the index of the right DataFrame. By default, right_index=False.
Many textbooks and documentation use the words merge keys or join keys to denote the columns by which a join is performed.
8. sortlink | boolean | optional
Whether or not to sort the rows based on the join key. By default, sort=False.
9. suffixeslink | tuple of (string, string) | optional
The suffix names to append to the duplicate column labels in the resulting DataFrame. By default, suffixes=("_x", "_y").
10. copy | boolean | optional
If
True, then return a new copy of the DataFrame.If
False, then avoid creating a new copy if possible.
By default, copy=True.
11. indicatorlink | boolean or string | optional
Whether or not to add append an extra column called _merge, which tells us which DataFrame the row was constructed from. By default, indicator=False.
12. validatelink | string | optional
The validation logic to run:
Value | Description |
|---|---|
| Checks whether the merge keys in the left and right are unique. |
| Checks whether the merge keys in the left are unique. |
| Checks whether the merge keys in the right are unique. |
| No check is performed. |
By default, validate=None.
Return value
A merged DataFrame.
Examples
Performing an inner-join
Suppose a shopkeeper has the following data:
id | product | bought_by |
|---|---|---|
A | computer | 1 |
B | smartphone | 3 |
C | headphones | NaN |
id | name | age |
|---|---|---|
1 | alex | 10 |
2 | bob | 20 |
3 | cathy | 30 |
Here, the top table is about the products that the shop has, while the bottom table is about the profile of registered customers.
Suppose we wanted to see the profile of the customers who have purchased a product. To do this, we need to perform an inner join on the bought_by column of the products table and the index labels of the customers table.
First off, let's create the two DataFrames. Here's the products DataFrame:
"bought_by": [1, 3, pd.np.NaN]}, index=["A","B","C"])df_products
product bought_byA computer 1.0B smartphone 3.0C headphones NaN
Here's the customers DataFrame:
"age": [10, 20, 30]}, index=[1,2,3])df_customers
name age1 alex 102 bob 203 cathy 30
Now, to perform the inner-join:
df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True)
product bought_by name ageA computer 1.0 alex 10B smartphone 3.0 cathy 30
Note the following:
left_on="bought_by"indicates that we want to align using thebought_bycolumn of the left DataFrame.right_index=Trueindicates that we want to align using the index of the right DataFrame. We can also merge based on columns instead by specifying theright_onparameter.You can omit the
howparameter here since the default value is"inner".Notice the ordering of the columns - the columns of the source DataFrame (i.e.
df_products) come first, followed by those of the right DataFrame.
Performing a left-join
To demonstrate a left-join, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
To see all products as well as the profile of the customers who have purchased the products, we need to perform a left-join like so:
df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True)
product bought_by name ageA computer 1.0 alex 10.0B smartphone 3.0 cathy 30.0C headphones NaN NaN NaN
left_on="bought_by"indicates that we want to align using thebought_bycolumn of the left DataFrame.right_index=Trueindicates that we want to align using the index of the right DataFrame.Notice the ordering of the columns - the columns of the source DataFrame (i.e.
df_products) come first, followed by those of the right DataFrame.
Performing a right-join
To demonstrate a right-join, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
To see the profile of all customers as well as the products they purchased, we need to perform a right-join like so:
df_products.merge(df_customers, how="right", left_on="bought_by", right_index=True)
product bought_by name ageA computer 1.0 alex 10B smartphone 3.0 cathy 30NaN NaN 2.0 bob 20
Performing a full-join
To demonstrate a full-join, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
To see the profile of all customers as well as all products, we need to perform a full-join like so:
df_products.merge(df_customers, how="outer", left_on="bought_by", right_index=True)
product bought_by name ageA computer 1.0 alex 10.0B smartphone 3.0 cathy 30.0C headphones NaN NaN NaNNaN NaN 2.0 bob 20.0
Specifying the sort parameter
To demonstrate what the sort parameter does, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 3.0 | 1 alex 10B smartphone 1.0 | 2 bob 20C headphones NaN | 3 cathy 30
Just as a side note, the ordering of the values in the bought_by column was swapped as this will allow us to see the behaviour of sort=True.
Suppose we performed a left-join, like so:
df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True, sort=True)
product bought_by name ageB smartphone 1.0 alex 10.0A computer 3.0 cathy 30.0C headphones NaN NaN NaN
Notice how the resulting DataFrame is sorted based on the join key - the bought_by column. Without sort=True, the column would not be sorted, that is, the original order will be kept.
Specifying the suffixes parameter
To demonstrate what the suffix parameter does, we shall use the same example of products and customers as before:
[df_products] | [df_customers] name bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
Note that previously we used the column label product for the product names, but for this example, we will be using the column label name instead. We now have an overlap in the column names.
To see how Pandas deal with this by default, let's perform an inner-join:
df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True)
name_x bought_by name_y ageA computer 1.0 alex 10B smartphone 3.0 cathy 30
We see that Pandas has added the suffixes _x and _y to differentiate the columns.
We can override this behaviour by specifying the suffixes parameter, which simply takes in a tuple of suffixes:
df_products.merge(df_customers, how="inner", left_on="bought_by", right_index=True, suffixes=("_product", "_customer"))
name_product bought_by name_customer ageA computer 1.0 alex 10B smartphone 3.0 cathy 30
Observe how the column names now reflect our specified suffixes.
Specifying the indicator parameter
To demonstrate what the indicator parameter does, we shall use the same example of products and customers as before:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones NaN | 3 cathy 30
Suppose we performed a left join with indicator=True:
df_products.merge(df_customers, how="left", left_on="bought_by", right_index=True, indicator=True)
product bought_by name age _mergeA computer 1.0 alex 10.0 bothB smartphone 3.0 cathy 30.0 bothC headphones NaN NaN NaN left_only
Note the following:
Notice how we end up with an additional column at the end called
_merge. This column tells us which DataFrame the row was constructed from.The first two rows have the values
"both", which means that a match was found in both the source DataFrame and therightDataFrame. You can easily confirm that this isTrueby looking back at our two DataFrames.The last row has the value
"left_only", which means that the row comes from the source DataFrame.
Specifying the validate parameter
We'll use the same example about products and customers again:
[df_products] | [df_customers] product bought_by | name ageA computer 1.0 | 1 alex 10B smartphone 3.0 | 2 bob 20C headphones 3.0 | 3 cathy 30
For this example, we changed the "bought_by" value of headphones from NaN to 3.
Let's perform an inner-join on the customer id:
df_products.merge(df_customers, how="inner", right_index=True, left_on="bought_by")
product bought_by name ageA computer 3 cathy 30C headphones 3 cathy 30B smartphone 1 alex 10
Here, one customer (Cathy) has bought 2 products, so this is a 1:2 mapping, which is often written as 1:m where m just represents a number greater than 1.
Let's now call the exact same function, but with validate="1:1":
df_products.merge(df_customers, how="inner", right_index=True, left_on="bought_by", validate="1:1")
MergeError: Merge keys are not unique in left dataset; not a one-to-one merge
Since this is not a 1:1 mapping, an error is thrown. On a more general note, if the left DataFrame has duplicate values, then validation rules "1:1" and "1:m" would throw an error.