PySpark DataFrame | join method

Last updated: Aug 12, 2023

Tags: PySpark

PySpark DataFrame's join(~) method joins two DataFrames using the given join method.

Parameters

1. other | DataFrame

The other PySpark DataFrame with which to join.

2. on | string or list or Column | optional

The columns to perform the join on.

3. how | string | optional

By default, how="inner". See the examples below for the types of joins supported.

Return Value

A PySpark DataFrame (pyspark.sql.dataframe.DataFrame).

Examples

Performing inner, left and right joins

Consider the following PySpark DataFrames:

df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+

The other PySpark DataFrame:

df2 = spark.createDataFrame([["Alex", 250], ["Bob", 200], ["Doge", 100]], ["name", "salary"])
df2.show()
+----+------+
|name|salary|
+----+------+
|Alex| 250|
| Bob| 200|
|Doge| 100|
+----+------+

Inner join

For an inner join, only rows with matching join-key values in both the left and right DataFrames will be present in the resulting DataFrame:

df1.join(df2, on="name", how="inner").show()
+----+---+------+
|name|age|salary|
+----+---+------+
|Alex| 20| 250|
| Bob| 24| 200|
+----+---+------+

Left join and left-outer join

For a left join (or left-outer join), all rows in the left DataFrame and matching rows in the right DataFrame will be present in the resulting DataFrame:

df1.join(df2, on="name", how="left").show() # how="left_outer" also works
+-----+---+------+
| name|age|salary|
+-----+---+------+
| Alex| 20| 250|
| Bob| 24| 200|
|Cathy| 22| null|
+-----+---+------+

Right join and right-outer join

For a right join (or right-outer join), all rows in the right DataFrame and matching rows in the left DataFrame will be present in the resulting DataFrame:

df1.join(df2, on="name", how="right").show() # how="right_outer" also works
+----+----+------+
|name| age|salary|
+----+----+------+
|Alex| 20| 250|
| Bob| 24| 200|
|Doge|null| 100|
+----+----+------+

Performing outer join

Consider the same PySpark DataFrames as before:

df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+

This is the other PySpark DataFrame:

df2 = spark.createDataFrame([["Alex", 250], ["Bob", 200], ["Doge", 100]], ["name", "salary"])
df2.show()
+----+------+
|name|salary|
+----+------+
|Alex| 250|
| Bob| 200|
|Doge| 100|
+----+------+

For an outer join (also called a full outer join), all rows from both the left and right DataFrames will be present, with null filling in the missing values:

df1.join(df2, on="name", how="outer").show() # how="full" or "fullouter" also works
+-----+----+------+
| name| age|salary|
+-----+----+------+
| Alex| 20| 250|
| Bob| 24| 200|
|Cathy| 22| null|
| Doge|null| 100|
+-----+----+------+

Performing left-anti and left-semi joins

Consider the same PySpark DataFrames as before:

df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+

This is the other DataFrame:

df2 = spark.createDataFrame([["Alex", 250], ["Bob", 200], ["Doge", 100]], ["name", "salary"])
df2.show()
+----+------+
|name|salary|
+----+------+
|Alex| 250|
| Bob| 200|
|Doge| 100|
+----+------+

Left anti-join

For a left anti-join, all rows in the left DataFrame whose join key has no match in the right DataFrame will be in the resulting DataFrame:

df1.join(df2, on="name", how="left_anti").show() # how="leftanti" also works
+-----+---+
| name|age|
+-----+---+
|Cathy| 22|
+-----+---+

Left semi-join

A left semi-join is the complement of the left anti-join: all rows in the left DataFrame whose join key has a match in the right DataFrame will be in the resulting DataFrame. Note that only the left DataFrame's columns are kept:

df1.join(df2, on="name", how="left_semi").show() # how="leftsemi" also works
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 24|
+----+---+

Performing join on different column names

Up to now, we have specified the join key using the on parameter. Let's now consider the case when the join keys have different labels. Suppose one DataFrame is as follows:

df1 = spark.createDataFrame([["Alex", 20], ["Bob", 24], ["Cathy", 22]], ["name", "age"])
df1.show()
+-----+---+
| name|age|
+-----+---+
| Alex| 20|
| Bob| 24|
|Cathy| 22|
+-----+---+

Suppose the other DataFrame is as follows:

df2 = spark.createDataFrame([["Alex", 250], ["Bob", 200], ["Doge", 100]], ["NAME", "salary"])
df2.show()
+----+------+
|NAME|salary|
+----+------+
|Alex| 250|
| Bob| 200|
|Doge| 100|
+----+------+

We can join using the name column of df1 and the NAME column of df2 like so:

cond = [df1["name"] == df2["NAME"]]
df1.join(df2, on=cond, how="inner").show()
+----+---+----+------+
|name|age|NAME|salary|
+----+---+----+------+
|Alex| 20|Alex| 250|
| Bob| 24| Bob| 200|
+----+---+----+------+

Note that both name and NAME appear in the result, because joining on a Column expression does not merge the join keys. Also, since on accepts a list, we can supply multiple join conditions this way.

Published by Isshin Inada