PySpark DataFrame | join method
PySpark DataFrame's join(~)
method joins two DataFrames using the given join method.
Parameters
1. other
| DataFrame
The other PySpark DataFrame with which to join.
2. on
| string
or list
or Column
| optional
The columns to perform the join on.
3. how
| string
| optional
By default, how="inner"
. See examples below for the type of joins implemented.
Return Value
A PySpark DataFrame (pyspark.sql.dataframe.DataFrame
).
Examples
Performing inner, left and right joins
Consider the following PySpark DataFrames:
+-----+---+| name|age|+-----+---+| Alex| 20|| Bob| 24||Cathy| 22|+-----+---+
The other PySpark DataFrame:
+----+------+|name|salary|+----+------+|Alex| 250|| Bob| 200||Doge| 100|+----+------+
Inner join
For inner join, all rows that have matching values in both the source and right DataFrame will be present in the resulting DataFrame:

+----+---+------+|name|age|salary|+----+---+------+|Alex| 20| 250|+----+---+------+
Left join and left-outer join
For left join (or left-outer join), all rows in the left DataFrame and matching rows in the right DataFrame will be present in the resulting DataFrame:

+-----+---+------+| name|age|salary|+-----+---+------+| Alex| 20| 250|| Bob| 24| 200||Cathy| 22| null|+-----+---+------+
Right join and right-outer join
For right (right-outer) join, all rows in the right DataFrame and matching rows in the left DataFrame will be present in the resulting DataFrame:

+----+----+------+|name| age|salary|+----+----+------+|Alex| 20| 250|| Bob| 24| 200||Doge|null| 100|+----+----+------+
Performing outer join
Consider the same PySpark DataFrames as before:
+-----+---+| name|age|+-----+---+| Alex| 20|| Bob| 24||Cathy| 22|+-----+---+
This is the other PySpark DataFrame:
+----+------+|name|salary|+----+------+|Alex| 250|| Bob| 200||Doge| 100|+----+------+
For outer join, both the left and right DataFrames will be present:

+-----+----+------+| name| age|salary|+-----+----+------+| Alex| 20| 250|| Bob| 24| 200||Cathy| 22| null|| Doge|null| 100|+-----+----+------+
Performing left-anti and left-semi joins
Consider the same PySpark DataFrames as before:
+-----+---+| name|age|+-----+---+| Alex| 20|| Bob| 24||Cathy| 22|+-----+---+
This is the other DataFrame:
+----+------+|name|salary|+----+------+|Alex| 250|| Bob| 200||Doge| 100|+----+------+
Left anti-join
For left anti-join, all rows in the left DataFrame that are not present in the right DataFrame will be in the resulting DataFrame:

+-----+---+| name|age|+-----+---+|Cathy| 22|+-----+---+
Left semi-join
Left semi-join is the opposite of left-anti join, that is, all rows in the left DataFrame that are present in the right DataFrame will be in the resulting DataFrame:
+----+---+|name|age|+----+---+|Alex| 20|| Bob| 24|+----+---+