PySpark DataFrame | intersectAll method
PySpark DataFrame's intersectAll(~) method returns a new PySpark DataFrame with rows that also exist in the other PySpark DataFrame. Unlike intersect(~), the
intersectAll(~) method preserves duplicates.
The intersectAll(~) method is identical to the
INTERSECT ALL statement in SQL.
Parameters
other | PySpark DataFrame
The other PySpark DataFrame.
Return Value
A new PySpark DataFrame.
Consider the following PySpark DataFrame:
Suppose the other PySpark DataFrame is:
Here, note the following:
the only matching row is Alex's row (name Alex, age 20).
Alex's row appears twice in both df and df_other.
Getting rows that also exist in the other PySpark DataFrame while preserving duplicates
To get the rows that also exist in the other PySpark DataFrame while preserving duplicates:
df_res = df.intersectAll(df_other)
df_res.show()
+----+---+
|name|age|
+----+---+
|Alex| 20|
|Alex| 20|
+----+---+
Note the following:
Alex's row is duplicated in the result because Alex's row appears twice in both df and df_other.
if Alex's row only appeared once in one DataFrame but appeared multiple times in the other, Alex's row would only be included once in the resulting DataFrame.
if you want to include duplicate rows only once, then use the intersect(~) method.
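The duplicate-handling rule described above (a row is kept as many times as it appears in whichever DataFrame has fewer copies of it) can be illustrated without Spark. The following is a minimal pure-Python sketch of that rule using collections.Counter; the helper name intersect_all and the sample rows are hypothetical, and it models only the semantics, not how Spark computes the result.

```python
from collections import Counter

def intersect_all(left, right):
    """Multiset intersection: keep each row min(count in left, count in right) times."""
    # Counter & Counter keeps, for each row, the smaller of the two counts
    common = Counter(left) & Counter(right)
    return sorted(common.elements())

# Alex's row appears twice on the left but only once on the right
left = [("Alex", 20), ("Alex", 20), ("Bob", 30)]
right = [("Alex", 20), ("Cathy", 40)]

print(intersect_all(left, right))  # [('Alex', 20)]
```

When both inputs contain Alex's row twice, the same helper returns it twice, which mirrors the output shown earlier.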