PySpark DataFrame | intersect method

Last updated: Aug 12, 2023


PySpark DataFrame's intersect(~) method returns a new PySpark DataFrame containing only the rows that appear in both the source DataFrame and another PySpark DataFrame. Unlike intersectAll(~), intersect(~) removes duplicates, so each matching row appears only once in the result.

NOTE

The intersect(~) method is equivalent to the INTERSECT statement in SQL.

Parameters

1. other | PySpark DataFrame

The other PySpark DataFrame with which to perform intersection.

Return Value

A new PySpark DataFrame.

Examples

Consider the following PySpark DataFrame:


        
        
            
                
                
df = spark.createDataFrame([("Alex", 20), ("Bob", 30), ("Cathy", 40)], ["name", "age"])
df.show()

+-----+---+
| name|age|
+-----+---+
| Alex| 20|
|  Bob| 30|
|Cathy| 40|
+-----+---+

Consider the other PySpark DataFrame:


        
        
            
                
                
df_other = spark.createDataFrame([("Alex", 20), ("Doge", 30), ("eric", 40)], ["name", "age"])
df_other.show()

+----+---+
|name|age|
+----+---+
|Alex| 20|
|Doge| 30|
|eric| 40|
+----+---+

Getting rows of PySpark DataFrame that exist in another PySpark DataFrame

To get rows of a PySpark DataFrame that exist in another PySpark DataFrame, use the intersect(~) method like so:


        
        
            
                
                
df_intersect = df.intersect(df_other)
df_intersect.show()

+----+---+
|name|age|
+----+---+
|Alex| 20|
+----+---+

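Here, only the row ("Alex", 20) is returned because it is the only row present in both PySpark DataFrames. Rows are compared across all columns, so ("Bob", 30) and ("Doge", 30) do not match even though their age values are the same.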


Published by Isshin Inada


Official PySpark Documentation

https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.intersect.html
