If True, no error is thrown when the column labels of the two DataFrames do not align. Columns that are missing from either DataFrame are filled with null values.
If False, an error is thrown when the column labels of the two DataFrames do not align.

By default, allowMissingColumns=False.

Return Value

A new PySpark DataFrame.

Examples

Concatenating PySpark DataFrames vertically by aligning columns

Consider the following PySpark DataFrame:

df1 = spark.createDataFrame([[1, 2, 3]], ["A", "B", "C"])
df1.show()

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  3|
+---+---+---+

Here's another PySpark DataFrame:

df2 = spark.createDataFrame([[4, 5, 6], [7, 8, 9]], ["A", "B", "C"])
df2.show()

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+

To concatenate these two DataFrames vertically by aligning the columns:

df1.unionByName(df2).show()

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+

Dealing with cases when column labels mismatch

By default, allowMissingColumns=False, so an error is thrown if the two DataFrames do not have exactly matching column labels.

For example, consider the following PySpark DataFrames:

df1 = spark.createDataFrame([[1, 2, 3]], ["A", "B", "C"])
df1.show()

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  2|  3|
+---+---+---+

Here's the other PySpark DataFrame, which has slightly different column labels:

df2 = spark.createDataFrame([[4, 5, 6], [7, 8, 9]], ["B", "C", "D"])
df2.show()

+---+---+---+
|  B|  C|  D|
+---+---+---+
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+

Since the column labels do not match, calling unionByName(~) will result in an error:

df1.unionByName(df2).show()   # allowMissingColumns=False

AnalysisException: Cannot resolve column name "A" among (B, C, D)

To allow for misaligned columns, set allowMissingColumns=True:

df1.unionByName(df2, allowMissingColumns=True).show()

+----+---+---+----+
|   A|  B|  C|   D|
+----+---+---+----+
|   1|  2|  3|null|
|null|  4|  5|   6|
|null|  7|  8|   9|
+----+---+---+----+

Notice that a column missing from one of the DataFrames (D in df1, A in df2) is filled with null values for the rows that come from that DataFrame.

Published by Isshin Inada

Official PySpark Documentation

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html
