PySpark DataFrame | dropDuplicates method
dropDuplicates(~) returns a new DataFrame with duplicate rows removed. We can optionally specify columns to check for duplicates.
dropDuplicates(~) is an alias for
Parameters: subset (list of column names, optional). The columns by which to check for duplicates. By default, all columns are checked.
Return value: a new PySpark DataFrame.
Consider the following PySpark DataFrame:
Dropping duplicate rows in PySpark DataFrame
To drop duplicate rows:
+-----+---+
| name|age|
+-----+---+
| Alex| 25|
|  Bob| 30|
|Cathy| 25|
+-----+---+
Note the following:
only the first occurrence is kept while subsequent occurrences are removed.
a new PySpark DataFrame is returned while the original is kept intact.
Dropping duplicate rows for certain columns
To drop duplicate rows based on the
+----+---+
|name|age|
+----+---+
|Alex| 25|
| Bob| 30|
+----+---+
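Again, only one row is kept for each distinct value in the checked columns, and the other duplicate rows are discarded. In this small example the first occurrence is the one kept, but Spark does not guarantee which duplicate survives when the data is spread across multiple partitions.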