PySpark DataFrame | foreach method
PySpark DataFrame's foreach(~) method loops over each row of the DataFrame as a Row object and applies the given function to the row.
The following are some limitations of foreach(~):
- the foreach(~) method in Spark is invoked in the worker nodes instead of the Driver program. This means that if we perform a print(~) inside our function, we will not be able to see the printed results in our session or notebook because the results are printed in the worker nodes instead.
- rows are read-only and so you cannot update the values of the rows.
Given these limitations, the foreach(~) method is mainly used for logging some information about each row to the local machine or to an external database.
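For example, here is a minimal sketch of logging each row on the worker's local machine using Python's standard logging module. The function name log_row is just for illustration, and a DataFrame df with name and age columns (like the one in the Examples section below) is assumed:

import logging

def log_row(row):
    # Runs on the worker nodes; the output is written to the executor's
    # local logs rather than to the driver session.
    logging.warning("name=%s age=%s", row["name"], row["age"])

df.foreach(log_row)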
Parameters
1. f | function
The function to apply to each row (Row) of the DataFrame.
Return Value
Nothing is returned.
Examples
Consider the following PySpark DataFrame:
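The construction of this DataFrame is not shown on this page; a minimal sketch, assuming an active SparkSession, is:

from pyspark.sql import SparkSession

# Assumed setup: obtain (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

# Build the sample DataFrame with the rows shown below
df = spark.createDataFrame([["Alex", 20], ["Bob", 30]], ["name", "age"])
df.show()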
+----+---+
|name|age|
+----+---+
|Alex| 20|
| Bob| 30|
+----+---+
To iterate over each row and apply some custom function:
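A minimal sketch of such a custom function, assuming the df created above (the function name print_name is just for illustration; it simply prints the name field of each Row):

def print_name(row):
    # This runs on the worker nodes, not on the driver
    print(row.name)

# Apply the function to every Row of the DataFrame
df.foreach(print_name)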
Here, row.name is printed on the worker nodes, so you will not see any output in the driver program.