PySpark DataFrame | foreach method
The foreach(~) method applies the given function to each row of the DataFrame, passing each row in as a Row object.
The foreach(~) method has the following limitations:

- The supplied function is invoked on the worker nodes rather than in the driver program. This means that if we call print(~) inside the function, we will not see the printed results in our session or notebook, because the output is written on the worker nodes instead.
- Row objects are read-only, so you cannot update the values of a row.
Given these limitations, the foreach(~) method is mainly used for logging information about each row, either locally on the worker or to an external database.
Parameters

1. f | function

The function to apply to each row (Row) of the DataFrame.

Return value

Nothing is returned.
Consider the following PySpark DataFrame:
To iterate over each row and apply some custom function:
Here, row.name is printed on the worker nodes, so you will not see any output in the driver program.