Combining columns into a single column of arrays in PySpark DataFrame
To combine multiple columns into a single column of arrays in a PySpark DataFrame, use one of two methods from the pyspark.sql.functions library: array(~), which combines non-array columns, and concat(~), which combines multiple columns of type array.
Combining columns of non-array values into a single column
Consider the following PySpark DataFrame:
To combine the columns of this DataFrame, including lname, into a single column of arrays, use the array(~) method. Chaining the alias(~) method assigns a label to the combined column returned by array(~).
The array(~) method accepts a variable number of arguments, so we can specify as many columns as we wish to merge:
We can see the data type of the merged column using the printSchema() method:
root
 |-- merged: array (nullable = false)
 |    |-- element: string (containsNull = true)
The output tells us that the merged column is of type array of strings.
Combining columns of arrays into a single column
Consider the following PySpark DataFrame containing two array-type columns:
To combine the two array-type columns, including B, into a single column of arrays, use the concat(~) method: