
Combining columns into a single column of arrays in PySpark DataFrame

Last updated: Aug 12, 2023
Tags: PySpark

To combine multiple columns into a single column of arrays in a PySpark DataFrame:

  • use the array(~) function in the pyspark.sql.functions module to combine non-array columns.

  • use the concat(~) function in the same module to combine multiple array-type columns.

Combining columns of non-array values into a single column

Consider the following PySpark DataFrame:

df = spark.createDataFrame([['Alex','Jobs'], ['Bob','Miley'], ['Cathy','Lee']], ['fname','lname'])
df.show()
+-----+-----+
|fname|lname|
+-----+-----+
| Alex| Jobs|
|  Bob|Miley|
|Cathy|  Lee|
+-----+-----+

To combine the columns fname and lname into a single column of arrays, use the array(~) function:

from pyspark.sql import functions as F
df_merged = df.select(F.array('fname', 'lname').alias('merged'))
df_merged.show()
+------------+
|      merged|
+------------+
|[Alex, Jobs]|
|[Bob, Miley]|
|[Cathy, Lee]|
+------------+

Here:

  • we are using the alias(~) method to assign a label to the combined column returned by array(~).

  • we convert the PySpark Column returned by array(~) into a PySpark DataFrame using the select(~) method so that we can display the new column's content via the show() method.

NOTE

array(~) accepts a variable number of arguments, so we can pass as many columns as we wish to merge:

F.array(col1, col2, col3)

We can see the data type of the merged column using the printSchema() method:

df_merged.printSchema()
root
 |-- merged: array (nullable = false)
 |    |-- element: string (containsNull = true)

The output tells us that merged is an array column whose elements are strings.

Combining columns of arrays into a single column

Consider the following PySpark DataFrame containing two array-type columns:

df = spark.createDataFrame([[['a'],['b']], [['c'],['d','e']]], ['A','B'])
df.show()
+---+------+
|  A|     B|
+---+------+
|[a]|   [b]|
|[c]|[d, e]|
+---+------+

To combine columns A and B into a single column of arrays, use the concat(~) function:

df_merged = df.select(F.concat('A','B'))
df_merged.show()
+------------+
|concat(A, B)|
+------------+
|      [a, b]|
|   [c, d, e]|
+------------+
Published by Isshin Inada