Extracting the n-th value of lists in PySpark DataFrame

Last updated: Jul 9, 2022
Tags: PySpark

Consider the following PySpark DataFrame:

rows = [[[10,20]], [[30,40]]]
df = spark.createDataFrame(rows, ['my_col'])
df.show()
+--------+
| my_col|
+--------+
|[10, 20]|
|[30, 40]|
+--------+

Here, my_col contains some lists.

Extracting a single value from arrays in a PySpark Column

To extract the second value of each list in my_col:

import pyspark.sql.functions as F

# df.select(F.col('my_col').getItem(1)) also works!
df_res = df.select(F.col('my_col')[1].alias('second_value'))
df_res.show()
+------------+
|second_value|
+------------+
| 20|
| 40|
+------------+

Here, we are assigning a label to the Column returned by F.col('my_col')[1] using alias(~).

Equivalently, we can use the element_at(~) method instead of using the [~] syntax:

df_res = df.select(F.element_at('my_col',2).alias('second_value'))
df_res.show()
+------------+
|second_value|
+------------+
| 20|
| 40|
+------------+

Note that element_at(~) does not use zero-based indexing; positions start at 1, so the second value in a list is at position 2.

Extracting values from the back

I recommend using element_at(~) rather than the [~] syntax because element_at(~) also lets you extract elements from the back using negative positions:

df_res = df.select(F.element_at('my_col', -1).alias('last_val'))
df_res.show()
+--------+
|last_val|
+--------+
| 20|
| 40|
+--------+

This is not possible using the [~] syntax or the getItem(~) method.

In the case of out-of-bound positions

Specifying an out-of-bound position returns null values:

df_res = df.select(F.element_at('my_col',5))
df_res.show()
+---------------------+
|element_at(my_col, 5)|
+---------------------+
| null|
| null|
+---------------------+

Extracting multiple values from arrays in a PySpark Column

To extract multiple values from arrays in a PySpark Column:

col = F.col('my_col')
df_res = df.select(col[0], col[1])
df_res.show()
+---------+---------+
|my_col[0]|my_col[1]|
+---------+---------+
| 10| 20|
| 30| 40|
+---------+---------+

Here, we are extracting the first and second values of each list.

Equivalently, we could use element_at(~) once again:

col = F.col('my_col')
df_res = df.select(F.element_at(col,1), F.element_at(col,-1))
df_res.show()
+---------------------+----------------------+
|element_at(my_col, 1)|element_at(my_col, -1)|
+---------------------+----------------------+
| 10| 20|
| 30| 40|
+---------------------+----------------------+

Again, you can provide an alias for each column by using the alias(~) method:

col = F.col('my_col')
df_res = df.select(col[0].alias('1st'), col[1].alias('2nd'))
df_res.show()
+---+---+
|1st|2nd|
+---+---+
| 10| 20|
| 30| 40|
+---+---+
Published by Isshin Inada