PySpark SQL Functions | split method

Last updated: Aug 12, 2023

Tags: PySpark

PySpark SQL Functions' split(~) method returns a new PySpark column of arrays containing the split tokens based on the specified delimiter.
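
For the examples on this page, we assume the standard setup below (a SparkSession named spark and the usual functions import aliased as F):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Reuse an existing session or create a new one
spark = SparkSession.builder.getOrCreate()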

Parameters

1. str | string or Column

The column in which to perform the splitting.

2. pattern | string

The regular expression that serves as the delimiter.

3. limit | int | optional

  • if limit > 0, then the resulting array of split tokens will contain at most limit tokens.

  • if limit <= 0, then there is no limit as to how many splits we perform.

By default, limit=-1.

Return Value

A new PySpark Column of arrays of tokens.

Examples

Consider the following PySpark DataFrame:

df = spark.createDataFrame([("A#A",), ("B##B",), ("#C#C#C#",), (None,)], ["x",])
df.show()
+-------+
| x|
+-------+
| A#A|
| B##B|
|#C#C#C#|
| null|
+-------+

Splitting strings by delimiter in PySpark Column

To split the strings in column x by "#", use the split(~) method:

df.select(F.split("x", "#")).show()
+---------------+
|split(x, #, -1)|
+---------------+
| [A, A]|
| [B, , B]|
| [, C, C, C, ]|
| null|
+---------------+

Here, note the following:

  • the second parameter, the delimiter, is actually parsed as a regular expression - we will see an example of this later.

  • splitting null results in null - see the sketch below for one way to substitute an empty array instead.
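
If you would rather get an empty array than null, one option is to wrap the call with coalesce(~) - note that this sketch is our own addition and not part of split(~) itself:

# Substitute an empty array (cast to array<string> to match split's type) for null rows
df.select(
    F.coalesce(F.split("x", "#"), F.array().cast("array<string>")).alias("tokens")
).show()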

We can also specify the maximum number of splits to perform using the optional parameter limit:

df.select(F.split("x", "#", 2)).show()
+--------------+
|split(x, #, 2)|
+--------------+
| [A, A]|
| [B, #B]|
| [, C#C#C#]|
| null|
+--------------+

Here, the array of split tokens can contain at most 2 elements. This is why we still see the delimiter "#" inside the last, unsplit token.
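
Since split(~) returns an array column, individual tokens can be extracted afterwards, for instance with the Column method getItem(~). As a sketch (out-of-range indices simply yield null):

# Pull out the first and second tokens of each split result
tokens = F.split("x", "#")
df.select(tokens.getItem(0).alias("first"), tokens.getItem(1).alias("second")).show()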

Splitting strings using regular expression in PySpark Column

Consider the following PySpark DataFrame:

df = spark.createDataFrame([("A#A",), ("B@B",), ("C#@C",)], ["x",])
df.show()
+----+
| x|
+----+
| A#A|
| B@B|
|C#@C|
+----+

To split by either the characters # or @, we can use a regular expression as the delimiter:

df.select(F.split("x", "[#@]")).show()
+------------------+
|split(x, [#@], -1)|
+------------------+
| [A, A]|
| [B, B]|
| [C, , C]|
+------------------+

Here, the regular expression [#@] denotes either # or @.
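
Because the delimiter is always treated as a regular expression, metacharacters such as . or | must be escaped to split on them literally. A quick sketch using a hypothetical DataFrame df2:

# "." alone matches any character, so escape it to split on a literal dot
df2 = spark.createDataFrame([("a.b.c",)], ["x"])
df2.select(F.split("x", "\\.")).show()   # gives [a, b, c]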

Published by Isshin Inada