PySpark SQL Functions | split method
PySpark SQL Functions' split(~) method returns a new PySpark column of arrays containing the split tokens, based on the specified delimiter.
Parameters
1. str | string or Column
The column in which to perform the splitting.
2. pattern | string
The regular expression that serves as the delimiter.
3. limit | int | optional
If limit > 0, then the resulting array of split tokens will contain at most limit tokens.
If limit <= 0, then there is no limit on how many splits are performed.
By default, limit=-1.
Return Value
A new PySpark Column of arrays of strings.
Examples
Consider the following PySpark DataFrame:
+-------+
|      x|
+-------+
|    A#A|
|   B##B|
|#C#C#C#|
|   null|
+-------+
Splitting strings by delimiter in PySpark Column
To split the strings in column x by "#", use the split(~) method:
Here, note the following:
the second parameter (pattern) is parsed as a regular expression - we will see an example of this later.
splitting null results in null.
We can also specify the maximum number of splits to perform using the optional parameter limit:
Here, the array of split tokens has length at most 2. This is why we still see the delimiter substring "#" inside the second token - the remainder of the string is left unsplit.
Splitting strings using regular expression in PySpark Column
Consider the following PySpark DataFrame:
+----+
|   x|
+----+
| A#A|
| B@B|
|C#@C|
+----+
To split by either the characters # or @, we can use a regular expression as the delimiter:
Here, the regular expression [#@] denotes either # or @.