PySpark SQL Functions | regexp_extract method
PySpark SQL Functions' regexp_extract(~) method extracts a substring from a string column using a regular expression.
Parameters
1. str | string or Column
The column whose substrings will be extracted.
2. pattern | string
The regular expression pattern used for substring extraction.
3. idx | int
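The index of the capture group whose value to extract: 1 refers to the first group, 2 to the second, and so on, while 0 refers to the entire match. Consult the examples below for clarification.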
Return Value
A new PySpark Column.
Examples
Consider the following PySpark DataFrame:
+--------+---+
|      id|age|
+--------+---+
|id_20_30| 10|
|id_40_50| 30|
+--------+---+
Extracting a specific substring
To extract the first number in each id value, use regexp_extract(~) like so:
Here, the regular expression (\d+) matches one or more digits, and the first match in each id value is used (20 and 40 in this case). We set the third argument to 1 to indicate that we want the value captured by the first group. This argument matters when the pattern contains multiple capture groups. Note that the extracted values are returned as strings.
Extracting the n-th captured substring
We can use multiple capture groups, each written as a parenthesized subpattern, in the pattern passed to regexp_extract(~) like so:
Here, we set the third argument value to 2 to indicate that we are interested in extracting the values captured by the second group.