df = spark.createDataFrame([['id_20_30', 10], ['id_40_50', 30]], ['id', 'age'])
df.show()
+--------+---+
|      id|age|
+--------+---+
|id_20_30| 10|
|id_40_50| 30|
+--------+---+

Extracting a specific substring

To extract the first number in each id value, use regexp_extract(~) like so:

from pyspark.sql import functions as F
# Group 1 is the first parenthesized part of the pattern
df.select(F.regexp_extract('id', r'(\d+)', 1)).show()
+----------------------------+
|regexp_extract(id, (\d+), 1)|
+----------------------------+
|                          20|
|                          40|
+----------------------------+

Here, the regular expression (\d+) matches one or more digits. Because regexp_extract(~) uses the first match in the string, the result is 20 and 40 rather than 30 and 50. We set the third argument to 1 to extract the first capture group. This argument matters when the pattern contains multiple capture groups.

Extracting the n-th captured substring

We can use multiple capture groups, each enclosed in parentheses, in regexp_extract(~) like so:

from pyspark.sql import functions as F
# Group 2 is the second parenthesized part of the pattern
df.select(F.regexp_extract('id', r'(\d+)_(\d+)', 2)).show()
+----------------------------------+
|regexp_extract(id, (\d+)_(\d+), 2)|
+----------------------------------+
|                                30|
|                                50|
+----------------------------------+
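
To see how the group index selects a substring, the numbering can be reproduced outside Spark. The following is a minimal sketch that uses Python's built-in re module, not Spark's Java regex engine. The name extract_group is a hypothetical helper written only for this illustration, and the sketch assumes the two engines agree on this simple pattern. Index 0 refers to the whole match, while 1 and 2 refer to the first and second capture groups.

```python
import re

def extract_group(text, pattern, idx):
    """Hypothetical helper that mimics regexp_extract on a single string."""
    match = re.search(pattern, text)  # leftmost match, like regexp_extract
    if match is None:
        # regexp_extract returns an empty string when nothing matches
        return ''
    # A group that did not take part in the match is also returned as ''
    return match.group(idx) or ''

ids = ['id_20_30', 'id_40_50']
pattern = r'(\d+)_(\d+)'

whole = [extract_group(s, pattern, 0) for s in ids]   # entire match
first = [extract_group(s, pattern, 1) for s in ids]   # first capture group
second = [extract_group(s, pattern, 2) for s in ids]  # second capture group
no_match = extract_group('no_digits', pattern, 1)     # pattern does not match

print(whole)   # ['20_30', '40_50']
print(first)   # ['20', '40']
print(second)  # ['30', '50']
```

The values in second line up with the 30 and 50 shown in the Spark output above. The empty-string fallback is there because regexp_extract(~) returns an empty string, not null, for rows where the pattern does not match.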