Issues with extracting substrings of a string in Python Pandas Dataframe

I have an expression like ( one row of a column, say 'old_col' in pandas data frame) ( Shown the top two rows from a column of the dataframe )

abcd_6.9_uuu ghaha_12.8 _sksks

abcd_5.2_uuu ghaha_13.9 _sksks

I was trying to use the str.extract on the dataframe to get the two floating numbers. However I find two issues, only the first one is picked up( 6.9 from first row and 5.2 from second row )

1. So how can I do that?

2. Also how can I make the extract method general to pick numbers upto any digits ( 5.7or 12.9 irrespective)

I am using: df['newcol'] = df['old_col'].str.extract('(_\d.\d)')


To get more than one digit,

df['col'].str.extract('(\_\d+\.\d+)')

    col
0   _6.9
1   _15.9

To get all occurrences, use str.extractall

df['col'].str.extractall('(\_\d+\.\d+)')

            col
    match   
0   0      _6.9
    1     _12.8
1   0     _15.9
    1     _13.9

To assign back to df:

s = df['col'].str.extractall('(\_\d+\.\d+)')['col']
df['new_col'] = s.groupby(s.index.get_level_values(0)).agg(list)

You can use Series.str.findall:

import pandas as pd
df=pd.DataFrame({'old_col':['abcd_6.9_uuu ghaha_12.8 _sksks','abcd_5.2_uuu ghaha_13.9 _sksks']})
df['newcol'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?')
df['newcol_str'] = df['old_col'].str.findall(r'\d+(?:\.\d+)?').str.join(', ')
# >>> df
#                           old_col       newcol newcol_str
# 0  abcd_6.9_uuu ghaha_12.8 _sksks  [6.9, 12.8]  6.9, 12.8
# 1  abcd_5.2_uuu ghaha_13.9 _sksks  [5.2, 13.9]  5.2, 13.9

Regex details:

  • \d+(?:\.\d+)? - one or more digits followed with an optional occurrence of a . and one or more digits
  • \d+\.\d+ would match only float values where the . is obligatory between at least two digits.

Since .str.findall(r'\d+(?:\.\d+)?') returns a list, the newcol column contains lists, with .str.join(', '), the newcol_str column contains strings with found matches merged.

If you must check if the numbers occur between underscores add them on both sides of the pattern and wrap the number matching pattern with parentheses:

.str.findall(r'_(\d+(?:\.\d+)?)_')