Splitting at underscore in Python and storing the first value
I have a pandas DataFrame df with a column construct_name:
construct_name
aaaa_t1_2
cccc_t4_10
bbbb_g3_3
and so on. I want to split all the names at the underscore and store the first element (aaaa, cccc, etc.) in another column called name.
Expected output
construct_name name
aaaa_t1_2 aaaa
cccc_t4_10 cccc
and so on.
I tried the following
df['construct_name'].map(lambda row:row.split("_"))
and it gives me a list like
[aaaa,t1,2]
[cccc,t4,10]
and so on
But when I do
df['construct_name'].map(lambda row:row.split("_"))[0]
to get the first element of each list, I get an error. Can you suggest a fix? Thanks.
Solution 1:
Just use the vectorised str.split method and use integer indexing on the resulting list to get the first element:
In [228]:
df['first'] = df['construct_name'].str.split('_').str[0]
df
Out[228]:
construct_name first
0 aaaa_t1_2 aaaa
1 cccc_t4_10 cccc
2 bbbb_g3_3 bbbb
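One practical reason to prefer the vectorised accessor, sketched here under the assumption that some rows may be missing: the .str methods pass a missing value through, while a plain lambda calls .split on the missing value itself and raises. The frame and its None entry are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the last entry is missing.
df = pd.DataFrame({"construct_name": ["aaaa_t1_2", "cccc_t4_10", None]})

# .str.split and .str[0] propagate the missing value instead of failing.
df["first"] = df["construct_name"].str.split("_").str[0]

# A plain lambda has no such handling: it calls .split on the missing
# value and raises AttributeError.
try:
    df["construct_name"].map(lambda v: v.split("_")[0])
    lambda_failed = False
except AttributeError:
    lambda_failed = True
```

If the column is known to contain only strings, both approaches give the same result.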
Solution 2:
Take the first element (using [0]) right after the split, inside the lambda, and not after the map:
In [608]: temp['name'] = temp['construct_name'].map(lambda v: v.split('_')[0])
In [609]: temp
Out[609]:
construct_name name
0 aaaa_t1_2 aaaa
1 cccc_t4_10 cccc
2 bbbb_g3_3 bbbb
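To see why the placement of [0] matters, here is a small sketch. The frame is rebuilt from the question's sample rows and assumes the default integer index. Indexing the result of map with [0] looks up the row labelled 0, so it returns one whole list; with a different index it may raise a KeyError instead. Putting [0] inside the lambda takes the first piece of every row.

```python
import pandas as pd

df = pd.DataFrame({"construct_name": ["aaaa_t1_2", "cccc_t4_10", "bbbb_g3_3"]})

# map returns a Series whose values are lists.
split_lists = df["construct_name"].map(lambda v: v.split("_"))

# [0] on that Series is a label lookup: it picks the row labelled 0
# and returns that row's whole list, not the first piece of each row.
first_row = split_lists[0]

# [0] inside the lambda is applied to each list separately.
names = df["construct_name"].map(lambda v: v.split("_")[0])
```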
Solution 3:
split takes an optional argument maxsplit:
>>> construct_name = 'aaaa_t1_2'
>>> name, rest = construct_name.split('_', 1)
>>> name
'aaaa'
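A short sketch of what maxsplit changes, and of one caveat with the two-name unpacking: it assumes every string contains at least one underscore. The string "noseparator" is a made-up example of an input without one.

```python
construct_name = "aaaa_t1_2"

# maxsplit=1 stops after the first underscore, so exactly two pieces come back.
name, rest = construct_name.split("_", 1)

# Without an underscore, split returns a single-element list and the
# two-name unpacking above would raise ValueError. str.partition always
# returns three parts (head, separator, tail), so it is safe to unpack.
head, sep, tail = "noseparator".partition("_")
```

In pandas, the same limit is available as the n argument of Series.str.split, for example df['construct_name'].str.split('_', n=1).str[0].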
Solution 4:
Another way of using the vectorised str.split method is passing the expand=True flag, which then returns one column for each of the split parts.
>>> s = pd.Series( ['aaaa_t1_2', 'cccc_t4_10', 'bbbb_g3_3'], name='construct_name')
>>> s.str.split('_', expand=True) # to see what expand=True does
0 1 2
0 aaaa t1 2
1 cccc t4 10
2 bbbb g3 3
>>> s.str.split('_', expand=True)[0] # what you want, select first elements
0 aaaa
1 cccc
2 bbbb
This is especially useful if you want to keep more than the first value, for example the first and second.
In terms of the general behaviour of the expand=True flag, note that if the input strings do not all have the same number of underscores you can get None values:
>>> s = pd.Series( ['aaaa_t1_2', 'cccc_t4', 'bbbb_g33'], name='construct_name')
>>> s.str.split('_', expand=True)
0 1 2
0 aaaa t1 2
1 cccc t4 None
2 bbbb g33 None
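As a sketch of keeping the first and second values, the expanded frame can be computed once and its columns assigned back. The column names name, tag and has_third are hypothetical, and the data is the ragged example above.

```python
import pandas as pd

df = pd.DataFrame({"construct_name": ["aaaa_t1_2", "cccc_t4", "bbbb_g33"]})

# One column per split part; the columns are labelled 0, 1, 2.
parts = df["construct_name"].str.split("_", expand=True)

# "name" and "tag" are illustrative column names, not from the question.
df["name"] = parts[0]
df["tag"] = parts[1]

# Rows with fewer underscores have a missing third part.
df["has_third"] = parts[2].notna()
```

Checking notna() on the later columns is one way to spot rows that had fewer parts than expected.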
Solution 5:
df['name'] = df['construct_name'].str.split('_').str.get(0)
or
df['name'] = df['construct_name'].str.split('_').apply(lambda x: x[0])