Extracting multiple substrings from one string
I have the following string, which I am parsing from another file: "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)". I want to split it into individual values, which I achieve using the .split() function. So I get them as a list:
x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
Now I want to split each of them further into 3 segments and store those in 3 other variables, so I can write them to Excel like this:
a = CHEM1
b = 5
c = GL
for the first element, then I will loop back for the second element:
a = CH3M2
b = 55
c = LB
and finally:
a = CHEM3954114
b = 50
c = KG
I am unsure how to go about that, as I am still new to Python. To the best of my knowledge, I could iterate multiple times with the split function, but I believe there has to be a better way to do it than that.
Thank you.
You should use the re package:
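import re
x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
# Raw string, so the backslashes reach the regex engine unchanged
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")
for x1 in x:
    m = pattern.search(x1)
    if m:
        # name, quantity (converted to int), unit
        a, b, c = m.group(1), int(m.group(2)), m.group(3)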
FOLLOW UP:
The regex topic is enormous and extremely well covered on this site - as Tim has highlighted above. I can share my thinking for this specific case. Essentially, there are 3 groups of characters you want to extract:
- All the characters (letters and numbers) up to, but not including, the (
- The digits after the (
- The letters after the digits extracted in the previous step, up to, but not including, the )
A group is anything included between brackets (). In this specific case it may become confusing because, as stressed above, you also have brackets as part of the string, and those need to be escaped with a \ to be distinguished from the ones used by the regular expression itself.
- The first group is ([^\(]+), which essentially means: match one or more characters which are not ( (the ^ is the negation, and the bracket ( is escaped here for the reasons described above, although inside a character class the escape is optional). Note that characters may include not only letters and numbers but also special characters like $, £, - and so forth. I wanted to keep my options open here, but you can be more laser guided if you need (allowing, for example, only letters, digits and underscores using [\w]+).
- The second group is (\d+), which matches 1 or more (expressed with +) digits (expressed with \d).
- The last group is (.+), which matches any remaining characters, with the final \) making sure that the match stops at the closing bracket.
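To make the three groups easier to see, here is a minimal sketch of the same pattern written with named groups and re.VERBOSE, so each part carries its own comment. The names PATTERN, split_item, name, qty and unit are mine, chosen for illustration. It also uses fullmatch rather than search, which assumes each element contains nothing besides the name and the bracketed part.

```python
import re

# The same three groups as described above, one per line.
PATTERN = re.compile(
    r"""
    (?P<name>[^(]+)   # one or more characters that are not an opening bracket
    \(                # the literal opening bracket (escaped)
    (?P<qty>\d+)      # one or more digits
    (?P<unit>.+)      # whatever follows the digits
    \)                # the literal closing bracket (escaped)
    """,
    re.VERBOSE,
)


def split_item(item):
    """Return (name, quantity, unit), or None if the item does not match."""
    m = PATTERN.fullmatch(item)
    if m is None:
        return None
    return m.group("name"), int(m.group("qty")), m.group("unit")


for item in ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']:
    print(split_item(item))
```

Returning None for a non-matching element is just one option; raising an error may suit you better if every element is expected to match.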
Using re.findall we can try:
import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
for inp in x:
    # findall returns one tuple of captured groups per match
    matches = re.findall(r'(\w+)\((\d+)(\w+)\)', inp)
    print(matches)

# [('CHEM1', '5', 'GL')]
# [('CH3M2', '55', 'LB')]
# [('CHEM3954114', '50', 'KG')]
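Since findall returns every match it finds, the same pattern can probably be applied to the original, unsplit string as well, which skips the .split() step. This is only a sketch, assuming every element looks like the three in the question; the names text and rows are mine.

```python
import re

# The unsplit string from the question
text = "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"

# One (name, quantity, unit) tuple per element, with the quantity as an int
rows = [
    (name, int(qty), unit)
    for name, qty, unit in re.findall(r'(\w+)\((\d+)(\w+)\)', text)
]

for a, b, c in rows:
    print(a, b, c)
```

Each tuple in rows then maps onto the a, b, c variables from the question, one row per element.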
Considering the elements you provided in your question, I assume that '(' cannot appear more than once in an element.
Here is the function I wrote.
def decontruct(chem):
    # Everything before the first '(' is the name
    name = chem[:chem.index('(')]
    # Everything between '(' and the final ')' is the quantity
    qty = chem[chem.index('(') + 1:-1]
    mag, unit = "", ""
    for char in qty:
        if char.isalpha():
            unit += char
        else:
            mag += char
    # If you don't want to convert mag into float, use int(mag) instead of float(mag).
    return {"name": name, "mag": float(mag), "unit": unit}
Usage:
x = ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
for chem in x:
    d = decontruct(chem)
    print(d["name"], d["mag"], d["unit"])