Extracting multiple substrings from one string

I have the following string, which I am parsing from another file: "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)". I want to split it into individual values, which I do with the .split() function, so I get them as a list:

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']

Now I want to split each of them further into 3 segments and store those in 3 other variables, so I can write them to Excel like this:

a = CHEM1
b = 5
c = GL

for the first element. Then I will loop back for the second element:

a = CH3M2
b = 55
c = LB

and finally :

a = CHEM3954114
b = 50
c = KG

I am unsure how to go about this, as I am still new to Python. To the best of my knowledge, I could iterate multiple times with the split function, but I believe there has to be a better way to do it.

Thank you.


You should use the re module:

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']

# Raw string, so the backslashes reach the regex engine without invalid-escape warnings
pattern = re.compile(r"([^\(]+)\((\d+)(.+)\)")

for x1 in x:
    m = pattern.search(x1)
    if m:
        a, b, c = m.group(1), int(m.group(2)), m.group(3)
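
If the three values need to be kept for every item (for example, to write them out as rows afterwards), the loop can collect them instead of overwriting a, b and c on each pass. A minimal sketch, assuming every item has the form NAME(digitsUNIT); the names `parse_items` and `rows` are illustrative and not from the question:

```python
import re

# Same pattern as in the answer, written as a raw string.
PATTERN = re.compile(r"([^\(]+)\((\d+)(.+)\)")

def parse_items(items):
    """Return a list of (name, quantity, unit) tuples; items that do not match are skipped."""
    rows = []
    for item in items:
        m = PATTERN.fullmatch(item)  # the whole item must fit the pattern
        if m:
            rows.append((m.group(1), int(m.group(2)), m.group(3)))
    return rows

x = ['CHEM1(5GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']
rows = parse_items(x)
for a, b, c in rows:
    print(a, b, c)
```

Each tuple in `rows` then corresponds to one row of three cells.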

FOLLOW UP:

The regex topic is enormous and extremely well covered on this site - as Tim has highlighted above. I can share my thinking for this specific case. Essentially, there are 3 groups of characters you want to extract:

  1. All the characters (letters and numbers) up to the ( - not included
  2. The digits after the (
  3. The letters after the digits extracted in the previous step - up to the ) - not included.

A group is anything enclosed in brackets (). In this specific case it may become confusing because, as stressed above, you have brackets as part of the string itself, and those need to be escaped with a \ to distinguish them from the ones used for grouping in the regular expression.

  • The first group is ([^\(]+), which essentially means: match one or more characters which are not ( (the ^ is the negation, and the bracket ( is escaped here for the reasons described above; inside square brackets the escape is optional but harmless). Note that the characters may include not only letters and numbers but also special characters like $, £, - and so forth. I wanted to keep my options open here, but you can be more specific if you need to (allowing, for example, only letters, digits and underscores with \w+).
  • The second group is (\d+), which is essentially matching 1 or more (expressed with +) digits (expressed with \d).
  • The last group is (.+), which matches one or more of any character; the final \) makes sure the group stops just before the closing bracket.
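
The three groups described above can also be spelled out one per line. This is only a sketch of the same pattern using re.VERBOSE and named groups; the group names `name`, `qty` and `unit` are my own choice, not part of the original answer:

```python
import re

PATTERN = re.compile(
    r"""
    (?P<name>[^\(]+)   # group 1: everything before the opening bracket
    \(                 # the literal opening bracket
    (?P<qty>\d+)       # group 2: one or more digits
    (?P<unit>.+)       # group 3: whatever follows the digits
    \)                 # the literal closing bracket
    """,
    re.VERBOSE,  # ignore whitespace and allow comments inside the pattern
)

m = PATTERN.search("CH3M2(55LB)")
parts = m.groupdict() if m else {}
print(parts)
```

With named groups, the values can be read as parts["name"], parts["qty"] and parts["unit"] rather than by position.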

Using re.findall we can try:

import re

x = ['CHEM1(5GL)', 'CH3M2(55LB)','CHEM3954114(50KG)']
for inp in x:
    matches = re.findall(r'(\w+)\((\d+)(\w+)\)', inp)
    print(matches)

# [('CHEM1', '5', 'GL')]
# [('CH3M2', '55', 'LB')]
# [('CHEM3954114', '50', 'KG')]
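
Because re.findall returns every non-overlapping match, the same pattern can also be run on the original string directly, without calling .split() first. A small sketch, assuming names and units consist only of word characters (letters, digits, underscore); the variable name `line` is illustrative:

```python
import re

line = "CHEM1(5GL) CH3M2(55LB) CHEM3954114(50KG)"

# One tuple of (name, quantity, unit) per item found in the string.
matches = re.findall(r'(\w+)\((\d+)(\w+)\)', line)

for name, qty, unit in matches:
    print(name, int(qty), unit)  # qty is a string until converted
```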

Considering the elements you provided in your question, I assume that '(' cannot appear more than once in an element.

Here is the function I wrote.

def decontruct(chem):
  name = chem[:chem.index('(')]
  qty = chem[chem.index('(') + 1:-1]
  mag, unit = "", ""
  for char in qty:
      if char.isalpha():
          unit += char
      else:
          mag += char
  # Use int(mag) instead of float(mag) if the magnitudes are always whole numbers.
  return {"name": name, "mag": float(mag), "unit": unit}

Usage:

x = ['CHEM1(5.4GL)', 'CH3M2(55LB)', 'CHEM3954114(50KG)']

for chem in x:
  d = decontruct(chem)
  print(d["name"], d["mag"], d["unit"])