Regex match strings divided by 'and'

I need to parse a string to get the number and position from each part of it, for example:

2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses

Currently I am using code like this which returns list of tuples, like [('2', 'Better Developers'), ('3', 'Testers')]:

import re

def parse_workers_list_from_str(string_value: str) -> list[tuple[str, str]]:
    result: list[tuple[str, str]] = []
    if string_value:
        # Split on the literal 'and', then pull the optional number and the name out of each part
        for part in string_value.split('and'):
            result.append(re.findall(r'(?: *)(\d+|)(?: |)([\w ]+)', part.strip())[0])
    return result

Can I do it without .split() using only regex?


Solution 1:

If you want to handle multiple 'and' separators, consider the PyPI regex module, which supports branch reset groups, i.e. (?|...). Subpatterns declared within each alternative of this construct start over from the same group index, so both alternatives below capture into groups 1 and 2.

(?|(\d*) *(\b[a-z]+(?: [a-z]+)*?)(?= and )|(?<= and )(\d*) *(\b[a-z]+(?: [a-z]+)*))

import regex
rx = regex.compile(r'(?|(\d*) *(\b[a-z]+(?: [a-z]+)*?)(?= and )|(?<= and )(\d*) *(\b[a-z]+(?: [a-z]+)*))', regex.I)

arr = ['2 Better Developers and 3 Testers', '5 Mechanics and chef', 'medic and 3 nurses', '5 foo', '5 Mechanics and 2 chefs and tester']
for s in arr: print (rx.findall(s), ':', s)

Output:

[('2', 'Better Developers'), ('3', 'Testers')] : 2 Better Developers and 3 Testers
[('5', 'Mechanics'), ('', 'chef')] : 5 Mechanics and chef
[('', 'medic'), ('3', 'nurses')] : medic and 3 nurses
[] : 5 foo
[('5', 'Mechanics'), ('2', 'chefs'), ('', 'tester')] : 5 Mechanics and 2 chefs and tester
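
If installing the third-party regex module is not an option, a rough equivalent can be sketched with the standard re module. Without branch reset the two alternatives get separate group numbers (1-2 and 3-4), so the code has to pick whichever pair actually matched. The function name parse_with_re is made up for this illustration:

```python
import re

# The same two alternatives as above, joined with a plain "|".
# Without branch reset, the groups are numbered 1-2 and 3-4.
rx = re.compile(
    r'(\d*) *(\b[a-z]+(?: [a-z]+)*?)(?= and )'
    r'|(?<= and )(\d*) *(\b[a-z]+(?: [a-z]+)*)',
    re.I,
)

def parse_with_re(s):
    """Hypothetical helper: return (number, name) pairs from s."""
    pairs = []
    for m in rx.finditer(s):
        if m.group(2) is not None:
            # First alternative matched (text followed by " and ")
            pairs.append((m.group(1), m.group(2)))
        else:
            # Second alternative matched (text preceded by " and ")
            pairs.append((m.group(3), m.group(4)))
    return pairs

print(parse_with_re('5 Mechanics and 2 chefs and tester'))
# [('5', 'Mechanics'), ('2', 'chefs'), ('', 'tester')]
```

This works because the lookbehind (?<= and ) has a fixed width, which is the only kind the re module accepts.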

Earlier answer, posted for the original question, which contained only a single 'and':

You may use this regex:

(\d*) *(\S+(?: \S+)*?) and (\d*) *(\S+(?: \S+)*)

Here we match the word 'and' surrounded by a single space on either side. Before and after it we match using this sub-pattern (the copy after 'and' drops the trailing ?, so it is greedy):

(\d*) *(\S+(?: \S+)*?)

This matches zero or more digits, followed by zero or more spaces, followed by one or more runs of non-whitespace characters separated by single spaces.

Code:

import re
arr = ['2 Better Developers and 3 Testers', '5 Mechanics and chef', 'medic and 3 nurses', '5 foo']

rx = re.compile(r'(\d*) *(\S+(?: \S+)*?) and (\d*) *(\S+(?: \S+)*)')

for s in arr: print (rx.findall(s))

Output:

[('2', 'Better Developers', '3', 'Testers')]
[('5', 'Mechanics', '', 'chef')]
[('', 'medic', '3', 'nurses')]
[]
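
The 4-tuples above differ from the list of (number, name) pairs the question asks for. A minimal sketch of reshaping one match into that format follows; the function name pairs_from_single_and is hypothetical, and it assumes the input contains at most one ' and ':

```python
import re

rx = re.compile(r'(\d*) *(\S+(?: \S+)*?) and (\d*) *(\S+(?: \S+)*)')

def pairs_from_single_and(s):
    """Hypothetical helper: split the four groups into two (number, name) pairs."""
    m = rx.search(s)
    if m is None:
        # No ' and ' in the input, same as the empty findall() result above
        return []
    n1, name1, n2, name2 = m.groups()
    return [(n1, name1), (n2, name2)]

print(pairs_from_single_and('2 Better Developers and 3 Testers'))
# [('2', 'Better Developers'), ('3', 'Testers')]
```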

Solution 2:

With re.MULTILINE you can do everything in one regex, which also splits the parts correctly:

>>> s = """2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses"""
>>> re.findall(r"\s*(\d*)\s*(.+?)(?:\s+and\s+|$)", s, re.MULTILINE)
[('2', 'Better Developers'), ('3', 'Testers'), ('5', 'Mechanics'), ('', 'chef'), ('', 'medic'), ('3', 'nurses')]

With explanation and conversion of empty '' to 1:

import re

s = """2 Better Developers and 3 Testers
5 Mechanics and chef
medic and 3 nurses"""

results = re.findall(r"""
    # Capture the number if one exists
    (\d*)
    # Remove spacing between number and text
    \s*
    # Capture the text
    (.+?)
    # Attempt to match the word 'and' or the end of the line
    (?:\s+and\s+|$\n?)
    """, s, re.MULTILINE|re.VERBOSE)

results = [(int(n or 1), t.title()) for n, t in results]
assert results == [(2, 'Better Developers'), (3, 'Testers'), (5, 'Mechanics'), (1, 'Chef'), (1, 'Medic'), (3, 'Nurses')]
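
For a single input string, as in the question, the same pattern can be wrapped in a function that returns the original list-of-tuples format without .split(). This is only a sketch and the function name parse_workers is made up here. Because the pattern requires whitespace around 'and', a word that merely contains those letters (such as 'Handyman') is left intact, which a plain split('and') would not do:

```python
import re

# Same pattern as the one-line version above; re.MULTILINE is not needed
# when the input is a single line.
WORKER_RX = re.compile(r"\s*(\d*)\s*(.+?)(?:\s+and\s+|$)")

def parse_workers(string_value):
    """Hypothetical helper: return (number, name) pairs as strings."""
    return WORKER_RX.findall(string_value or "")

print(parse_workers("5 Mechanics and chef"))
# [('5', 'Mechanics'), ('', 'chef')]
print(parse_workers("Handyman and 2 chefs"))
# [('', 'Handyman'), ('2', 'chefs')]
```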