How to split a string by commas positioned outside of parenthesis?

Solution 1:

One way to do it is to use findall with a regex that greedily matches things that can go between separators. eg:

>>> s = "Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley, Jane Doe (Jane Doe)"
>>> r = re.compile(r'(?:[^,(]|\([^)]*\))+')
>>> r.findall(s)
['Wilbur Smith (Billy, son of John)', ' Eddie Murphy (John)', ' Elvis Presley', ' Jane Doe (Jane Doe)']

The regex above matches one or more:

  • non-comma, non-open-paren characters
  • strings that start with an open paren, contain 0 or more non-close-parens, and then a close paren

One quirk about this approach is that adjacent separators are treated as a single separator. That is, you won't see an empty string. That may be a bug or a feature depending on your use-case.

Also note that regexes are not suitable for cases where nesting is a possibility. So for example, this would split incorrectly:

"Wilbur Smith (son of John (Johnny, son of James), aka Billy), Eddie Murphy (John)"

If you need to deal with nesting your best bet would be to partition the string into parens, commas, and everthing else (essentially tokenizing it -- this part could still be done with regexes) and then walk through those tokens reassembling the fields, keeping track of your nesting level as you go (this keeping track of the nesting level is what regexes are incapable of doing on their own).

Solution 2:

s = re.split(r',\s*(?=[^)]*(?:\(|$))', x) 

The lookahead matches everything up to the next open-parenthesis or to the end of the string, iff there's no close-parenthesis in between. That ensures that the comma is not inside a set of parentheses.

Solution 3:

I think the best way to approach this would be to use python's built-in csv module.

Because the csv module only allows a one character quotechar, you would need to do a replace on your inputs to convert () to something like | or ". Then make sure you are using an appropriate dialect and off you go.

Solution 4:

An attempt on human-readable regex:

import re

regex = re.compile(r"""
    # name starts and ends on word boundary
    # no '(' or commas in the name
    (?P<name>\b[^(,]+\b)
    \s*
    # everything inside parentheses is a role
    (?:\(
      (?P<role>[^)]+)
    \))? # role is optional
    """, re.VERBOSE)

s = ("Wilbur Smith (Billy, son of John), Eddie Murphy (John), Elvis Presley,"
     "Jane Doe (Jane Doe)")
print re.findall(regex, s)

Output:

[('Wilbur Smith', 'Billy, son of John'), ('Eddie Murphy', 'John'), 
 ('Elvis Presley', ''), ('Jane Doe', 'Jane Doe')]