Why does my regular expression return tuples for every character in a string? [duplicate]

I am making a simple project for my math class in which I want to verify if a given function body (string) only contains the allowed expressions (digits, basic trigonometry, +, -, *, /). I am using regular expressions with the re.findall method. My current code:

import re

def valid_expression(exp) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")

    # characters to search for
    chars = r"(cos)|(sin)|(tan)|[\d+/*x)(-]"

    z = re.findall(chars, exp)
    
    return "".join(z) == exp

However, when I test this with any expression, re.findall(chars, exp) returns a list of tuples with three empty strings, ('', '', ''), for every character in the string, unless there is a trig function, in which case it returns a tuple with the trig function and two empty strings.

Ex: cos(x) -> [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

I don't understand why it does this. I have tested the regular expression on regexr.com and it works fine. I get that it uses JavaScript, but normally there should be no difference, right?

Thank you for any explanation and/or fix.

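Short answer: If the result you want is ['cos', '(', 'x', ')'], wrap the whole alternation in a single capture group, for example '(cos|sin|tan|[-+/*x()]|\d+)'. (The - comes first inside the brackets so that it is a literal minus sign rather than part of a range.)

>>> re.findall(r'(cos|sin|tan|[-+/*x()]|\d+)', "cos(x)")
['cos', '(', 'x', ')']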
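
Building on the short answer, here is one way the question's valid_expression could be sketched. It is only an illustration, not the only fix: the name TOKEN is made up for this example, and re.fullmatch is swapped in for the findall-and-join approach so that no list has to be rebuilt. Note that it only checks that the string is made of allowed tokens; it does not check that the expression is well formed (balanced parentheses, operator placement, and so on).

```python
import re

# Hypothetical helper pattern: one allowed token is a trig name,
# a run of digits, or a single operator / variable / parenthesis.
# (?:...) is a non-capturing group, so it adds no capture groups.
TOKEN = r"(?:cos|sin|tan|\d+|[-+/*x()])"


def valid_expression(exp: str) -> bool:
    # Remove spaces, as in the question's code.
    exp = exp.replace(" ", "")
    # The whole string must be one or more allowed tokens.
    return re.fullmatch(TOKEN + "+", exp) is not None


print(valid_expression("cos(x) + 12*x"))  # True
print(valid_expression("exp(x)"))         # False
```

Because the pattern uses + rather than *, an empty string is rejected as well.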

From the documentation for findall:

The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

For 'cos(x)', you start with ('cos', '', '') because cos matched, but neither sin nor tan matched. For each of (, x, and ), none of the three capture groups matched, although the bracket expression did. Since it isn't inside a capture group, anything it matches isn't included in your output.
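
To see how the number of capture groups changes the shape of the result, here is a small comparison on the same input. The variable names are just for illustration; the first pattern is the one from the question.

```python
import re

text = "cos(x)"

# Three capturing groups: one tuple per match, one slot per group.
# Characters matched by the bracket expression leave all three slots empty.
three_groups = re.findall(r"(cos)|(sin)|(tan)|[\d+/*x)(-]", text)

# A non-capturing group (?:...) means there are no capturing groups at all,
# so findall returns the whole match each time.
no_groups = re.findall(r"(?:cos|sin|tan)|[\d+/*x)(-]", text)

# Exactly one capturing group around everything: a flat list of strings.
one_group = re.findall(r"(cos|sin|tan|[\d+/*x)(-])", text)

print(three_groups)  # [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
print(no_groups)     # ['cos', '(', 'x', ')']
print(one_group)     # ['cos', '(', 'x', ')']
```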

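As an aside, [\d+/*x)(-] doesn't match multidigit integers as a single token. Inside [...], + is a literal plus sign rather than a quantifier, so \d+ there is not the regular expression "one or more digits"; it is just \d and + listed as two members of the class. (In Python's re module, \d keeps its meaning inside [...], so digits are still accepted.) As a result, the class matches exactly one character at a time: a digit, or one of + / * x ) ( -.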
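
A small check of the multidigit point, assuming Python's re module: the bracket expression consumes one character per match, so 12 comes back as two separate matches, while \d+ placed outside the brackets returns it whole. The sample string and variable names are made up for the example.

```python
import re

text = "12+x"

# Inside [...], the + is a literal plus sign, so every match is one character.
per_char = re.findall(r"[\d+/*x)(-]", text)

# Outside the brackets, \d+ means "one or more digits" and keeps 12 together.
per_token = re.findall(r"\d+|[-+/*x()]", text)

print(per_char)   # ['1', '2', '+', 'x']
print(per_token)  # ['12', '+', 'x']
```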
