Why does my regular expression return tuples for every character in a string? [duplicate]
I am making a simple project for my math class in which I want to verify if a given function body (string) only contains the allowed expressions (digits, basic trigonometry, +, -, *, /).
I am using regular expressions with the re.findall method.
My current code:
import re
def valid_expression(exp) -> bool:
    # remove white spaces
    exp = exp.replace(" ", "")
    # characters to search for
    chars = r"(cos)|(sin)|(tan)|[\d+/*x)(-]"
    z = re.findall(chars, exp)
    return "".join(z) == exp
However, when I test this on any expression, re.findall(chars, exp) returns a list of tuples with 3 empty strings, ('', '', ''), for every character in the string, unless there is a trig function, in which case it returns a tuple with the trig function and two empty strings.
Ex: cos(x) -> [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
I don't understand why it does this. I have tested the regular expression on regexr.com and it works fine. I get that it uses JavaScript, but normally there should be no difference, right?
Thank you for any explanation and/or fix.
Short answer: If the result you want is ['cos', '(', 'x', ')'], you need something like '(cos|sin|tan|[-+/*x()]|\d+)':
>>> re.findall(r'(cos|sin|tan|[-+/*x()]|\d+)', "cos(x)")
['cos', '(', 'x', ')']
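Applied to the function from the question, one option is to drop the capture groups entirely so that findall returns plain strings. The following is a minimal sketch rather than the only possible fix. The name TOKEN is introduced here for illustration, and the check only verifies that the string is made of allowed tokens, not that it is a well-formed expression.

```python
import re

# Hypothetical name; the pattern has no capture groups,
# so findall returns a flat list of strings.
TOKEN = r"cos|sin|tan|\d+|[-+/*x()]"

def valid_expression(exp: str) -> bool:
    # Remove spaces, as in the original function.
    exp = exp.replace(" ", "")
    # Collect every allowed token that findall can pick out, in order.
    tokens = re.findall(TOKEN, exp)
    # If any character was skipped, the joined tokens differ from the input.
    return "".join(tokens) == exp

print(valid_expression("2 * cos(x) + 10"))  # True
print(valid_expression("log(x)"))           # False
```

In the second call, findall silently skips l, o and g, so the joined tokens give "(x)" rather than "log(x)" and the comparison fails.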
From the documentation for findall:
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
For 'cos(x)', you start with ('cos', '', '') because cos matched, but neither sin nor tan matched. For each of '(', 'x', and ')', none of the three capture groups matched, although the bracket expression did. Since it isn't inside a capture group, anything it matches isn't included in your output.
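The effect of the group count on the shape of the result can be checked directly. This is a small sketch using a shortened character class, [()x], chosen only to keep the patterns readable; the variable names are illustrative.

```python
import re

text = "cos(x)"

# No groups: a list of whole-match strings.
no_groups = re.findall(r"cos|sin|tan|[()x]", text)

# Exactly one group: a list of the strings matched by that group.
one_group = re.findall(r"(cos|sin|tan|[()x])", text)

# Several groups: a list of tuples with one slot per group. A group that
# took no part in a given match is reported as an empty string.
many_groups = re.findall(r"(cos)|(sin)|(tan)|[()x]", text)

# Non-capturing groups (?:...) leave the shape of the result unchanged.
non_capturing = re.findall(r"(?:cos|sin|tan)|[()x]", text)

print(no_groups)      # ['cos', '(', 'x', ')']
print(one_group)      # ['cos', '(', 'x', ')']
print(many_groups)    # [('cos', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]
print(non_capturing)  # ['cos', '(', 'x', ')']
```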
As an aside, [\d+/*x)(-] never matches a multidigit integer as a single token. Inside [...], \d still means "any digit", but + is a literal plus sign rather than a quantifier, so the class matches digits one at a time. As a result, it matches exactly one character: either a digit or one of the following seven characters:
+
/
*
x
)
(
-
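A quick check shows how Python's re module treats \d and + inside and outside a character class. This is only an illustrative sketch; the sample strings and variable names are made up for the demonstration.

```python
import re

# Inside the brackets, + is a literal character and \d matches one digit.
inside = re.findall(r"[\d+]", "12+34")

# Outside the brackets, + is a quantifier, so \d+ takes a whole run of digits.
outside = re.findall(r"\d+|[+]", "12+34")

# The letter d itself is not matched by \d inside a character class.
letter_d = re.findall(r"[\d+]", "d")

print(inside)    # ['1', '2', '+', '3', '4']
print(outside)   # ['12', '+', '34']
print(letter_d)  # []
```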