How can I split a string of a mathematical expressions in python?
I made a program which convert infix to postfix in python. The problem is when I introduce the arguments. If i introduce something like this: (this will be a string)
( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) + ( 56 + ( 95 - 28 ) ) )
it will split it with .split() and the program will work correctly. But I want the user to be able to introduce something like this:
((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )
As you can see I want that the blank spaces can be trivial but the program continue splitting the string by parentheses, integers (not digits) and operands.
I try to solve it with a for
but I don't know how to catch the whole number (73 , 34 ,72) instead one digit by digit (7, 3 , 3 , 4 , 7 , 2)
To sum up, what I want is split a string like ((81 * 6) /42+ (3-1))
into:
[(, (, 81, *, 6, ), /, 42, +, (, 3, -, 1, ), )]
Tree with ast
You could use ast
to get a tree of the expression :
import ast
source = '((81 * 6) /42+ (3-1))'
node = ast.parse(source)
def show_children(node, level=0):
if isinstance(node, ast.Num):
print(' ' * level + str(node.n))
else:
print(' ' * level + str(node))
for child in ast.iter_child_nodes(node):
show_children(child, level+1)
show_children(node)
It outputs :
<_ast.Module object at 0x7f56abbc5490>
<_ast.Expr object at 0x7f56abbc5350>
<_ast.BinOp object at 0x7f56abbc5450>
<_ast.BinOp object at 0x7f56abbc5390>
<_ast.BinOp object at 0x7f56abb57cd0>
81
<_ast.Mult object at 0x7f56abbd0dd0>
6
<_ast.Div object at 0x7f56abbd0e50>
42
<_ast.Add object at 0x7f56abbd0cd0>
<_ast.BinOp object at 0x7f56abb57dd0>
3
<_ast.Sub object at 0x7f56abbd0d50>
1
As @user2357112 wrote in the comments : ast.parse
interprets Python syntax, not mathematical expressions. (1+2)(3+4)
would be parsed as a function call and list comprehensions would be accepted even though they probably shouldn't be considered a valid mathematical expression.
List with a regex
If you want a flat structure, a regex could work :
import re
number_or_symbol = re.compile('(\d+|[^ 0-9])')
print(re.findall(number_or_symbol, source))
# ['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', ')', ')']
It looks for either :
- multiple digits
- or any character which isn't a digit or a space
Once you have a list of elements, you could check if the syntax is correct, for example with a stack
to check if parentheses are matching, or if every element is a known one.
You need to implement a very simple tokenizer for your input. You have the following types of tokens:
- (
- )
- +
- -
- *
- /
- \d+
You can find them in your input string separated by all sorts of white space.
So a first step is to process the string from start to finish, and extract these tokens, and then do your parsing on the tokens, rather than on the string itself.
A nifty way to do this is to use the following regular expression: '\s*([()+*/-]|\d+)'
. You can then:
import re
the_input='(3+(2*5))'
tokens = []
tokenizer = re.compile(r'\s*([()+*/-]|\d+)')
current_pos = 0
while current_pos < len(the_input):
match = tokenizer.match(the_input, current_pos)
if match is None:
raise Error('Syntax error')
tokens.append(match.group(1))
current_pos = match.end()
print(tokens)
This will print ['(', '3', '+', '(', '2', '*', '5', ')', ')']
You could also use re.findall
or re.finditer
, but then you'd be skipping non-matches, which are syntax errors in this case.
If you don't want to use re
module, you can try this:
s="((81 * 6) /42+ (3-1))"
r=[""]
for i in s.replace(" ",""):
if i.isdigit() and r[-1].isdigit():
r[-1]=r[-1]+i
else:
r.append(i)
print(r[1:])
Output:
['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', ')', ')']
It actual would be pretty trivial to hand-roll a simple expression tokenizer. And I'd think you'd learn more that way as well.
So for the sake of education and learning, Here is a trivial expression tokenizer implementation which can be extended. It works based upon the "maximal-much" rule. This means it acts "greedy", trying to consume as many characters as it can to construct each token.
Without further ado, here is the tokenizer:
class ExpressionTokenizer:
def __init__(self, expression, operators):
self.buffer = expression
self.pos = 0
self.operators = operators
def _next_token(self):
atom = self._get_atom()
while atom and atom.isspace():
self._skip_whitespace()
atom = self._get_atom()
if atom is None:
return None
elif atom.isdigit():
return self._tokenize_number()
elif atom in self.operators:
return self._tokenize_operator()
else:
raise SyntaxError()
def _skip_whitespace(self):
while self._get_atom():
if self._get_atom().isspace():
self.pos += 1
else:
break
def _tokenize_number(self):
endpos = self.pos + 1
while self._get_atom(endpos) and self._get_atom(endpos).isdigit():
endpos += 1
number = self.buffer[self.pos:endpos]
self.pos = endpos
return number
def _tokenize_operator(self):
operator = self.buffer[self.pos]
self.pos += 1
return operator
def _get_atom(self, pos=None):
pos = pos or self.pos
try:
return self.buffer[pos]
except IndexError:
return None
def tokenize(self):
while True:
token = self._next_token()
if token is None:
break
else:
yield token
Here is a demo the usage:
tokenizer = ExpressionTokenizer('((81 * 6) /42+ (3-1))', {'+', '-', '*', '/', '(', ')'})
for token in tokenizer.tokenize():
print(token)
Which produces the output:
(
(
81
*
6
)
/
42
+
(
3
-
1
)
)
Quick regex answer:
re.findall(r"\d+|[()+\-*\/]", str_in)
Demonstration:
>>> import re
>>> str_in = "((81 * 6) /42+ (3-1))"
>>> re.findall(r"\d+|[()+\-*\/]", str_in)
['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1',
')', ')']
For the nested parentheses part, you can use a stack to keep track of the level.