Parsing strings of user input using the cparse library from git
> Failing that, is there another way to parse strings of user input and use them as functions?
I would like to answer this part of your question because I have the feeling that a full-featured C parser might be a bit too heavy for what you intend. (Btw., once you get the C parser running – how would you process its output? Dynamic linking?)
Instead, I want to show you how to build a simple calculator (with a recursive descent parser) on your own. For the documentation of the techniques I will use, I warmly recommend Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman (better known as the "Dragon Book"), especially Chapter 4.
The Tiny Calculator Project
In the following I describe my sample solution part by part.
Language Design
Before starting to write a compiler or interpreter, it's reasonable to define a language which shall be accepted. I want to use a very limited sub-set of C: expressions consisting of
- C-like floating point numbers (constants)
- C-like identifiers (variables)
- unary operators + and -
- binary operators +, -, *, and /
- parentheses ()
- semicolons ; (to mark the end of an expression, mandatory).
Whitespace (including line breaks) will simply be ignored but may be used to separate things as well as to improve human readability. I did not consider C or C++ style comments (or a lot of other sugar) in order to keep the source code as minimal as possible. (Even so, I ended up with nearly 500 lines.)
The OP's specific example fits into this language once a semicolon is added:
a*(b+3);
There will be only one supported type: double. Thus, I need neither type annotations nor declarations, which makes things easier.
Before I started to sketch the grammar of this language, I thought about the "target" of compilation and decided to make classes for...
An Abstract Syntax Tree
At first – a class to store variables:
// standard headers needed for the complete, single-file sample
#include <cctype>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// storage of variables
class Var {
private:
double _value;
public:
Var(): _value() { }
~Var() = default;
double get() const { return _value; }
void set(double value) { _value = value; }
};
The variables provide storage for a value but not for the identifier. The latter is stored separately as it is not needed for the actual usage of a variable but only to find it by name:
typedef std::map<std::string, Var> VarTable;
The usage of a std::map
automates the creation of variables. As known from many high-level languages, a variable starts to exist when it is accessed for the first time.
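A tiny sketch of this behaviour (my own example, not part of the sample):

VarTable varTable;
varTable["a"].set(4.0);          // "a" springs into existence on first access
double a = varTable["a"].get();  // 4.0
double b = varTable["b"].get();  // "b" is created here as well, default-initialized to 0.0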
The Abstract Syntax Tree is

> a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code.

I took this text from the Wikipedia article linked above – I couldn't have said it shorter. The following are my classes for the AST:
// abstract syntax tree -> storage of "executable"
namespace AST {
class Expr {
protected:
Expr() = default;
public:
virtual ~Expr() = default;
public:
virtual double solve() const = 0;
};
class ExprConst: public Expr {
private:
double _value;
public:
ExprConst(double value): Expr(), _value(value) { }
virtual ~ExprConst() = default;
virtual double solve() const { return _value; }
};
class ExprVar: public Expr {
private:
Var *_pVar;
public:
ExprVar(Var *pVar): Expr(), _pVar(pVar) { }
virtual ~ExprVar() = default;
virtual double solve() const { return _pVar->get(); }
};
class ExprUnOp: public Expr {
protected:
Expr *_pArg1;
protected:
ExprUnOp(Expr *pArg1): Expr(), _pArg1(pArg1) { }
virtual ~ExprUnOp() { delete _pArg1; }
};
class ExprUnOpNeg: public ExprUnOp {
public:
ExprUnOpNeg(Expr *pArg1): ExprUnOp(pArg1) { }
virtual ~ExprUnOpNeg() = default;
virtual double solve() const
{
return -_pArg1->solve();
}
};
class ExprBinOp: public Expr {
protected:
Expr *_pArg1, *_pArg2;
protected:
ExprBinOp(Expr *pArg1, Expr *pArg2):
Expr(), _pArg1(pArg1), _pArg2(pArg2)
{ }
virtual ~ExprBinOp() { delete _pArg1; delete _pArg2; }
};
class ExprBinOpAdd: public ExprBinOp {
public:
ExprBinOpAdd(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
virtual ~ExprBinOpAdd() = default;
virtual double solve() const
{
return _pArg1->solve() + _pArg2->solve();
}
};
class ExprBinOpSub: public ExprBinOp {
public:
ExprBinOpSub(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
virtual ~ExprBinOpSub() = default;
virtual double solve() const
{
return _pArg1->solve() - _pArg2->solve();
}
};
class ExprBinOpMul: public ExprBinOp {
public:
ExprBinOpMul(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
virtual ~ExprBinOpMul() = default;
virtual double solve() const
{
return _pArg1->solve() * _pArg2->solve();
}
};
class ExprBinOpDiv: public ExprBinOp {
public:
ExprBinOpDiv(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
virtual ~ExprBinOpDiv() = default;
virtual double solve() const
{
return _pArg1->solve() / _pArg2->solve();
}
};
} // namespace AST
Thus, using the AST classes, the representation of the sample a*(b+3) would be
VarTable varTable;
AST::Expr *pExpr
  = new AST::ExprBinOpMul(
      new AST::ExprVar(&varTable["a"]),
      new AST::ExprBinOpAdd(
        new AST::ExprVar(&varTable["b"]),
        new AST::ExprConst(3)));
Note:
There is no class derived from Expr
to represent the parentheses ()
because this is simply not necessary. The processing of parentheses is considered when building the tree itself. Normally, the operators with higher precedence become children of the operators with lower precedence. As a result, the former are computed before the latter. In the above example, the instance of ExprBinOpAdd
is a child of the instance of ExprBinOpMul
(although the precedence of multiplication is higher than that of addition), which results from the proper consideration of the parentheses.
Besides storing a parsed expression, this tree can be used to compute the expression by calling the Expr::solve()
method of the root node:
double result = pExpr->solve();
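Continuing the hand-built example from above (a sketch of mine, with arbitrarily chosen values):

varTable["a"].set(4.0);
varTable["b"].set(2.0);
std::cout << pExpr->solve() << std::endl; // prints 20, i.e. 4 * (2 + 3)
delete pExpr;                             // deleting the root deletes the whole tree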
With the back-end of our Tiny Calculator in place, the next thing is the front-end.
A Grammar
A formal language is best described by a grammar.
program
: expr Semicolon program
| <empty>
;
expr
: addExpr
;
addExpr
: mulExpr addExprRest
;
addExprRest
: addOp mulExpr addExprRest
| <empty>
;
addOp
: Plus | Minus
;
mulExpr
: unaryExpr mulExprRest
;
mulExprRest
: mulOp unaryExpr mulExprRest
| <empty>
;
mulOp
: Star | Slash
;
unaryExpr
: unOp unaryExpr
| primExpr
;
unOp
: Plus | Minus
;
primExpr
: Number
| Id
| LParen expr RParen
;
with start symbol program
.
The rules contain
- terminal symbols (starting with an uppercase letter),
- non-terminal symbols (starting with a lowercase letter),
- a colon (:) to separate the left side from the right side (the non-terminal on the left side may expand to the symbols on the right side),
- vertical bars (|) to separate alternatives, and
- an <empty> symbol for expanding to nothing (used to terminate recursions).
From the terminal symbols, I will derive the tokens for the scanner.
The non-terminal symbols will be transformed into the parser functions.
The separation of addExpr and mulExpr has been done intentionally. Thus, the precedence of multiplicative operators over additive operators is "burnt into" the grammar itself. Obviously, floating point constants, variable identifiers, and expressions in parentheses (accepted in primExpr) have the highest precedence.
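To illustrate how this precedence falls out of the grammar, here is an abbreviated left-most derivation of the sample a*(b+3); (my own sketch; expansions to <empty> are folded away):

program
 => expr Semicolon
 => addExpr Semicolon
 => mulExpr Semicolon                                  (addExprRest => <empty>)
 => unaryExpr mulExprRest Semicolon
 => Id mulOp unaryExpr Semicolon                       (a; the trailing mulExprRest => <empty>)
 => Id Star LParen expr RParen Semicolon               (* and the parenthesized sub-expression)
 => Id Star LParen Id Plus Number RParen Semicolon     (b + 3, derived the same way via addExpr)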
The rules contain only right recursions. This is a requirement for recursive-descent parsers (as I learnt theoretically from the Dragon Book and practically in the debugger, until I fully understood the reason why). Implementing a left-recursive rule in a recursive-descent parser results in a non-terminating recursion which, in turn, ends up with a stack overflow.
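To sketch what would go wrong (a hypothetical illustration of mine, not part of the sample): if the additive rule were written left-recursively as addExpr : addExpr addOp mulExpr | mulExpr and transformed naively into a parse function, that function would call itself before consuming a single token:

// hypothetical, naive transformation of a left-recursive rule
AST::Expr* parseAddExpr()
{
    AST::Expr *pExpr1 = parseAddExpr(); // immediate self-call: no input is consumed,
                                        // so the recursion never terminates and
                                        // eventually overflows the stack
    /* ...rest never reached... */
    return pExpr1;
}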
The Scanner
It's usual to split a compiler into a scanner and a parser.
The scanner processes the input character stream and groups characters together into tokens. The tokens are used as terminal symbols in the parser.
For the output of tokens, I made a class. In my professional projects, it additionally stores the exact file position to denote its origin. This is convenient for tagging created objects with source code references as well as for any output of error messages and debugging info. (...left out here to keep it as minimal as possible...)
// token class - produced in scanner, consumed in parser
struct Token {
// tokens
enum Tk {
Plus, Minus, Star, Slash, LParen, RParen, Semicolon,
Number, Id,
EOT, Error
};
// token number
Tk tk;
// lexem as floating point number
double number;
// lexem as identifier
std::string id;
// constructors.
explicit Token(Tk tk): tk(tk), number() { }
explicit Token(double number): tk(Number), number(number) { }
explicit Token(const std::string &id): tk(Id), number(), id(id) { }
};
There are two enumerators for special tokens:
- EOT ... end of text (marks the end of input)
- Error ... generated for any character which does not fit into any other token.
The tokens are used as output of the actual scanner:
// the scanner - groups characters to tokens
class Scanner {
private:
std::istream &_in;
public:
// constructor.
Scanner(std::istream &in): _in(in) { }
/* groups characters to next token until the first character
* which does not match (or end-of-file is reached).
*/
Token scan()
{
char c;
// skip white space
do {
if (!(_in >> c)) return Token(Token::EOT);
} while (isspace(c));
// classify character and build token
switch (c) {
case '+': return Token(Token::Plus);
case '-': return Token(Token::Minus);
case '*': return Token(Token::Star);
case '/': return Token(Token::Slash);
case '(': return Token(Token::LParen);
case ')': return Token(Token::RParen);
case ';': return Token(Token::Semicolon);
default:
if (isdigit(c)) {
_in.unget(); double value; _in >> value;
return Token(value);
} else if (isalpha(c) || c == '_') {
std::string id(1, c);
while (_in.get(c)) { // get() (not >>) so that whitespace ends an identifier
if (isalnum(c) || c == '_') id += c;
else { _in.unget(); break; }
}
return Token(id);
} else {
_in.unget();
return Token(Token::Error);
}
}
}
};
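To get a feeling for what the scanner produces, here is a small sketch of mine (assuming the class above) that feeds it the OP's input:

std::istringstream in("a*(b+3);");
Scanner scanner(in);
for (Token token = scanner.scan(); token.tk != Token::EOT; token = scanner.scan()) {
    // yields, in order: Id("a"), Star, LParen, Id("b"), Plus, Number(3), RParen, Semicolon
}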
The scanner is used in the parser.
The Parser
class Parser {
private:
Scanner _scanner;
VarTable &_varTable;
Token _lookAhead;
private:
// constructor.
Parser(std::istream &in, VarTable &varTable):
_scanner(in), _varTable(varTable), _lookAhead(Token::EOT)
{
scan(); // load look ahead initially
}
// calls the scanner to read the next look ahead token.
void scan() { _lookAhead = _scanner.scan(); }
// consumes a specific token.
bool match(Token::Tk tk)
{
if (_lookAhead.tk != tk) {
std::cerr << "SYNTAX ERROR! Unexpected token!" << std::endl;
return false;
}
scan();
return true;
}
// the rules:
std::vector<AST::Expr*> parseProgram()
{
// right recursive rule
// -> can be done as iteration
std::vector<AST::Expr*> pExprs;
for (;;) {
if (AST::Expr *pExpr = parseExpr()) {
pExprs.push_back(pExpr);
} else break;
// special error checking for missing ';' (usual error)
if (_lookAhead.tk != Token::Semicolon) {
std::cerr << "SYNTAX ERROR: Semicolon expected!" << std::endl;
break;
}
scan(); // consume semicolon
if (_lookAhead.tk == Token::EOT) return pExprs;
}
// error handling
for (AST::Expr *pExpr : pExprs) delete pExpr;
pExprs.clear();
return pExprs;
}
AST::Expr* parseExpr()
{
return parseAddExpr();
}
AST::Expr* parseAddExpr()
{
if (AST::Expr *pExpr1 = parseMulExpr()) {
return parseAddExprRest(pExpr1);
} else return nullptr; // ERROR!
}
AST::Expr* parseAddExprRest(AST::Expr *pExpr1)
{
// right recursive rule for left associative operators
// -> can be done as iteration
for (;;) {
switch (_lookAhead.tk) {
case Token::Plus:
scan(); // consume token
if (AST::Expr *pExpr2 = parseMulExpr()) {
pExpr1 = new AST::ExprBinOpAdd(pExpr1, pExpr2);
} else {
delete pExpr1;
return nullptr; // ERROR!
}
break;
case Token::Minus:
scan(); // consume token
if (AST::Expr *pExpr2 = parseMulExpr()) {
pExpr1 = new AST::ExprBinOpSub(pExpr1, pExpr2);
} else {
delete pExpr1;
return nullptr; // ERROR!
}
break;
case Token::Error:
std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
delete pExpr1;
return nullptr;
default: return pExpr1;
}
}
}
AST::Expr* parseMulExpr()
{
if (AST::Expr *pExpr1 = parseUnExpr()) {
return parseMulExprRest(pExpr1);
} else return nullptr; // ERROR!
}
AST::Expr* parseMulExprRest(AST::Expr *pExpr1)
{
// right recursive rule for left associative operators
// -> can be done as iteration
for (;;) {
switch (_lookAhead.tk) {
case Token::Star:
scan(); // consume token
if (AST::Expr *pExpr2 = parseUnExpr()) {
pExpr1 = new AST::ExprBinOpMul(pExpr1, pExpr2);
} else {
delete pExpr1;
return nullptr; // ERROR!
}
break;
case Token::Slash:
scan(); // consume token
if (AST::Expr *pExpr2 = parseUnExpr()) {
pExpr1 = new AST::ExprBinOpDiv(pExpr1, pExpr2);
} else {
delete pExpr1;
return nullptr; // ERROR!
}
break;
case Token::Error:
std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
delete pExpr1;
return nullptr;
default: return pExpr1;
}
}
}
AST::Expr* parseUnExpr()
{
// right recursive rule for right associative operators
// -> must be done as recursion
switch (_lookAhead.tk) {
case Token::Plus:
scan(); // consume token
// as a unary plus has no effect it is simply ignored
return parseUnExpr();
case Token::Minus:
scan();
if (AST::Expr *pExpr = parseUnExpr()) {
return new AST::ExprUnOpNeg(pExpr);
} else return nullptr; // ERROR!
default:
return parsePrimExpr();
}
}
AST::Expr* parsePrimExpr()
{
AST::Expr *pExpr = nullptr;
switch (_lookAhead.tk) {
case Token::Number:
pExpr = new AST::ExprConst(_lookAhead.number);
scan(); // consume token
break;
case Token::Id: {
Var &var = _varTable[_lookAhead.id]; // find or create
pExpr = new AST::ExprVar(&var);
scan(); // consume token
} break;
case Token::LParen:
scan(); // consume token
if (!(pExpr = parseExpr())) return nullptr; // ERROR!
if (!match(Token::RParen)) {
delete pExpr; return nullptr; // ERROR!
}
break;
case Token::EOT:
std::cerr << "SYNTAX ERROR: Premature EOF!" << std::endl;
break;
case Token::Error:
std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
break;
default:
std::cerr << "SYNTAX ERROR: Unexpected token!" << std::endl;
}
return pExpr;
}
public:
// the parser function
static std::vector<AST::Expr*> parse(
std::istream &in, VarTable &varTable)
{
Parser parser(in, varTable);
return parser.parseProgram();
}
};
Basically, the parser consists of a bunch of rule functions (corresponding to the grammar rules) which call each other. The class around the rule functions is responsible for managing some global parser context. Hence, the only public method of class Parser is

static std::vector<AST::Expr*> Parser::parse(std::istream &in, VarTable &varTable);

which constructs an instance (with the private constructor) and calls the function Parser::parseProgram() corresponding to the start symbol program.
Internally, the parser calls the Scanner::scan() method to fill its look-ahead token.
This is done in Parser::scan(), which is called whenever a token has to be consumed.
Looking closer, a pattern becomes visible of how the rules have been translated into the parser functions:
- Each non-terminal on the left side becomes a parse function. (Looking closer at the source code, you will realize that I didn't do this exactly. Some of the rules have been "inlined". – From my point of view, I inserted some extra rules to simplify the grammar which I never intended to transform literally. Sorry.)
- Alternatives (|) are implemented as switch (_lookAhead.tk). Thereby, each case label corresponds to the first terminal(s) (token(s)) to which the left-most symbol may expand. (I believe that's why it is called a "look-ahead parser" – decisions about which rule to apply are always made based on the look-ahead token.) The Dragon Book has a section about FIRST/FOLLOW sets which explains this in more detail.
- For terminal symbols, Parser::scan() is called. In special cases, it is replaced by Parser::match() if precisely one specific terminal (token) is expected.
- For non-terminal symbols, the corresponding parse function is called.
- Symbol sequences are simply implemented as sequences of the above calls.
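As an aside (my own quick calculation, not spelled out in the sample): the FIRST sets that determine those case labels are

FIRST(primExpr)  = { Number, Id, LParen }
FIRST(unaryExpr) = { Plus, Minus, Number, Id, LParen }
FIRST(mulExpr)   = FIRST(addExpr) = FIRST(expr) = FIRST(unaryExpr)

For example, parseUnExpr() branches on Plus and Minus and otherwise falls through to parsePrimExpr(), which mirrors FIRST(unaryExpr) exactly.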
The error handling of this parser is the simplest I have ever done. It could/should be made much more supportive (investing more effort, i.e. additional lines of code), but here I tried to keep it minimal.
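Before the full demo below, here is a minimal sketch (mine, assuming all classes above) of driving the parser with the OP's original expression:

std::istringstream in("a*(b+3);");
VarTable varTable;
std::vector<AST::Expr*> program = Parser::parse(in, varTable);
if (!program.empty()) {
    varTable["a"].set(4.0);
    varTable["b"].set(2.0);
    std::cout << program.front()->solve() << std::endl; // prints 20
}
for (AST::Expr *pExpr : program) delete pExpr;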
Put Together
For testing and demonstration, I prepared a main() function with some built-in samples (the source code of a program and the data to process):
// a sample application
using namespace std;
int main()
{
// the program:
const char *sourceCode =
"1 + 2 * 3 / 4 - 5;\n"
"a + b;\n"
"a - b;\n"
"a * b;\n"
"a / b;\n"
"a * (b + 3);\n";
// the variables
const char *vars[] = { "a", "b" };
enum { nVars = sizeof vars / sizeof *vars };
// the data
const double data[][nVars] = {
{ 4.0, 2.0 },
{ 2.0, 4.0 },
{ 10.0, 5.0 },
{ 42, 6 * 7 }
};
// compile program
stringstream in(sourceCode);
VarTable varTable;
vector<AST::Expr*> program = Parser::parse(in, varTable);
if (program.empty()) {
cerr << "ERROR: Compile failed!" << endl;
string line;
if (getline(in, line)) {
cerr << "Text at error: '" << line << "'" << endl;
}
return 1;
}
// apply program to the data
enum { nDataSets = sizeof data / sizeof *data };
for (size_t i = 0; i < nDataSets; ++i) {
const char *sep = "";
cout << "Data Set:" << endl;
for (size_t j = 0; j < nVars; ++j, sep = ", ") {
cout << sep << vars[j] << ": " << data[i][j];
}
cout << endl;
// load data
for (size_t j = 0; j < nVars; ++j) varTable[vars[j]].set(data[i][j]);
// perform program
cout << "Compute:" << endl;
istringstream in(sourceCode);
for (const AST::Expr *pExpr : program) {
string line; getline(in, line);
cout << line << ": " << pExpr->solve() << endl;
}
cout << endl;
}
// clear the program
for (AST::Expr *pExpr : program) delete pExpr;
program.clear();
// done
return 0;
}
I compiled and tested on VS2013 (Windows 10) and got:
Data Set:
a: 4, b: 2
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 6
a - b;: 2
a * b;: 8
a / b;: 2
a * (b + 3);: 20
Data Set:
a: 2, b: 4
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 6
a - b;: -2
a * b;: 8
a / b;: 0.5
a * (b + 3);: 14
Data Set:
a: 10, b: 5
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 15
a - b;: 5
a * b;: 50
a / b;: 2
a * (b + 3);: 80
Data Set:
a: 42, b: 42
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 84
a - b;: 0
a * b;: 1764
a / b;: 1
a * (b + 3);: 1890
Please consider that the parser itself ignores spaces and line breaks. However, to keep the sample output formatting simple, I had to use one semicolon-terminated expression per line. Otherwise, it would be difficult to associate the source code lines with the corresponding compiled expressions.
(Remember my note above about the Token
to which a source code reference (aka file position) might be added.)
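If you want to add that, a hedged sketch of such an extension (my own naming, not in the ideone sample) could look like this:

// hypothetical extension: a source reference carried by each token
struct SourcePos {
    unsigned line = 1, column = 1;
};
// Token would additionally store a SourcePos which the scanner fills in,
// e.g. SourcePos pos; // position where the token started
// so that error messages and debug output can print the exact origin.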
The Complete Sample
...with source code and test run can be found on ideone.