使用git中的cparse库解析用户输入的字符串 [英] Parsing strings of user input using the cparse library from git

查看:75
本文介绍了使用git中的cparse库解析用户输入的字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我是c ++编程的新手,并希望使用在此处 https中找到的 cparse库 ://github.com/cparse/cparse 在我的项目中.我想解释用户输入的字符串,例如"a*(b+3)"(对于变量ab),并在不同的输入集上重复使用它作为函数.

例如,以一个文本文件作为输入,每行输入2个double数字,我的代码将在每行中写入一个结果为"a*(b+3)"的新文件(假设a是第一个数字,而b是第二个).

当我尝试从git中添加 cparse库时,出现了我的问题.我天真地遵循了安装说明(是git的新手):

$ cd cparse
$ make release

但是我不能像在Windows中那样使用make命令.我尝试下载zip文件并将.cpp.h文件直接复制到项目中,并在Visual Studio中使用"include existing" 选项,但是遇到了大量的编译器错误,并且可以不要让代码自己工作.

我是否以某种方式错过了重点?我如何使它工作?

失败了吗,还有另一种方法来解析用户输入的字符串并将其用作函数吗?

解决方案

失败了吗,还有另一种方法来解析用户输入的字符串并将其用作函数吗?

我想回答您的问题的这一部分,因为我觉得功能齐全的C解析器可能对于您的意图而言过于繁琐. (顺便说一句,一旦您运行了C解析器–如何处理其输出?动态链接?)

相反,我想向您展示如何自行构建一个简单的计算器(带有递归下降解析器).对于我将要使用的技术的文档,我热烈推荐编译器(原理,技术和工具),特别是第4章.

小型计算器项目

在下文中,我将逐部分描述我的示例解决方案.

语言设计

在开始编写编译器或解释器之前,定义一种可接受的语言是合理的.我想使用由以下组成的C:表达式的非常有限的子集:

  • C像浮点数(常数)
  • 类似C的标识符(变量)
  • 一元运算符+-
  • 二进制运算符+-*/
  • 括号()
  • 分号;(用于标记表达式的结尾,是必须的).

空格(包括换行符)将被简单地忽略,但可用于分隔事物并提高人类的可读性. C或C ++之类的注释(以及许多其他糖)我并未考虑将源代码保持在最小限度. (尽管如此,我得到了将近500行.)

OP的特定示例将适合该语言,并添加了分号:

a*(b+3);

将仅支持一种类型:double.因此,我不需要类型或任何使事情变得容易的声明.

在我开始草绘此语言的语法之前,我曾想过目标",并决定为...创建类.

抽象语法树

起初–一个存储变量的类:

// storage of variables

class Var {
  private:
    double _value;
  public:
    Var(): _value() { }
    ~Var() = default;
    double get() const { return _value; }
    void set(double value) { _value = value; }
};

变量为值提供存储,但不为标识符提供存储.后者是单独存储的,因为变量的实际使用不需要它,而只能按名称查找它:

typedef std::map<std::string, Var> VarTable;

使用std::map会自动创建变量.从许多高级语言中可以知道,变量在首次访问时就开始存在.

抽象语法树

以编程语言编写的源代码抽象句法结构的树形表示.树的每个节点表示源代码中出现的构造.

我从上面链接的Wikipedia文章中摘录了这段文字–不能说短一点.在以下有关AST的课程中:

// abstract syntax tree -> storage of "executable"

namespace AST {

class Expr {
  protected:
    Expr() = default;
  public:
    virtual ~Expr() = default;
  public:
    virtual double solve() const = 0;
};

class ExprConst: public Expr {
  private:
    double _value;
  public:
    ExprConst(double value): Expr(), _value(value) { }
    virtual ~ExprConst() = default;
    virtual double solve() const { return _value; }
};

class ExprVar: public Expr {
  private:
    Var *_pVar;
  public:
    ExprVar(Var *pVar): Expr(), _pVar(pVar) { }
    virtual ~ExprVar() = default;
    virtual double solve() const { return _pVar->get(); }
};

class ExprUnOp: public Expr {
  protected:
    Expr *_pArg1;
  protected:
    ExprUnOp(Expr *pArg1): Expr(), _pArg1(pArg1) { }
    virtual ~ExprUnOp() { delete _pArg1; }
};

class ExprUnOpNeg: public ExprUnOp {
  public:
    ExprUnOpNeg(Expr *pArg1): ExprUnOp(pArg1) { }
    virtual ~ExprUnOpNeg() = default;
    virtual double solve() const
    {
      return -_pArg1->solve();
    }
};

class ExprBinOp: public Expr {
  protected:
    Expr *_pArg1, *_pArg2;
  protected:
    ExprBinOp(Expr *pArg1, Expr *pArg2):
      Expr(), _pArg1(pArg1), _pArg2(pArg2)
    { }
    virtual ~ExprBinOp() { delete _pArg1; delete _pArg2; }
};

class ExprBinOpAdd: public ExprBinOp {
  public:
    ExprBinOpAdd(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpAdd() = default;
    virtual double solve() const
    {
      return _pArg1->solve() + _pArg2->solve();
    }
};

class ExprBinOpSub: public ExprBinOp {
  public:
    ExprBinOpSub(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpSub() = default;
    virtual double solve() const
    {
      return _pArg1->solve() - _pArg2->solve();
    }
};

class ExprBinOpMul: public ExprBinOp {
  public:
    ExprBinOpMul(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpMul() = default;
    virtual double solve() const
    {
      return _pArg1->solve() * _pArg2->solve();
    }
};

class ExprBinOpDiv: public ExprBinOp {
  public:
    ExprBinOpDiv(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpDiv() = default;
    virtual double solve() const
    {
      return _pArg1->solve() / _pArg2->solve();
    }
};

因此,使用AST类,样本a*(b+3)的表示应为

VarTable varTable;
Expr *pExpr
= new ExprBinOpMul(
    new ExprVar(&varTable["a"]),
    new ExprBinOpAdd(
      new ExprVar(&varTable["b"]),
      new ExprConst(3)));

注意:

没有从Expr派生的类来表示括号(),因为这完全没有必要.在构建树本身时会考虑对括号的处理.通常,具有较高优先级的运算符将成为具有较低优先级的运算符的子级.结果,前者先于后者计算.在上面的示例中,ExprBinOpAdd的实例是ExprBinOpMul的实例的子级(尽管乘法的优先级高于加法的优先级),这是通过适当考虑括号而得出的.

除了存储已解析的表达式外,还可以通过调用根节点的Expr::solve()方法来使用此树来计算表达式:

double result = pExpr->solve();

具有我们 Tiny Calculator 的后端,接下来是前端.

语法

正式语言最好由语法来描述.

program
  : expr Semicolon program
  | <empty>
  ;

expr
  : addExpr
  ;

addExpr
  : mulExpr addExprRest
  ;

addExprRest
  : addOp mulExpr addExprRest
  | <empty>
  ;

addOp
  : Plus | Minus
  ;

mulExpr
  : unaryExpr mulExprRest
  ;

mulExprRest
  : mulOp unaryExpr mulExprRest
  | <empty>
  ;

mulOp
  : Star | Slash
  ;

unaryExpr
  : unOp unaryExpr
  | primExpr
  ;

unOp
  : Plus | Minus
  ;

primExpr
  : Number
  | Id
  | LParen expr RParen
  ;

以开始符号program.

规则包含

  • 终端符号(以大写字母开头)和
  • 非终端符号(以小写字母开头)
  • 冒号(:),用于将左侧与右侧分开(左侧的非终止符可能会扩展为右侧的符号).
  • 竖线(|)用来分隔替代项
  • 一个<empty>符号,用于扩展为空(用于终止递归).

从终端符号中,我将得出扫描仪的令牌.

非终结符将转换为解析器函数.

addExprmulExpr的分离是有意进行的.因此,乘法运算符优先于加法运算符将被烧掉"语法本身.显然,浮点常量,变量标识符或括号中的表达式(primExpr接受)将具有最高优先级.

规则仅包含正确的递归.这是递归下降解析器的要求(正如我在Dragon书籍中从理论上学到的,并且从调试器的实践经验中学到的,直到我完全理解其原因为止).在递归下降解析器中实施左递归规则会导致未终止的递归,而递归最终会导致 StackOverflow .

扫描仪

通常将编译器分为扫描器和解析器.

扫描程序处理输入字符流,并将字符组合在一起成为令牌.令牌在解析器中用作终端符号.

对于令牌的输出,我制作了一个类.在我的专业项目中,它还会存储确切的文件位置以表示其来源.这很方便使用源代码引用以及错误消息和调试信息的任何输出来标记创建的对象. (...这里尽量将其保持在最小限度...)

// token class - produced in scanner, consumed in parser
struct Token {
  // tokens
  enum Tk {
    Plus, Minus, Star, Slash, LParen, RParen, Semicolon,
    Number, Id,
    EOT, Error
  };
  // token number
  Tk tk;
  // lexem as floating point number
  double number;
  // lexem as identifier
  std::string id;

  // constructors.
  explicit Token(Tk tk): tk(tk), number() { }
  explicit Token(double number): tk(Number), number(number) { }
  explicit Token(const std::string &id): tk(Id), number(), id(id) { }
};

有两个特殊令牌的枚举器:

  • EOT ...文本结尾(表示输入的结尾)
  • Error ...为不适合任何其他标记的任何字符生成.

令牌用作实际扫描仪的输出:

// the scanner - groups characters to tokens
class Scanner {
  private:
    std::istream &_in;
  public:
    // constructor.
    Scanner(std::istream &in): _in(in) { }
    /* groups characters to next token until the first character
     * which does not match (or end-of-file is reached).
     */
    Token scan()
    {
      char c;
      // skip white space
      do {
        if (!(_in >> c)) return Token(Token::EOT);
      } while (isspace(c));
      // classify character and build token
      switch (c) {
        case '+': return Token(Token::Plus);
        case '-': return Token(Token::Minus);
        case '*': return Token(Token::Star);
        case '/': return Token(Token::Slash);
        case '(': return Token(Token::LParen);
        case ')': return Token(Token::RParen);
        case ';': return Token(Token::Semicolon);
        default:
          if (isdigit(c)) {
            _in.unget(); double value; _in >> value;
            return Token(value);
          } else if (isalpha(c) || c == '_') {
            std::string id(1, c);
            while (_in >> c) {
              if (isalnum(c) || c == '_') id += c;
              else { _in.unget(); break; }
            }
            return Token(id);
          } else {
            _in.unget();
            return Token(Token::Error);
          }
      }
    }
};

解析器中使用了扫描仪.

解析器

class Parser {
  private:
    Scanner _scanner;
    VarTable &_varTable;
    Token _lookAhead;

  private:
    // constructor.
    Parser(std::istream &in, VarTable &varTable):
      _scanner(in), _varTable(varTable), _lookAhead(Token::EOT)
    {
      scan(); // load look ahead initially
    }

    // calls the scanner to read the next look ahead token.
    void scan() { _lookAhead = _scanner.scan(); }

    // consumes a specific token.
    bool match(Token::Tk tk)
    {
      if (_lookAhead.tk != tk) {
        std::cerr << "SYNTAX ERROR! Unexpected token!" << std::endl;
        return false;
      }
      scan();
      return true;
    }

    // the rules:

    std::vector<AST::Expr*> parseProgram()
    {
      // right recursive rule
      // -> can be done as iteration
      std::vector<AST::Expr*> pExprs;
      for (;;) {
        if (AST::Expr *pExpr = parseExpr()) {
          pExprs.push_back(pExpr);
        } else break;
        // special error checking for missing ';' (usual error)
        if (_lookAhead.tk != Token::Semicolon) {
          std::cerr << "SYNTAX ERROR: Semicolon expected!" << std::endl;
          break;
        }
        scan(); // consume semicolon
        if (_lookAhead.tk == Token::EOT) return pExprs;
      }
      // error handling
      for (AST::Expr *pExpr : pExprs) delete pExpr;
      pExprs.clear();
      return pExprs;
    }

    AST::Expr* parseExpr()
    {
      return parseAddExpr();
    }

    AST::Expr* parseAddExpr()
    {
      if (AST::Expr *pExpr1 = parseMulExpr()) {
        return parseAddExprRest(pExpr1);
      } else return nullptr; // ERROR!
    }

    AST::Expr* parseAddExprRest(AST::Expr *pExpr1)
    {
      // right recursive rule for left associative operators
      // -> can be done as iteration
      for (;;) {
        switch (_lookAhead.tk) {
          case Token::Plus:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseMulExpr()) {
              pExpr1 = new AST::ExprBinOpAdd(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Minus:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseMulExpr()) {
              pExpr1 = new AST::ExprBinOpSub(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Error:
            std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
            delete pExpr1;
            return nullptr;
          default: return pExpr1;
        }
      }
    }

    AST::Expr* parseMulExpr()
    {
      if (AST::Expr *pExpr1 = parseUnExpr()) {
        return parseMulExprRest(pExpr1);
      } else return nullptr; // ERROR!
    }

    AST::Expr* parseMulExprRest(AST::Expr *pExpr1)
    {
      // right recursive rule for left associative operators
      // -> can be done as iteration
      for (;;) {
        switch (_lookAhead.tk) {
          case Token::Star:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseUnExpr()) {
              pExpr1 = new AST::ExprBinOpMul(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Slash:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseUnExpr()) {
              pExpr1 = new AST::ExprBinOpDiv(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Error:
            std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
            delete pExpr1;
            return nullptr;
          default: return pExpr1;
        }
      }
    }

    AST::Expr* parseUnExpr()
    {
      // right recursive rule for right associative operators
      // -> must be done as recursion
      switch (_lookAhead.tk) {
        case Token::Plus:
          scan(); // consume token
          // as a unary plus has no effect it is simply ignored
          return parseUnExpr();
        case Token::Minus:
          scan();
          if (AST::Expr *pExpr = parseUnExpr()) {
            return new AST::ExprUnOpNeg(pExpr);
          } else return nullptr; // ERROR!
        default:
          return parsePrimExpr();
      }
    }

    AST::Expr* parsePrimExpr()
    {
      AST::Expr *pExpr = nullptr;
      switch (_lookAhead.tk) {
        case Token::Number:
          pExpr = new AST::ExprConst(_lookAhead.number);
          scan(); // consume token
          break;
        case Token::Id: {
          Var &var = _varTable[_lookAhead.id]; // find or create
          pExpr = new AST::ExprVar(&var);
          scan(); // consume token
        } break;
        case Token::LParen:
          scan(); // consume token
          if (!(pExpr = parseExpr())) return nullptr; // ERROR!
          if (!match(Token::RParen)) {
            delete pExpr; return nullptr; // ERROR!
          }
          break;
        case Token::EOT:
          std::cerr << "SYNTAX ERROR: Premature EOF!" << std::endl;
          break;
        case Token::Error:
          std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
          break;
        default:
          std::cerr << "SYNTAX ERROR: Unexpected token!" << std::endl;
      }
      return pExpr;
    }

  public:

    // the parser function
    static std::vector<AST::Expr*> parse(
      std::istream &in, VarTable &varTable)
    {
      Parser parser(in, varTable);
      return parser.parseProgram();
    }
};

基本上,解析器基本上由一堆相互调用的规则函数(根据语法规则)组成.围绕规则函数的类负责管理某些全局解析器上下文.因此,class Parser的唯一公共方法是

static std::vector<AST::Expr*> Parser::parse();

构造一个实例(使用私有构造函数),并调用与起始符号program相对应的函数Parser::parseProgram().

在内部,解析器调用Scanner::scan()方法来填充其提前标记.

这是在Parser::scan()中完成的,当必须使用令牌时,始终会调用它.

仔细观察,就会发现一种模式,这些规则是如何将规则转换为解析器功能的:

  • 左侧的每个非终端变为解析函数. (仔细查看源代码,您会发现我没有完全做到这一点.有些规则已内联".–从我的角度来看,我插入了一些额外的规则来简化我没有做过的语法不想从一开始就转型.抱歉.)

  • 替代项(|)被实现为switch (_lookAhead.tk).因此,每个案例标签对应于最左边的符号可以扩展到的第一终端(令牌). (我相信这就是为什么它被称为超前解析器"的原因-应用规则的决定总是基于前瞻标记.)龙书中有一个关于FIRST-FOLLOW集合的主题,对此进行了更详细的解释. /p>

  • 对于端子符号,调用Parser::scan().在特殊情况下,如果只需要一个终端(令牌),则将其替换为Parser::match().

  • 对于非终结符,将调用相应的函数.

  • 符号序列只是上述调用的序列.

这个解析器的错误处理是我做过的最简单的事情.可以/应该做更多的支持(投入更多的精力,即增加代码行). (...但是在这里,我试图将其保持在最小限度...)

放在一起

为了进行测试和演示,我准备了main()函数,其中包含一些内置示例(程序源代码和要处理的数据):

// a sample application

using namespace std;

int main()
{
  // the program:
  const char *sourceCode =
    "1 + 2 * 3 / 4 - 5;\n"
    "a + b;\n"
    "a - b;\n"
    "a * b;\n"
    "a / b;\n"
    "a * (b + 3);\n";
  // the variables
  const char *vars[] = { "a", "b" };
  enum { nVars = sizeof vars / sizeof *vars };
  // the data
  const double data[][nVars] = {
    { 4.0, 2.0 },
    { 2.0, 4.0 },
    { 10.0, 5.0 },
    { 42, 6 * 7 }
  };
  // compile program
  stringstream in(sourceCode);
  VarTable varTable;
  vector<AST::Expr*> program = Parser::parse(in, varTable);
  if (program.empty()) {
    cerr << "ERROR: Compile failed!" << endl;
    string line;
    if (getline(in, line)) {
      cerr << "Text at error: '" << line << "'" << endl;
    }
    return 1;
  }
  // apply program to the data
  enum { nDataSets = sizeof data / sizeof *data };
  for (size_t i = 0; i < nDataSets; ++i) {
    const char *sep = "";
    cout << "Data Set:" << endl;
    for (size_t j = 0; j < nVars; ++j, sep = ", ") {
      cout << sep << vars[j] << ": " << data[i][j];
    }
    cout << endl;
    // load data
    for (size_t j = 0; j < nVars; ++j) varTable[vars[j]].set(data[i][j]);
    // perform program
    cout << "Compute:" << endl;
    istringstream in(sourceCode);
    for (const AST::Expr *pExpr : program) {
      string line; getline(in, line);
      cout << line << ": " << pExpr->solve() << endl;
    }
    cout << endl;
  }
  // clear the program
  for (AST::Expr *pExpr : program) delete pExpr;
  program.clear();
  // done
  return 0;  
}

我在VS2013(Windows 10)上进行了编译和测试,并得到:

 Data Set:
a: 4, b: 2
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 6
a - b;: 2
a * b;: 8
a / b;: 2
a * (b + 3);: 20

Data Set:
a: 2, b: 4
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 6
a - b;: -2
a * b;: 8
a / b;: 0.5
a * (b + 3);: 14

Data Set:
a: 10, b: 5
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 15
a - b;: 5
a * b;: 50
a / b;: 2
a * (b + 3);: 80

Data Set:
a: 42, b: 42
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 84
a - b;: 0
a * b;: 1764
a / b;: 1
a * (b + 3);: 1890
 

请考虑解析器本身会忽略任何空格和换行符.但是,为了简化示例输出格式,我必须每行使用一个以分号结尾的表达式.否则,将很难将源代码行与相应的编译表达式相关联. (记住我上面关于Token可能会添加源代码引用(也就是文件位置)的注释.)

完整样本

...带有源代码和测试运行,可以在 ideone 上找到

I am new to c++ programming and wish to use the cparse library found here https://github.com/cparse/cparse in my project. I want to interpret strings of user input like "a*(b+3)" (for variables a and b) and use it as a function repeatedly on different sets of input.

For example, taking a text file as input with 2 double numbers per line my code will write a new file with the result of "a*(b+3)" on each line (assuming a is the first number and b is the second).

My problem arises when I try and include the cparse library from git. I followed the setup instructions naively (being new to git):

$ cd cparse
$ make release

But I cannot use the make command as I am using windows. I tried downloading the zip file and copying the .cpp and .h files into the project directly and using the "include existing" option in visual studio, but get a huge number of compiler errors and can't make the code work myself.

Am I missing the point somehow? How do I get it to work?

Failing that, is there another way to parse strings of user input and use them as functions?

解决方案

Failing that, is there another way to parse strings of user input and use them as functions?

I would like to answer this part of your question because I have the feeling that a full-featured C parser might be a little bit too heavy for what might be your intention. (Btw. once you got the C parser running – how to process its output? Dynamic linking?)

Instead, I want to show you how to build a simple calculator (with recursive descent parser) on your own. For the documentation of the techniques I will use, I warmly recommend Compilers (Principles, Techniques & Tools) by Aho, Lam, Sethi, Ullman (better known as "Dragon books") and especially Chapter 4.

The Tiny Calculator Project

In the following I describe my sample solution part by part.

Language Design

Before starting to write a compiler or interpreter, it's reasonable to define a language which shall be accepted. I want to use a very limited sub-set of C: expressions consisting of

  • C like floating point numbers (constants)
  • C like identifiers (variables)
  • unary operators + and -
  • binary operators +, -, *, and /
  • parentheses ()
  • semicolons ; (to mark the end of an expression, mandatory).

Whitespaces (including line-breaks) will be simply ignored but may be used to separate things as well as to improve human readability. C or C++ like comments (and a lot of other sugar) I didn't consider to keep the source code as minimal as possible. (For all that, I got nearly 500 lines.)

The specific example of the OP will fit into this language with an added semicolon:

a*(b+3);

There will be only one type supported: double. Thus, I don't need types or any declaration which makes things easier.

Before I started to sketch the grammar of this language, I was thinking about the "target" of compiling and decided to make classes for...

An Abstract Syntax Tree

At first – a class to store variables:

// storage of variables

class Var {
  private:
    double _value;
  public:
    Var(): _value() { }
    ~Var() = default;
    double get() const { return _value; }
    void set(double value) { _value = value; }
};

The variables provide storage for a value but not for the identifier. The latter is stored separately as it is not needed for the actual usage of a variable but only to find it by name:

typedef std::map<std::string, Var> VarTable;

The usage of a std::map automates the creation of a variable. As known from many high level languages, a variable starts to exist when its accessed the first time.

The Abstract Syntax Tree is a

a tree representation of the abstract syntactic structure of source code written in a programming language. Each node of the tree denotes a construct occurring in the source code.

I took this text from the above linked Wikipedia article – couldn't said it shorter. In the following my classes for the AST:

// abstract syntax tree -> storage of "executable"

namespace AST {

class Expr {
  protected:
    Expr() = default;
  public:
    virtual ~Expr() = default;
  public:
    virtual double solve() const = 0;
};

class ExprConst: public Expr {
  private:
    double _value;
  public:
    ExprConst(double value): Expr(), _value(value) { }
    virtual ~ExprConst() = default;
    virtual double solve() const { return _value; }
};

class ExprVar: public Expr {
  private:
    Var *_pVar;
  public:
    ExprVar(Var *pVar): Expr(), _pVar(pVar) { }
    virtual ~ExprVar() = default;
    virtual double solve() const { return _pVar->get(); }
};

class ExprUnOp: public Expr {
  protected:
    Expr *_pArg1;
  protected:
    ExprUnOp(Expr *pArg1): Expr(), _pArg1(pArg1) { }
    virtual ~ExprUnOp() { delete _pArg1; }
};

class ExprUnOpNeg: public ExprUnOp {
  public:
    ExprUnOpNeg(Expr *pArg1): ExprUnOp(pArg1) { }
    virtual ~ExprUnOpNeg() = default;
    virtual double solve() const
    {
      return -_pArg1->solve();
    }
};

class ExprBinOp: public Expr {
  protected:
    Expr *_pArg1, *_pArg2;
  protected:
    ExprBinOp(Expr *pArg1, Expr *pArg2):
      Expr(), _pArg1(pArg1), _pArg2(pArg2)
    { }
    virtual ~ExprBinOp() { delete _pArg1; delete _pArg2; }
};

class ExprBinOpAdd: public ExprBinOp {
  public:
    ExprBinOpAdd(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpAdd() = default;
    virtual double solve() const
    {
      return _pArg1->solve() + _pArg2->solve();
    }
};

class ExprBinOpSub: public ExprBinOp {
  public:
    ExprBinOpSub(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpSub() = default;
    virtual double solve() const
    {
      return _pArg1->solve() - _pArg2->solve();
    }
};

class ExprBinOpMul: public ExprBinOp {
  public:
    ExprBinOpMul(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpMul() = default;
    virtual double solve() const
    {
      return _pArg1->solve() * _pArg2->solve();
    }
};

class ExprBinOpDiv: public ExprBinOp {
  public:
    ExprBinOpDiv(Expr *pArg1, Expr *pArg2): ExprBinOp(pArg1, pArg2) { }
    virtual ~ExprBinOpDiv() = default;
    virtual double solve() const
    {
      return _pArg1->solve() / _pArg2->solve();
    }
};

Thus, using the AST classes, the representation of sample a*(b+3) would be

VarTable varTable;
Expr *pExpr
= new ExprBinOpMul(
    new ExprVar(&varTable["a"]),
    new ExprBinOpAdd(
      new ExprVar(&varTable["b"]),
      new ExprConst(3)));

Note:

There is no class derived from Expr to represent the parentheses () because this is simply not necessary. The processing of parentheses is considered when building the tree itself. Normally, the operators with higher precedence become children of the operators with lower precedence. As a result, the former are computed before the latter. In the above example, the instance of ExprBinOpAdd is a child of the instance of ExprBinOpMul (although precedence of multiply is higher than precedence of add) which results from the proper consideration of the parentheses.

Beside of storing a parsed expression, this tree can be used to compute the expression by calling the Expr::solve() method of the root node:

double result = pExpr->solve();

Having a backend for our Tiny Calculator, next is the front-end.

A Grammar

A formal language is best described by a grammar.

program
  : expr Semicolon program
  | <empty>
  ;

expr
  : addExpr
  ;

addExpr
  : mulExpr addExprRest
  ;

addExprRest
  : addOp mulExpr addExprRest
  | <empty>
  ;

addOp
  : Plus | Minus
  ;

mulExpr
  : unaryExpr mulExprRest
  ;

mulExprRest
  : mulOp unaryExpr mulExprRest
  | <empty>
  ;

mulOp
  : Star | Slash
  ;

unaryExpr
  : unOp unaryExpr
  | primExpr
  ;

unOp
  : Plus | Minus
  ;

primExpr
  : Number
  | Id
  | LParen expr RParen
  ;

with start symbol program.

The rules contain

  • terminal symbols (starting with uppercase letter) and
  • non-terminal symbols (starting with lowercase)
  • a colon (:) to separate left side from right side (non-terminal on left side may expand to the symbols on the right side).
  • vertical bars (|) to separate alternatives
  • an <empty> symbol for expanding to nothing (used to terminate recursions).

From the terminal symbols, I will derive the tokens for the scanner.

The non-terminal symbols will be transformed into the parser functions.

The separation of addExpr and mulExpr has been done intentionally. Thus, the precedence of multiplicative operators over additive operators will be "burnt in" the grammar itself. Obviously, the floating point constants, variable identifiers, or expressions in parentheses (accepted in primExpr) will have highest precedence.

The rules contain only right recursions. This is a requirement for recursive-descent parsers (as I've learnt theoretically in the Dragon books and from practical experiences in the debugger until I fully understood the reason why). Implementing a left recursive rule in a recursive-descent parser results in non-terminated recursions which, in turn, end up with a StackOverflow.

The Scanner

It's usual to split the compiler in a scanner and a parser.

The Scanner processes the input character stream and groups characters together to tokens. The tokens are used as terminal symbols in the parser.

For the output of tokens, I made a class. In my professional projects, it stores additionally the exact file position to denote its origin. This is convenient to tag created objects with source code references as well as any output of error messages and debugging info. (...left out here to keep it as minimal as possible...)

// token class - produced in scanner, consumed in parser
struct Token {
  // tokens
  enum Tk {
    Plus, Minus, Star, Slash, LParen, RParen, Semicolon,
    Number, Id,
    EOT, Error
  };
  // token number
  Tk tk;
  // lexem as floating point number
  double number;
  // lexem as identifier
  std::string id;

  // constructors.
  explicit Token(Tk tk): tk(tk), number() { }
  explicit Token(double number): tk(Number), number(number) { }
  explicit Token(const std::string &id): tk(Id), number(), id(id) { }
};

There are two enumerators for special tokens:

  • EOT ... end of text (remarks the end of input)
  • Error ... generated for any character which does not fit in any other token.

The tokens are used as output of the actual scanner:

// the scanner - groups characters to tokens
class Scanner {
  private:
    std::istream &_in;
  public:
    // constructor.
    Scanner(std::istream &in): _in(in) { }
    /* groups characters to next token until the first character
     * which does not match (or end-of-file is reached).
     */
    Token scan()
    {
      char c;
      // skip white space
      do {
        if (!(_in >> c)) return Token(Token::EOT);
      } while (isspace(c));
      // classify character and build token
      switch (c) {
        case '+': return Token(Token::Plus);
        case '-': return Token(Token::Minus);
        case '*': return Token(Token::Star);
        case '/': return Token(Token::Slash);
        case '(': return Token(Token::LParen);
        case ')': return Token(Token::RParen);
        case ';': return Token(Token::Semicolon);
        default:
          if (isdigit(c)) {
            _in.unget(); double value; _in >> value;
            return Token(value);
          } else if (isalpha(c) || c == '_') {
            std::string id(1, c);
            while (_in >> c) {
              if (isalnum(c) || c == '_') id += c;
              else { _in.unget(); break; }
            }
            return Token(id);
          } else {
            _in.unget();
            return Token(Token::Error);
          }
      }
    }
};

The scanner is used in the parser.

The Parser

class Parser {
  private:
    Scanner _scanner;
    VarTable &_varTable;
    Token _lookAhead;

  private:
    // constructor.
    Parser(std::istream &in, VarTable &varTable):
      _scanner(in), _varTable(varTable), _lookAhead(Token::EOT)
    {
      scan(); // load look ahead initially
    }

    // calls the scanner to read the next look ahead token.
    void scan() { _lookAhead = _scanner.scan(); }

    // consumes a specific token.
    bool match(Token::Tk tk)
    {
      if (_lookAhead.tk != tk) {
        std::cerr << "SYNTAX ERROR! Unexpected token!" << std::endl;
        return false;
      }
      scan();
      return true;
    }

    // the rules:

    std::vector<AST::Expr*> parseProgram()
    {
      // right recursive rule
      // -> can be done as iteration
      std::vector<AST::Expr*> pExprs;
      for (;;) {
        if (AST::Expr *pExpr = parseExpr()) {
          pExprs.push_back(pExpr);
        } else break;
        // special error checking for missing ';' (usual error)
        if (_lookAhead.tk != Token::Semicolon) {
          std::cerr << "SYNTAX ERROR: Semicolon expected!" << std::endl;
          break;
        }
        scan(); // consume semicolon
        if (_lookAhead.tk == Token::EOT) return pExprs;
      }
      // error handling
      for (AST::Expr *pExpr : pExprs) delete pExpr;
      pExprs.clear();
      return pExprs;
    }

    AST::Expr* parseExpr()
    {
      return parseAddExpr();
    }

    AST::Expr* parseAddExpr()
    {
      if (AST::Expr *pExpr1 = parseMulExpr()) {
        return parseAddExprRest(pExpr1);
      } else return nullptr; // ERROR!
    }

    AST::Expr* parseAddExprRest(AST::Expr *pExpr1)
    {
      // right recursive rule for left associative operators
      // -> can be done as iteration
      for (;;) {
        switch (_lookAhead.tk) {
          case Token::Plus:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseMulExpr()) {
              pExpr1 = new AST::ExprBinOpAdd(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Minus:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseMulExpr()) {
              pExpr1 = new AST::ExprBinOpSub(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Error:
            std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
            delete pExpr1;
            return nullptr;
          default: return pExpr1;
        }
      }
    }

    AST::Expr* parseMulExpr()
    {
      if (AST::Expr *pExpr1 = parseUnExpr()) {
        return parseMulExprRest(pExpr1);
      } else return nullptr; // ERROR!
    }

    AST::Expr* parseMulExprRest(AST::Expr *pExpr1)
    {
      // right recursive rule for left associative operators
      // -> can be done as iteration
      for (;;) {
        switch (_lookAhead.tk) {
          case Token::Star:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseUnExpr()) {
              pExpr1 = new AST::ExprBinOpMul(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Slash:
            scan(); // consume token
            if (AST::Expr *pExpr2 = parseUnExpr()) {
              pExpr1 = new AST::ExprBinOpDiv(pExpr1, pExpr2);
            } else {
              delete pExpr1;
              return nullptr; // ERROR!
            }
            break;
          case Token::Error:
            std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
            delete pExpr1;
            return nullptr;
          default: return pExpr1;
        }
      }
    }

    AST::Expr* parseUnExpr()
    {
      // right recursive rule for right associative operators
      // -> must be done as recursion
      switch (_lookAhead.tk) {
        case Token::Plus:
          scan(); // consume token
          // as a unary plus has no effect it is simply ignored
          return parseUnExpr();
        case Token::Minus:
          scan();
          if (AST::Expr *pExpr = parseUnExpr()) {
            return new AST::ExprUnOpNeg(pExpr);
          } else return nullptr; // ERROR!
        default:
          return parsePrimExpr();
      }
    }

    AST::Expr* parsePrimExpr()
    {
      AST::Expr *pExpr = nullptr;
      switch (_lookAhead.tk) {
        case Token::Number:
          pExpr = new AST::ExprConst(_lookAhead.number);
          scan(); // consume token
          break;
        case Token::Id: {
          Var &var = _varTable[_lookAhead.id]; // find or create
          pExpr = new AST::ExprVar(&var);
          scan(); // consume token
        } break;
        case Token::LParen:
          scan(); // consume token
          if (!(pExpr = parseExpr())) return nullptr; // ERROR!
          if (!match(Token::RParen)) {
            delete pExpr; return nullptr; // ERROR!
          }
          break;
        case Token::EOT:
          std::cerr << "SYNTAX ERROR: Premature EOF!" << std::endl;
          break;
        case Token::Error:
          std::cerr << "SYNTAX ERROR: Unexpected character!" << std::endl;
          break;
        default:
          std::cerr << "SYNTAX ERROR: Unexpected token!" << std::endl;
      }
      return pExpr;
    }

  public:

    // the parser function
    static std::vector<AST::Expr*> parse(
      std::istream &in, VarTable &varTable)
    {
      Parser parser(in, varTable);
      return parser.parseProgram();
    }
};

Basically, the parser consists essentially of a bunch of rule functions (according to the grammar rules) which are calling each other. The class around the rule functions is responsible to manage some global parser context. Hence, the only public method of class Parser is

static std::vector<AST::Expr*> Parser::parse();

which constructs an instance (with the private constructor) and calls the function Parser::parseProgram() corresponding to the start symbol program.

Internally, the parser calls the Scanner::scan() method to fill its look-ahead token.

This is done in Parser::scan() which is called always when a token has to be consumed.

Looking closer, a pattern becomes visible how the rules have been translated into the parser functions:

  • Each non-terminal on the left side becomes a parse function. (Looking closer in the source code, you will realize that I didn't do this exactly. Some of the rules have been "inlined". – From my point of view, I inserted some extra rules to simplify the grammar which I didn't intend to transform from beginning. Sorry.)

  • Alternatives (|) are implemented as switch (_lookAhead.tk). Thereby, each case label corresponds to the first terminal(s) (token(s)) to which the left most symbol may expand. (I believe that's why it is called "look-ahead parser" – decisions to apply rules are always done based on the look ahead token.) The dragon book has a subject about FIRST-FOLLOW sets which explains this in more detail.

  • For terminal symbols, Parser::scan() is called. In special cases, it is replaced by Parser::match() if precisely one terminal (token) is expected.

  • For non-terminal symbols, the call of the corresponding function is done.

  • Symbol sequences are simply done as sequences of the above calls.

The error-handling of this parser is the most simple I ever did. It could/should be done much more supporting (investing more effort, i.e. additional lines of code). (...but here I tried to keep it minimal...)

Put Together

For testing and demonstration, I prepared a main() function with some built-in samples (source code of program and data to process):

// a sample application

using namespace std;

int main()
{
  // the program:
  const char *sourceCode =
    "1 + 2 * 3 / 4 - 5;\n"
    "a + b;\n"
    "a - b;\n"
    "a * b;\n"
    "a / b;\n"
    "a * (b + 3);\n";
  // the variables
  const char *vars[] = { "a", "b" };
  enum { nVars = sizeof vars / sizeof *vars };
  // the data
  const double data[][nVars] = {
    { 4.0, 2.0 },
    { 2.0, 4.0 },
    { 10.0, 5.0 },
    { 42, 6 * 7 }
  };
  // compile program
  stringstream in(sourceCode);
  VarTable varTable;
  vector<AST::Expr*> program = Parser::parse(in, varTable);
  if (program.empty()) {
    cerr << "ERROR: Compile failed!" << endl;
    string line;
    if (getline(in, line)) {
      cerr << "Text at error: '" << line << "'" << endl;
    }
    return 1;
  }
  // apply program to the data
  enum { nDataSets = sizeof data / sizeof *data };
  for (size_t i = 0; i < nDataSets; ++i) {
    const char *sep = "";
    cout << "Data Set:" << endl;
    for (size_t j = 0; j < nVars; ++j, sep = ", ") {
      cout << sep << vars[j] << ": " << data[i][j];
    }
    cout << endl;
    // load data
    for (size_t j = 0; j < nVars; ++j) varTable[vars[j]].set(data[i][j]);
    // perform program
    cout << "Compute:" << endl;
    istringstream in(sourceCode);
    for (const AST::Expr *pExpr : program) {
      string line; getline(in, line);
      cout << line << ": " << pExpr->solve() << endl;
    }
    cout << endl;
  }
  // clear the program
  for (AST::Expr *pExpr : program) delete pExpr;
  program.clear();
  // done
  return 0;  
}

I compiled and tested on VS2013 (Windows 10) and got:

Data Set:
a: 4, b: 2
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 6
a - b;: 2
a * b;: 8
a / b;: 2
a * (b + 3);: 20

Data Set:
a: 2, b: 4
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 6
a - b;: -2
a * b;: 8
a / b;: 0.5
a * (b + 3);: 14

Data Set:
a: 10, b: 5
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 15
a - b;: 5
a * b;: 50
a / b;: 2
a * (b + 3);: 80

Data Set:
a: 42, b: 42
Compute:
1 + 2 * 3 / 4 - 5;: -2.5
a + b;: 84
a - b;: 0
a * b;: 1764
a / b;: 1
a * (b + 3);: 1890

Please, consider that the parser itself ignores any spaces and line-breaks. However, to make the sample output formatting simple, I have to use one semicolon-terminated expression per line. Otherwise, it will be difficult to associate the source code lines with the corresponding compiled expressions. (Remember my note above about the Token to which a source code reference (aka file position) might be added.)

The Complete Sample

...with source code and test run can be found on ideone.

这篇关于使用git中的cparse库解析用户输入的字符串的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆