使用标记列表构建抽象语法树 [英] Constructing an Abstract Syntax Tree with a list of Tokens

查看:33
本文介绍了使用标记列表构建抽象语法树的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想从令牌列表中构建一个 AST.我正在制作一种脚本语言,并且已经完成了词法分析部分,但是我不知道如何创建 AST.所以问题是,我该如何处理这样的事情:

I want to construct an AST from a list of tokens. I'm making a scripting language and I've already done the lexical analysis part, but I have no idea how to create an AST. So the question is, how do I take something like this:

WORD, int
WORD, x
SYMBOL, =
NUMBER, 5
SYMBOL, ;

并将其转换为抽象语法树?最好,我希望没有像ANTLR这样的库或其他任何东西,我宁愿自己尝试从头开始.但是,如果这是一项非常复杂的任务,我不介意使用库:) 谢谢

and convert it into an Abstract Syntax Tree? Preferably, I'd like to do so without a library like ANTLR or whatever, I'd rather try and do it from scratch myself. However, if it's a really complex task, I don't mind using a library :) Thanks

推荐答案

基本的技巧是认识到解析,无论如何完成,都是以增量步骤发生的,包括一个一个地读取标记.

The fundamental trick is to recognize that parsing, however accomplished, happens in incremental steps, including the reading of the tokens one by one.

在每个增量步骤中,都有机会通过组合其他增量步骤构建的 AST 片段来构建部分 AST.这是一个递归的想法,它在为被扫描的令牌构建 AST 叶节点时达到最低点.这个基本思想几乎出现在所有构建 AST 的解析器中.

At each incremental step, there is an opportunity to build part of the AST by combining AST fragments built by other incremental steps. This is a recursive idea, and it bottoms out in building AST leaf nodes for tokens as they are scanned. This basic idea occurs in pretty much all AST-building parsers.

如果构建一个递归下降解析器,那么实际上构建了一个递归过程的协作系统,每个递归过程都识别出正在实现的任何语法中的非终结符.对于纯解析,每个过程只返回一个布尔值来表示非终结符(未识别)".

If one builds a recursive descent parser, one in effect builds a cooperating system of recursive procedures, each one of which recognizes a nonterminal in whatever grammar is being implemented. For pure parsing, each procedure simply returns a boolean for "nonterminal (not) recognized".

为了使用递归下降解析器构建 AST,设计这些过程以返回两个值:已识别"的布尔值,以及(如果被识别)为非终结符(以某种方式)构造的 AST.(一个常见的 hack 是返回一个指针,对于未识别"是无效的,或者如果已识别"则指向构造的 AST).构建单个过程的结果 AST 的方式是组合来自它调用的子过程的 AST.这对于叶子过程来说是非常简单的,它读取输入标记并可以立即构建一棵树.

To build an AST with a recursive descent parser, one designs these procedures to return two values: the boolean "recognized", and, if recognized, an AST constructed (somehow) for the nonterminal. (A common hack is return a pointer, which is void for "not recognized", or points to the constructed AST if "recognized"). The way the resulting AST for a single procedure is built, is by combining the ASTs from the sub-procedures that it invokes. This is pretty trivial to do for leaf procedures, which read an input token and can immediately build a tree.

所有这些的缺点是必须手动编码递归下降,并通过树构建步骤对其进行扩充.总的来说,这实际上很容易为小型语法编写代码.

The downside to all this is one must manually code the recursive descent, and augment it with the tree building steps. In the grand scheme of things, this is actually pretty easy to code for small grammars.

对于 OP 的示例,假设我们有以下语法:

For OP's example, assume we have this grammar:

GOAL = ASSIGNMENT 
ASSIGNMENT = LHS '=' RHS ';' 
LHS = IDENTIFIER 
RHS = IDENTIFIER | NUMBER

好的,我们的递归下降解析器:

OK, our recursive descent parser:

boolean parse_Goal()
{  if parse_Assignement()
   then return true
   else return false
}

boolean parse_Assignment()
{  if not Parse_LHS()
   then return false
   if not Parse_equalsign()
   then throw SyntaxError // because there are no viable alternatives from here
   if not Parse_RHS()
   then throw SyntaxError
   if not Parse_semicolon()
   the throw SyntaxError
   return true
}

boolean parse_LHS()
{  if parse_IDENTIFIER()
   then return true
   else return false
}

boolean parse_RHS()
{  if parse_IDENTIFIER()
   then return true
   if parse_NUMBER()
   then return true
   else return false
}

boolean parse_equalsign()
{  if TestInputAndAdvance("=")  // this can check for token instead
   then return true
   else return false
}

boolean parse_semicolon()
{  if TestInputAndAdvance(";")
   then return true
   else return false
}

boolean parse_IDENTIFIER()
{  if TestInputForIdentifier()
   then return true
   else return false
}

boolean parse_NUMBER()
{  if TestInputForNumber()
   then return true
   else return false
}

现在,让我们修改它构建一个抽象语法树:

Now, let's revise it build a abstract syntax tree:

AST* parse_Goal() // note: we choose to return a null pointer for "false"
{  node = parse_Assignment()
   if node != NULL
   then return node
   else return NULL
}

AST* parse_Assignment()
{  LHSnode = Parse_LHS()
   if LHSnode == NULL
   then return NULL
   EqualNode = Parse_equalsign()
   if EqualNode == NULL
   then throw SyntaxError // because there are no viable alternatives from here
   RHSnode = Parse_RHS()
   if RHSnode == NULL
   then throw SyntaxError
   SemicolonNode = Parse_semicolon()
   if SemicolonNode == NULL
   the throw SyntaxError
   return makeASTNode(ASSIGNMENT,LHSNode,RHSNode)
}

AST* parse_LHS()
{  IdentifierNode = parse_IDENTIFIER()
   if node != NULL
   then return IdentifierNode
   else return NULL
}

AST* parse_RHS()
{  RHSnode = parse_IDENTIFIER()
   if RHSnode != null
   then return RHSnode
   RHSnode = parse_NUMBER()
   if RHSnode != null
   then return RHSnode
   else return NULL
}

AST* parse_equalsign()
{  if TestInputAndAdvance("=")  // this can check for token instead
   then return makeASTNode("=")
   else return NULL
}

AST* parse_semicolon()
{  if TestInputAndAdvance(";")
   then return makeASTNode(";")
   else return NULL
}

AST* parse_IDENTIFIER()
{  text = TestInputForIdentifier()
   if text != NULL
   then return makeASTNode("IDENTIFIER",text)
   else return NULL
}

AST* parse_NUMBER()
{  text = TestInputForNumber()
   if text != NULL
   then return makeASTNode("NUMBER",text)
   else return NULL
}

我显然掩盖了一些细节,但我认为读者填写它们不会有问题.

I've obviously glossed over some details, but I assume the reader will have no trouble filling them in.

JavaCC 和 ANTLR 等解析器生成器工具基本上会生成递归下降解析器,并且具有构建与此非常相似的树的功能.

Parser generator tools like JavaCC and ANTLR basically generate recursive descent parsers, and have facilities for constructing trees that work very much like this.

构建自底向上解析器(YACC、Bison、GLR 等)的解析器生成器工具也构建相同风格的 AST 节点.但是,没有一组递归函数;取而代之的是,这些工具可以管理一堆可见的和简化为非终结符的令牌.AST 节点构建在并行堆栈上;当发生归约时,被归约覆盖的堆栈部分上的 AST 节点被组合以产生一个非终端 AST 节点来替换它们.这种情况发生在语法规则的零大小"堆栈段中,这些堆栈段也是空的,导致 AST 节点(通常用于空列表"或缺失选项")似乎无处可见.

Parser generator tools that build bottom-up parsers (YACC, Bison, GLR, ...) also build AST nodes in the same style. However, there is no set of recursive functions; instead, a stack of tokens seen and reduced-to nonterminals is managed by these tools. The AST nodes are constructed on a parallel stack; when a reduction occurs, the AST nodes on the part of the stack covered by the reduction are combined to produce a nonterminal AST node to replace them. This happens with "zero-size" stack segments for grammar rules which are empty too causing AST nodes (typically for 'empty list' or 'missing option') to seemingly appear from nowhere.

使用 bitty 语言,编写构建树的递归下降解析器非常实用.

With bitty languages, writing recursive-descent parsers that build trees is pretty practical.

真正的语言(无论是像 COBOL 那样陈旧陈旧,还是像 Scala 那样炙手可热)的一个问题是语法规则的数量非常多,而且由于语言的复杂性和对任何语言委员会的坚持而变得复杂它不断添加其他语言提供的新优点(语言嫉妒",请参阅 Java、C# 和 C++ 之间的进化竞赛).现在编写递归下降解析器变得一发不可收拾,人们倾向于使用解析器生成器.但即使使用解析器生成器,编写所有自定义代码来构建 AST 节点也是一场大战(我们还没有讨论设计一个好的抽象"语法需要什么,而不是首先想到的东西).随着规模和持续发展,维护语法规则和 AST 构建 goo 变得越来越困难.(如果你的语言成功了,一年之内你会想要改变它).因此,即使编写 AST 构建规则也会变得笨拙.

A problem with real languages (whether old and hoary like COBOL or hot and shiny like Scala) is that the number of grammar rules is pretty large, complicated by the sophistication of the language and the insistence on whatever language committee is in charge of it to perpetually add new goodies offered by other languages ("language envy", see the evolutionary race between Java, C# and C++). Now writing a recursive descent parser gets way out of hand and one tends to use parser generators. But even with a parser generator, writing all the custom code to build AST nodes is also a big battle (and we haven't discussed what it takes to design a good "abstract" syntax vs. the first thing that comes to mind). Maintaining grammar rules and AST building goo gets progressively harder with scale and ongoing evolution. (If your language is successful, within a year you'll want to change it). So even writing the AST building rules gets awkward.

理想情况下,人们只想写一个语法,然后得到一个解析器和树.你可以使用一些最新的解析器生成器来做到这一点:我们的 DMS 软件再造工具包接受完整的上下文无关语法,并自动构建一个 AST,无需语法工程师的工作;自 1995 年以来一直在这样做.ANTLR 人员终于在 2014 年发现了这一点,现在 ANTLR4 提供了这样的选项.

Ideally, one would just like to write a grammar, and get a parser and tree. You can do this with some recent parser generators: Our DMS Software Reengineering Toolkit accepts full context free grammars, and automatically constructs an AST, no work on the grammar engineer's part; its been doing this since 1995. The ANTLR guys finally figured this out in 2014, and ANTLR4 now offers an option like this.

最后一点:拥有解析器(即使使用 AST)几乎无法解决您要解决的实际问题,无论它是什么.它只是一个基础部分,令大多数解析器新手感到震惊,它是操作代码工具的最小部分.谷歌我关于解析后的生活的文章(或查看我的简历)了解更多细节.

Last point: having a parser (even with an AST) is hardly a solution to the actual problem you set out to solve, whatever it was. Its just a foundation piece, and much to the shock for most parser-newbies, it is the smallest part to a tool that manipulates code. Google my essay on Life After Parsing (or check my bio) for more detail.

这篇关于使用标记列表构建抽象语法树的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆