用 C++ 生成 AST [英] Generate an AST in C++

查看:62
本文介绍了用 C++ 生成 AST的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在用 C++ 制作一个解释器,到目前为止我已经有了我的词法分析器来生成标记.问题是我不确定如何生成遍历"解析树.

I'm making an interpreter in C++, so far I've got my lexer to generate tokens. The problem is I'm not sure how to generate an "walk" a parse tree.

我正在考虑使用一组数组来制作我的解析树,但我不确定如何以正确的顺序将标记实际插入到解析树中.

I was thinking of making my parse tree using an array of arrays, but I'm not sure how to actually insert the tokens into the parse tree in the correct order.

我不确定是自上而下、从左到右还是自下而上、从右到左.

I'm not sure whether or not to go top-down, left-right or bottom-up, right-left.

谁能给我一个简单的 LALR(1) 算法?

Can anyone provide me with a simple LALR(1) algorithm?

推荐答案

我将在这里违背传统观点,并说您应该使用特定于自然语言的数据结构手动构建 AST.

I'm going to go against conventional wisdom here and say that you should build your AST manually with natural language-specific data-structures.

通用的AST 数据结构"过于通用,无法轻松使用——使用 AST 来做任何有用的事情的代码将被访问它想要的数据的变通方法所掩盖.如果您走这条路线(或使用解析器生成器),您可能最终会构建一个翻译层,将通用结构转换为对您的语言真正有意义的 AST.为什么不避开中间人,直接构建最终的数据结构?

A generic "AST data structure" will be too general to work with comfortably -- the code that consumes the AST to do anything useful with it will be obscured by the workarounds to access the data it wants. If you go this route (or use a parser generator), you'll likely end up building a translation layer to go from the generic structure to an AST that actually makes sense for your language. Why not avoid the middleman and build the final data structure directly?

我建议编写第一个句法传递,它使用每个可能的构造所需的标记(可能根据需要提前查看一个标记).这种句法传递将从数据结构的链接实例中构建 AST,这些数据结构表示您的语言中的每个可能的构造.例如,如果您的程序可以由一系列语句组成,其中每个语句可以是函数声明或函数调用,您可以创建以下数据结构:

I suggest writing a first syntactic pass that consumes the tokens it needs for each possible construct (probably peeking ahead one token as needed). This syntactic pass would construct the AST out of linked instances of data structures that represent each possible construct in your language. For example, if your program can consist of a series of statements, where each statement can be either a function declaration or a function call, you could create the following data structures:

AST (struct)
    -> std::vector<Statement*> statements

StatementKind (enum class)
    -> Function
    -> Call

Statement (struct)
    -> StatementKind kind

Function : Statement (struct)
    -> std::string name
    -> std::vector<Statement*> body

Call : Statement (struct)
    -> std::string name
    -> Function* called

然后构建初始 AST 的语法过程可能如下所示:

Then the syntactic pass to build the initial AST might look like:

auto ast = parseAST();

where parseAST 重复调用 parseStatement,它消耗和/或偷看令牌以确定语句是函数定义还是函数调用,然后调用 parseFunctionparseCall 适当.这称为手写递归下降解析器,根据我的经验,这是迄今为止您可以编写的最简单和最好的解析器类型.是的,需要编写样板文件,但代码也非常清晰灵活.

where parseAST calls parseStatement repeatedly, which consumes and/or peeks at tokens to determine whether the statement is a function definition or a function call, then calls parseFunction or parseCall appropriately. This is called a hand-written recursive descent parser, and is in my experience by far the simplest and best type of parser you can write. Yes there is boilerplate to write, but it's also very clear and flexible code.

句法阶段生成语法错误消息,当发现错误时尝试尽可能地恢复(这是分号分隔语言的一个参数——编译器和工具通常可以从错误中恢复更多很容易,因为在尝试解析下一个构造之前遇到错误时通常跳到下一个分号就足够了).

The syntactic phase generates syntax error messages as it goes, attempting to recover as best as it can when an error is found (this is one argument for semicolon-delimited languages -- the compilers and tools can often recover from errors much more easily, since it often suffices to skip to the next semicolon when an error is encountered before trying to parse the next construct).

如果允许在定义之前调用函数,这也很容易处理——只需解析调用部分,而无需检查第一次解析时是否存在具有该名称的函数,然后再关联它.

If a function is allowed to be called before it is defined, this is simple to handle too -- just parse the call part without checking if a function with that name exists at the time you first parse it, then correlate it later.

通过 AST 的第二个语义传递将使用任何缺失的交叉引用数据对其进行注释(例如,使用这些函数的定义解析函数调用的函数名称,或者如果未找到名称则生成错误).一旦完成,你就有了一个完整的 AST,它保证代表一个有效的程序(因为你在这个过程中做了所有的语义错误检查),并且可以对其应用优化,然后是代码生成阶段(以及更多的优化)方式).

A second, semantic, pass through the AST would annotate it with any missing cross-referenced data (e.g. resolving function calls' function names with those functions' definitions, or generating an error if the name is not found). Once that's done, you have a full AST that's guaranteed to represent a valid program (since you did all the semantic error checking in this pass), and can have optimizations applied to it, followed by the code generation phase (and more optimizations along the way).

请注意,我完全没有提及 LL 或 LR 解析器等.您的语言的理论符号和分类很重要(例如,因为它可以证明它是非歧义的),但从实现的角度来看,拥有干净、易于理解且最重要的是易于修改的代码比坚持特定的解析机制重要得多.使用手写解析器的其他编译器示例包括 GCC、Clang 和 Google 的 V8 等.

Note that I completely left out any mention of LL or LR parsers, etc. The theoretical notation and categorization of your language is important (as it can prove it's non-ambiguous, for example), but from an implementation point of view, it's far more important to have clean, easily understood and above all easily modifiable code than to adhere to a specific parsing mechanism. Examples of other compilers that use hand-written parsers include GCC, Clang, and Google's V8, among others.

在效率方面,AST 通常在构建后会迭代多次,并且需要在内存中保留到进程后期(可能一直到最后,取决于您如何进行代码生成),并且因此最好使用对象池为 AST 分配记录,而不是在堆上单独动态分配所有内容.最后,代替字符串,通常最好在原始源代码中使用偏移量和长度.

In terms of efficiency, the AST is generally iterated over several times after it's constructed, and needs to stick around in memory until late in the process (possibly right until the end, depending on how you do your code generation), and so it's best to use an object pool to allocate the records for your AST than to dynamically allocate everything separately on the heap. Finally, in place of strings, it's generally better to use an offset-and-length into the original source code.

这篇关于用 C++ 生成 AST的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆