Representing an Abstract Syntax Tree in C

Question

I'm implementing a compiler for a simple toy language in C. I have a working scanner and parser, and a reasonable background on the conceptual function/construction of an AST. My question is related to the specific way to represent an AST in C. I've come across three styles pretty frequently in different texts/resources online:

One struct per type of node.

This has a base node "class" (a struct) that is the first field in all the child structs. The base node contains an enum that stores the type of node (constant, binary operator, assignment, etc.). Members of the struct are accessed using a set of macros, with one set per struct. It looks something like this:

struct ast_node_base {
    enum {CONSTANT, ADD, SUB, ASSIGNMENT} class;
};

/* The base is embedded by value as the first member, so a pointer to any
   node can be converted to a pointer to its base and back again. */
struct ast_node_constant {
    struct ast_node_base base;
    int value;
};

struct ast_node_add {
    struct ast_node_base base;
    struct ast_node_base *left;
    struct ast_node_base *right;
};

struct ast_node_assign {
    struct ast_node_base base;
    struct ast_node_base *left;
    struct ast_node_base *right;
};

/* Accessor macros: fully parenthesized, with no trailing semicolon,
   so they can be used inside larger expressions. */
#define CLASS(node) (((struct ast_node_base *)(node))->class)

#define ADD_LEFT(node) (((struct ast_node_add *)(node))->left)
#define ADD_RIGHT(node) (((struct ast_node_add *)(node))->right)

#define ASSIGN_LEFT(node) (((struct ast_node_assign *)(node))->left)
#define ASSIGN_RIGHT(node) (((struct ast_node_assign *)(node))->right)
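As a rough illustration of how this first style is used in practice, here is a minimal sketch of node construction and a recursive evaluator. It assumes the base struct is embedded by value as the first member, which is what makes the casts valid. The names `make_constant`, `make_add` and `eval` are made up for this example and are not part of the question.

```c
#include <assert.h>
#include <stdlib.h>

enum ast_class { CONSTANT, ADD };

struct ast_node_base {
    enum ast_class class;
};

struct ast_node_constant {
    struct ast_node_base base;   /* must stay the first member */
    int value;
};

struct ast_node_add {
    struct ast_node_base base;   /* must stay the first member */
    struct ast_node_base *left;
    struct ast_node_base *right;
};

/* Hypothetical constructors: allocate a concrete node, hand back its base. */
struct ast_node_base *make_constant(int value) {
    struct ast_node_constant *n = malloc(sizeof *n);
    assert(n != NULL);
    n->base.class = CONSTANT;
    n->value = value;
    return &n->base;
}

struct ast_node_base *make_add(struct ast_node_base *left,
                               struct ast_node_base *right) {
    struct ast_node_add *n = malloc(sizeof *n);
    assert(n != NULL);
    n->base.class = ADD;
    n->left = left;
    n->right = right;
    return &n->base;
}

/* Every case has to cast the base pointer back to the concrete type. */
int eval(const struct ast_node_base *node) {
    switch (node->class) {
    case CONSTANT:
        return ((const struct ast_node_constant *)node)->value;
    case ADD: {
        const struct ast_node_add *add = (const struct ast_node_add *)node;
        return eval(add->left) + eval(add->right);
    }
    }
    assert(!"unknown node class");
    return 0;
}
```

The compiler accepts any of these casts, so a wrong cast in a `case` branch compiles cleanly. The correctness of each cast depends entirely on the `class` tag having been set properly when the node was built.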

One struct per layout of node.

This appears to be mostly the same as the above layout, except that instead of having ast_node_add and ast_node_assign it would have a single ast_node_binary to represent both, because the two structs have the same layout and differ only in the contents of base->class. The advantage seems to be a more uniform set of macros (LEFT(node) for all nodes with a left and a right side, instead of one pair of macros per node type), but the disadvantage seems to be that C type checking won't be as useful (there would be no way to detect an ast_node_assign where there should only be an ast_node_add, for example).
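A minimal sketch of this second style might look like the following. The `is_binary` and `as_binary` helpers and the `count_nodes` traversal are assumptions added for illustration. They show that the check the compiler can no longer do has to become a runtime check on the tag.

```c
#include <assert.h>
#include <stddef.h>

enum ast_class { CONSTANT, ADD, SUB, ASSIGNMENT };

struct ast_node_base {
    enum ast_class class;
};

/* One struct shared by every node kind that has a left and a right side. */
struct ast_node_binary {
    struct ast_node_base base;   /* must stay the first member */
    struct ast_node_base *left;
    struct ast_node_base *right;
};

static int is_binary(const struct ast_node_base *node) {
    return node->class == ADD || node->class == SUB ||
           node->class == ASSIGNMENT;
}

/* Hypothetical checked downcast: the type system cannot tell an ADD from
   an ASSIGNMENT here, so the only available check is on the tag. */
static struct ast_node_binary *as_binary(struct ast_node_base *node) {
    assert(is_binary(node));
    return (struct ast_node_binary *)node;
}

/* A single pair of accessors serves all binary node kinds. */
#define CLASS(node) (((struct ast_node_base *)(node))->class)
#define LEFT(node)  (as_binary((struct ast_node_base *)(node))->left)
#define RIGHT(node) (as_binary((struct ast_node_base *)(node))->right)

/* Counts the nodes in a tree; leaves are anything that is not binary. */
int count_nodes(struct ast_node_base *node) {
    if (node == NULL)
        return 0;
    if (!is_binary(node))
        return 1;
    return 1 + count_nodes(LEFT(node)) + count_nodes(RIGHT(node));
}
```

With this layout, a function meant only for additions still has to inspect `CLASS(node)` itself, since its parameter type would be the same `struct ast_node_binary *` that an assignment uses.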

One struct total, with a union to hold the different types of node data.

A better explanation of this than I can give can be found here. Using the types from the previous example it would look like:

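struct ast_node {
  enum { CONSTANT, ADD, SUB, ASSIGNMENT } class;
  union {                          /* anonymous union (C11) */
    int value;                     /* used when class == CONSTANT */
    struct {
      struct ast_node *left;
      struct ast_node *right;
    } op;                          /* used by the binary node classes */
  };
};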

I'm inclined to like the third option the most because it makes recursive traversal much easier (lots of pointer casting is avoided in favor of the union), but it also doesn't take advantage of C type checking. The first option seems the most dangerous in that it relies on casting pointers to structs to access the members of any node (even different members of the same node require different casts to access, e.g. base vs. left), but these casts are type checked, so that might be moot. The second option seems to me like the worst of both worlds, although maybe I'm missing something.
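To make the point about recursive traversal concrete, here is a small sketch of an evaluator over the union layout. It assumes the union is given a member name (`u`), and it handles only constants, addition and subtraction. The helper names `new_constant`, `new_binary`, `eval` and `free_tree` are made up for this example.

```c
#include <assert.h>
#include <stdlib.h>

enum ast_class { CONSTANT, ADD, SUB };

struct ast_node {
    enum ast_class class;
    union {
        int value;                      /* valid when class == CONSTANT */
        struct {
            struct ast_node *left;
            struct ast_node *right;
        } op;                           /* valid when class is ADD or SUB */
    } u;
};

struct ast_node *new_constant(int value) {
    struct ast_node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->class = CONSTANT;
    n->u.value = value;
    return n;
}

struct ast_node *new_binary(enum ast_class class,
                            struct ast_node *left, struct ast_node *right) {
    struct ast_node *n = malloc(sizeof *n);
    assert(n != NULL);
    assert(class == ADD || class == SUB);
    n->class = class;
    n->u.op.left = left;
    n->u.op.right = right;
    return n;
}

/* No casts are needed: every node has the same static type. */
int eval(const struct ast_node *n) {
    switch (n->class) {
    case CONSTANT: return n->u.value;
    case ADD:      return eval(n->u.op.left) + eval(n->u.op.right);
    case SUB:      return eval(n->u.op.left) - eval(n->u.op.right);
    }
    assert(!"unknown node class");
    return 0;
}

void free_tree(struct ast_node *n) {
    if (n == NULL)
        return;
    if (n->class != CONSTANT) {
        free_tree(n->u.op.left);
        free_tree(n->u.op.right);
    }
    free(n);
}
```

The trade-off mentioned above still applies: nothing stops code from reading `u.op.left` on a CONSTANT node, so the `class` tag has to be checked before touching the union.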

Which of these three schemes is the best, and why? Is there a better fourth option I haven't come across yet? I'm assuming none of them is a "one size fits all" solution, so if it matters, the language I'm implementing is a statically typed imperative language, almost a small subset of C.

I also have a specific question about the third (union) layout: if I use only the value field, will there be empty space following the value to accommodate the possibility of op being written to?

Recommended answer

You can make any of these work.

I prefer the union layout, because then all nodes have "the same" layout.
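One consequence of every node sharing one layout is that each node is as large as its biggest variant, which also bears on the question's side point about unused space. In C a union is at least as large as its largest member, so a node that stores only `value` still reserves room for `op`. The sketch below measures this. The names `ast_payload`, `payload_size` and `unused_bytes_in_constant` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct ast_node;   /* forward declaration: only pointers to it are stored */

union ast_payload {
    int value;
    struct {
        struct ast_node *left;
        struct ast_node *right;
    } op;
};

/* Size of the payload that every node carries, whatever its class. */
size_t payload_size(void) {
    return sizeof(union ast_payload);
}

/* Bytes of the payload that a constant node leaves unused. */
size_t unused_bytes_in_constant(void) {
    assert(sizeof(union ast_payload) >= sizeof(int));
    return sizeof(union ast_payload) - sizeof(int);
}
```

On many 64-bit platforms the two pointers take 16 bytes and the `int` takes 4, so a constant would leave 12 bytes of its payload unused. The exact figures depend on the platform, so it is safer to measure them than to assume them. For a toy compiler this overhead is unlikely to matter.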

[You may find it useful to have a "child sublist" option, e.g., an arbitrarily big, dynamic array of children, instead of having left- or right-leaning lists.]
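The "child sublist" idea could be sketched roughly as follows: each node owns a growable array of child pointers, which suits constructs such as statement blocks or argument lists. The field names, the doubling growth policy and the helpers `node_new`, `node_add_child` and `node_free` are all assumptions of this example, not something the answer specifies.

```c
#include <assert.h>
#include <stdlib.h>

struct ast_node {
    int class;                   /* node kind; plain int keeps the sketch short */
    int value;                   /* payload for leaf nodes */
    size_t child_count;          /* number of children in use */
    size_t child_capacity;       /* number of slots allocated */
    struct ast_node **children;  /* dynamic array of child pointers */
};

struct ast_node *node_new(int class, int value) {
    struct ast_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->class = class;
    n->value = value;
    n->child_count = 0;
    n->child_capacity = 0;
    n->children = NULL;
    return n;
}

/* Appends child to parent. Returns 0 on success, -1 if allocation fails. */
int node_add_child(struct ast_node *parent, struct ast_node *child) {
    assert(parent != NULL && child != NULL);
    if (parent->child_count == parent->child_capacity) {
        size_t new_cap = parent->child_capacity ? parent->child_capacity * 2 : 4;
        struct ast_node **grown =
            realloc(parent->children, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;           /* the old array is still valid */
        parent->children = grown;
        parent->child_capacity = new_cap;
    }
    parent->children[parent->child_count++] = child;
    return 0;
}

/* Frees a node together with all of its descendants. */
void node_free(struct ast_node *n) {
    if (n == NULL)
        return;
    for (size_t i = 0; i < n->child_count; i++)
        node_free(n->children[i]);
    free(n->children);
    free(n);
}
```

A block with five statements is then one node with five children, rather than a chain of nested binary "sequence" nodes.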

You are going to find that this issue isn't the one that makes building your compiler hard. Rather, it is having symbol tables, performing various kinds of analyses, choosing a machine-level IR, building a code generator, and doing code optimizations. Then you're going to encounter real users and you'll discover what you really did wrong :-}

I'd pick one and run with it, so that you have a chance to get near the other issues.
