将 AST 编译回源代码 [英] Compiling an AST back to source code

查看:29
本文介绍了将 AST 编译回源代码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在构建一个用 PHP 编写的 PHP 解析器,因为 我之前的问题.解析器本身运行良好.

I'm currently in the process of building a PHP Parser written in PHP, as no existing parser came up in my previous question. The parser itself works fairly well.

现在显然解析器本身没什么用(除了静态分析).我想对 AST 应用转换,然后将其编译回源代码.应用转换不是什么大问题,正常的访问者模式应该可以.

Now obviously a parser by itself does little good (apart from static analysis). I would like to apply transformations to the AST and then compile it back to source code. Applying the transformations isn't much of a problem, a normal Visitor pattern should do.

我目前的问题是如何将 AST 编译回源代码.我看到基本上有两种可能性:

What my problem currently is, is how to compile the AST back to source. There are basically two possibilities I see:

  1. 使用一些预定义的方案编译代码
  2. 保留原始代码的格式并仅在已更改的节点上应用 1.

现在我想专注于 1. 因为 2. 似乎很难完成(但如果你有这方面的提示,我想听听).

For now I would like to concentrate on 1. as 2. seems pretty hard to accomplish (but if you got tips concerning that, I would like to hear them).

但我不确定可以使用哪种设计模式来编译代码.我认为实现这一点的最简单方法是向所有节点添加一个 ->compile 方法.我在这里看到的缺点是很难更改生成的输出的格式.为了做到这一点,人们需要更改节点本身.因此,我正在寻找不同的解决方案.

But I'm not really sure which design pattern can be used to compile the code. The easiest way I see to implement this, is to add a ->compile method to all Nodes. The drawback I see here, is that it would be pretty hard to change the formatting of the generated output. One would need to change the Nodes itself in order to do that. Thus I'm looking for a different solution.

我听说访问者模式也可以用于此,但我无法想象这应该如何工作.据我了解访问者模式,您有一些 NodeTraverser 递归遍历所有节点并调用 Visitor->visit 方法.这对于节点操作听起来很有希望,其中 Visitor->visit 方法可以简单地更改它传递的节点,但我不知道它如何用于编译.一个明显的想法是从叶子到根迭代节点树,并用源代码替换访问过的节点.但这不知何故似乎不是一个非常干净的解决方案?

I have heard that the Visitor pattern can be used for this, too, but I can't really imagine how this is supposed to work. As I understand the visitor pattern you have some NodeTraverser that iterates recursively over all Nodes and calls a ->visit method of a Visitor. This sounds pretty promising for node manipulation, where the Visitor->visit method could simply change the Node it was passed, but I don't know how it can be used for compilation. An obvious idea would be to iterate the node tree from leaves to root and replace the visited nodes with source code. But this somehow doesn't seem a very clean solution?

推荐答案

将 AST 转换回源代码的问题通常称为漂亮打印".有两种微妙的变化:重新生成尽可能匹配原始文本的文本(我称之为保真打印"),以及(漂亮的)漂亮打印,它生成格式良好的文本.以及如何打印很重要取决于编码人员是在处理重新生成的代码(他们通常需要保真打印)还是您唯一的目的是编译它(此时任何合法的漂亮打印都可以).

The problem of converting an AST back into source code is generally called "prettyprinting". There are two subtle variations: regenerating the text matching the original as much as possible (I call this "fidelity printing"), and (nice) prettyprinting, which generates nicely formatted text. And how you print matters depending on whether coders will be working on the regenerated code (they often want fidelity printing) or your only intention is to compile it (at which point any legal prettyprinting is fine).

要做好漂亮的打印通常需要比经典解析器收集的更多信息,而大多数解析器生成器不支持这种额外信息收集的事实使情况变得更糟.我将收集足够信息的解析器称为重新设计解析器".更多详情如下.

To do prettyprinting well requires usually more information than a classic parser collects, aggravated by the fact that most parser generators don't support this extra-information collection. I call parsers that collect enough information to do this well "re-engineering parsers". More details below.

完成漂亮打印的基本方法是遍历 AST(如您所说的访问者模式"),并根据 AST 节点内容生成文本.基本技巧是:从左到右调用子节点(假设这是原始文本的顺序)以生成它们表示的文本,并根据此 AST 节点类型穿插其他文本.要漂亮地打印语句块,您可能具有以下伪代码:

The fundamental way prettyprinting is accomplished is by walking the AST ("Visitor pattern" as you put it), and generating text based on the AST node content. The basic trick is: call children nodes left-to-right (assuming that's the order of the original text) to generate the text they represent, interspersing additional text as appropriate for this AST node type. To prettyprint a block of statements you might have the following psuedocode:

 PrettyPrintBlock:
     Print("{"}; PrintNewline();
     Call PrettyPrint(Node.children[1]); // prints out statements in block
     Print("}"); PrintNewline();
     return;


 PrettyPrintStatements:
     do i=1,number_of_children
         Call PrettyPrint(Node.children[i]); Print(";"); PrintNewline(); // print one statement
     endo
     return;

请注意,这会在您访问树时即时吐出文本.

Note that this spits out text on the fly as you visit the tree.

您需要管理许多细节:

  • 对于表示文字的 AST 节点,您必须重新生成文字值.如果您希望答案准确,这比看起来更难.在不损失任何精度的情况下打印浮点数比看起来要困难很多(当你破坏 Pi 的值时,科学家们讨厌它).对于字符串文字,您必须重新生成引号和字符串文字内容;您必须小心为必须转义的字符重新生成转义序列.PHP 双引号字符串文字可能有点困难,因为它们在 AST 中不是由单个标记表示的.(我们的 PHP 前端(重新设计的解析器/prettyprinter 将它们本质上表示为一个表达式连接字符串片段,从而能够在字符串literal"内应用转换).

  • For AST nodes representing literals, you have to regenerate the literal value. This is harder than it looks if you want the answer to be accurate. Printing floating point numbers without losing any precision is a lot harder than it looks (scientists hate it when you damage the value of Pi). For string literals, you have to regenerate the quotes and the string literal content; you have to be careful to regenerate escape sequences for characters that have to be escaped. PHP doubly-quoted string literals may be a bit more difficult, as they are not represented by single tokens in the AST. (Our PHP Front End (a reengineering parser/prettyprinter represents them essentially as an expression that concatenates the string fragments, enabling transformations to be applied inside the string "literal").

间距:某些语言在关键位置需要空格.标记 ABC17 42 最好不要打印为 ABC1742,但是标记 (ABC17) 可以打印为 (ABC17).解决这个问题的一种方法是在合法的地方放置一个空格,但人们不会喜欢这样的结果:太多的空格.如果您只编译结果,这不是问题.

Spacing: some languages require whitespace in critical places. The tokens ABC17 42 better not be printed as ABC1742, but it is ok for the tokens ( ABC17 ) to be printed as (ABC17). One way to solve this problem is to put a space wherever it is legal, but people won't like the result: too much whitespace. Not an issue if you are only compiling the result.

换行符:允许任意空格的语言在技术上可以重新生成为一行文本.人们讨厌这个,即使你要编译结果;有时您必须查看生成的代码,这使得它变得不可能.因此,您需要一种方法来为代表主要语言元素(语句、块、方法、类等)的 AST 节点引入换行符.这通常并不难;当访问代表这种构造的节点时,打印出构造并附加一个换行符.

Newlines: languages that allow arbitrary whitespace can technically be regenerated as a single line of text. People hate this, even if you are going to compile the result; sometimes you have to look at the generated code and this makes it impossible. So you need a way to introduce newlines for AST nodes representing major language elements (statements, blocks, methods, classes, etc.). This isn't usually hard; when visiting a node representing such a construct, print out the construct and append a newline.

你会发现,如果你想让用户接受你的结果,你将不得不保留一些你通常不会想到存储的源文本的属性对于文字,您可能需要重新生成文字的基数;将数字作为十六进制文字输入的编码员在重新生成十进制等价物时并不满意,即使它的含义完全相同.同样,字符串必须具有原始"引号;大多数语言允许 " 或 ' 作为字符串引号字符,人们想要他们最初使用的.对于 PHP,您使用哪个引号很重要,并确定字符串文字中的哪些字符必须被转义.某些语言允许大写或小写关键字(甚至缩写),以及大写和小写的变量名表示相同的变量;再次,原始作者通常希望返回原始外壳.PHP 在不同类型的标识符(例如,$")中有有趣的字符,但您会发现它并不总是存在(参见文本字符串中的 $ 变量).通常人们想要他们原来的布局格式;要做到这一点,你必须存储具体标记的列号信息,并有关于何时使用该列号数据在可能的情况下将漂亮打印的文本定位在同一列中的漂亮打印规则,以及如果到目前为止该怎么做- 漂亮的打印行在该列之后被填充.

You will discover, if you want users to accept your result, that you will have to preserve some properties of the source text that you wouldn't normally think to store For literals, you may have to regenerate the radix of the literal; coders having entered a number as a hex literal are not happy when you regenerate the decimal equivalent even though it means exactly the same thing. Likewise strings have to have the "original" quotes; most languages allow either " or ' as string quote characters and people want what they used originally. For PHP, which quote you use matters, and determines which characters in the string literal has to be escaped. Some languages allow upper or lower case keywords (or even abbreviations), and upper and lower case variable names meaning the same variable; again the original authors typically want their original casing back. PHP has funny characters in different type of identifiers (e.g., "$") but you'll discover that it isn't always there (see $ variables in literal strings). Often people want their original layout formatting; to do this you have to store at column-number information for concrete tokens, and have prettyprinting rules about when to use that column-number data to position prettyprinted text where in the same column when possible, and what to do if the so-far-prettyprinted line is filled past that column.

评论:大多数标准解析器(包括您使用 Zend 解析器实现的,我很确定)完全丢弃了注释.同样,人们讨厌这一点,并且会拒绝一个漂亮的答案,其中评论丢失了.这是一些漂亮的打印机试图通过使用原始文本重新生成代码的主要原因(另一种是复制原始代码布局以进行保真打印,如果您没有捕获列号信息).恕我直言,正确的技巧是捕获 AST 中的注释,以便 AST 转换也可以检查/生成注释,但每个人都有自己的设计选择.

Comments: Most standard parsers (including the one you implemented using the Zend parser, I'm pretty sure) throw comments away completely. Again, people hate this, and will reject a prettyprinted answer in which comments are lost. This is the principal reason that some prettyprinters attempt to regenerate code by using the original text (the other is to copy the original code layout for fidelity printing, if you didn't capture column-number information). IMHO, the right trick is to capture the comments in the AST, so that AST transformations can inspect/generate comments too, but everybody makes his own design choice.

所有这些额外"信息都是由一个好的重新设计解析器收集的.传统的解析器通常不会收集任何它,这使得打印可接受的 AST 变得困难.

All of this "extra" information is collected by a good reenginering parser. Conventional parsers usually don't collect any of it, which makes printing acceptable ASTs difficult.

一种更有原则的方法区分了以漂亮的格式为目的的漂亮打印,与旨在最大程度地重新生成文本以匹配原始来源的保真打印.应该清楚的是,在终端级别,您非常需要保真打印.根据您的目的,您可以使用漂亮的格式或保真打印进行漂亮的打印.我们使用的一种策略是在 AST 未更改时默认为保真打印,并在已更改的位置进行漂亮打印(因为更改机制通常没有关于列号或数字基数等的任何信息).转换将新生成的 AST 节点标记为不存在保真数据".

A more principled approach distinguishes prettyprinting whose purpose is nice formatting, from fidelity printing whose purpose is to regenerate the text to match the original source to a maximal extent. It should be clear that at the level of the terminals, you pretty much want fidelity printing. Depending on your purpose, you can pretty print with nice formatting, or fidelity printing. A strategy we use is to default to fidelity printing when the AST hasn't been changed, and prettyprinting where it has (because often the change machinery doesn't have any information about column numbers or number radixes, etc.). The transformations stamp the AST nodes that are newly generated as "no fidelity data present".

漂亮打印的一种有组织的方法是理解几乎所有基于文本的编程语言都可以根据矩形文本块很好地呈现.(Knuth 的 TeX 文档生成器也有这个想法).如果您有一组文本框代表重新生成的代码片段(例如,直接为终端标记生成的原始框),那么您可以想象组合这些框的运算符: 水平组合(将一个框堆叠到另一个框的右侧),垂直(将框堆叠在一起;这实际上取代了打印换行符)、缩进(带有一盒空白的水平组合)等.然后您可以通过构建和组合文本框来构建您的漂亮打印机:

An organized approach to prettyprinting nicely is to understand that virtually all text-based programming language are rendered nicely in terms of rectangular blocks of text. (Knuth's TeX document generator has this idea, too). If you have some set of text boxes representing pieces of the regenerated code (e.g., primitive boxes generated directly for the terminal tokens), you can then imagine operators for composing those boxes: Horizontal composition (stack one box to the right of another), Vertical (stack boxes on top of each other; this in effect replaces printing newlines), Indent (Horizontal composition with a box of blanks), etc. Then you can construct your prettyprinter by building and composing text boxes:

 PrettyPrintBlock:
     Box1=PrimitiveBox("{"); Box2=PrimitiveBox("}");
     ChildBox=PrettyPrint(Node.children[1]); // gets box for statements in block
     ResultBox=VerticalBox(Box1,Indent(3,ChildBox),Box2);
     return ResultBox;

PrettyPrintStatements:
     ResultBox=EmptyBox();
     do i=1,number_of_children
         ResultBox=VerticalBox(ResultBox,HorizontalBox(PrettyPrint(Node.children[i]); PrimitiveBox(";")
     endo
     return;

其中的真正价值是任何节点都可以将其子节点生成的文本框以任意顺序与任意中间文本组合在一起.您可以通过这种方式重新排列大量文本(想象一下 VBox 以方法名称顺序排列类的方法).遇到时不会吐出任何文字;仅当到达根节点,或已知所有子框都已正确生成的某个 AST 节点时.

The real value in this is any node can compose the text boxes produced by its children in arbitrary order with arbitrary intervening text. You can rearrange huge blocks of text this way (imagine VBox'ing the methods of class in method-name order). No text is spit out as encountered; only when the root is reached, or some AST node where it is known that all the children boxes have been generated correctly.

我们的 DMS Software Reengineering Toolkit 使用这种方法来漂亮打印它的所有语言可以解析(包括PHP、Java、C#等).我们不是通过访问者将框计算附加到 AST 节点,而是将框计算附加在特定于域的文本框符号中

Our DMS Software Reengineering Toolkit uses this approach to prettyprint all the languages it can parse (including PHP, Java, C#, etc.). Instead of attaching the box computations to AST nodes via visitors, we attach the box computations in a domain-specific text-box notation

  • H(...) 用于水平框
  • V(....) 用于垂直框
  • I(...) 表示缩进框)

直接到语法规则,让我们可以在一个地方简洁地表达语法(解析器)和漂亮打印器(反解析器").Prettyprinter 框规则由 DMS 自动编译到访问者中.Prettyprinter 机器必须足够聪明才能理解评论是如何发挥作用的,坦率地说,这有点神秘,但您只需要做一次.DMS 示例:

directly to the grammar rules, allowing us to succinctly express the grammar (parser) and the prettyprinter ("anti-parser") in one place. The prettyprinter box rules are compiled automatically by DMS into a visitor. The prettyprinter machinery has to be smart enough to understand how comments play into this, and that's frankly a bit arcane but you only have to do it once. An DMS example:

block = '{' statements '}' ; -- grammar rule to recognize block of statements
<<PrettyPrinter>>: { V('{',I(statements),'}'); };

您可以看到一个更大的示例,说明如何为 Wirth 的 Oberon 编程语言 PrettyPrinter 显示了语法规则和漂亮打印规则是如何组合的.PHP 前端看起来像这样,但显然要大得多.

You can see a bigger example of how this is done for Wirth's Oberon programming language PrettyPrinter showing how grammar rules and prettyprinting rules are combined. The PHP Front End looks like this but its a lot bigger, obviously.

进行漂亮打印的一种更复杂的方法是构建一个语法导向的翻译器(意味着,遍历树并以树状顺序构建文本或其他数据结构)以在特殊的文本框 AST 中生成文本框.文本框 AST 然后由另一个树遍历漂亮地打印,但它的操作基本上是微不足道的:打印文本框.请参阅此技术论文:Pretty-打印软件再造

A more complex way to do prettyprinting is to build a syntax-directed translator (means, walk the tree and build text or other data structures in tree-visted order) to produce text-boxes in a special text-box AST. The text-box AST is then prettyprinted by another tree walk, but the actions for it are basically trivial: print the text boxes. See this technical paper: Pretty-printing for software reengineering

另外一点:你当然可以自己建造所有这些机器.但是,您选择使用解析器生成器的相同原因(制作一个解析器需要做很多工作,而且这项工作不会以有趣的方式为您的目标做出贡献)与您想要选择一个现成的-货架上的漂亮打印机生成器.周围有很多解析器生成器.没有多少漂亮的打印机生成器.[DMS 是为数不多的两者之一.]

An additional point: you can of course go build all this machinery yourself. But the same reason that you choose to use a parser generator (its a lot of work to make one, and that work doesn't contribute to your goal in an interesting way) is the same reason you want to choose an off-the-shelf prettyprinter generator. There are lots of parser generators around. Not many prettyprinter generators. [DMS is one of the few that has both built in.]

这篇关于将 AST 编译回源代码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆