将AST编译回源代码 [英] Compiling an AST back to source code

查看:333
本文介绍了将AST编译回源代码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我目前正在建立一个以PHP编写的PHP Parser,因为没有现有的解析器出现在我的上一个问题解析器本身效果非常好。



现在显然是解析器本身没有什么好(除了静态分析)。我想将转换应用到AST,然后将其编译回源代码。应用转换不是很大的问题,正常的访问者模式应该做。



我目前的问题是,如何编译AST回源。基本上有两种可能性:


  1. 使用某种预定义的方案编译代码

  2. 格式化原始代码并仅应用于更改的节点。

现在我想集中精力于1. 2.看起来很难完成(但如果你有这方面的提示,我想听听他们)。



但我不确定哪种设计模式用于编译代码。我看到实现这个最简单的方法是添加一个 - > compile 方法到所有节点。我在这里看到的缺点是,很难改变生成的输出的格式。为了做到这一点,需要改变节点本身。因此我在寻找一个不同的解决方案。



我听说过Visitor模式也可以用于这个,但我真的不能想象这是怎么回事应该工作。我理解访问者模式,你有一些 NodeTraverser 在所有节点上递归迭代,并调用 - >访问方法的访客。这听起来很有希望的节点操作,其中访问>访问方法可以简单地更改它被传递的节点,但我不知道如何可以使用汇编。一个明显的想法是从叶到根迭代节点树,并用源代码替换被访问的节点。

解决方案

将AST转换为源代码的问题通常被称为prettyprinting。有两个微妙的变化:尽可能多地重新生成与原始文本匹配的文本(我称之为保真打印),以及(nice)prettyprinting,它生成格式正确的文本。和你如何打印事项
取决于编码员是否将工作的重新生成的代码(他们通常希望保真打印)或你的只有
的意图是编译它(在这一点上任何法律prettyprinting是好的)。



要做漂亮的打印需要通常比传统的解析器收集更多的信息,加剧了大多数解析器生成器不支持这个额外的信息收集的事实。我调用解析器收集足够的信息做好这个重新设计解析器。更多细节如下。



漂亮打印的基本方法是通过走AST(访问者模式,如你所述),并基于AST节点内容生成文本。基本的诀窍是:从左到右调用子节点(假设是原始文本的顺序)以生成它们表示的文本,根据AST节点类型适当地插入其他文本。要对一组语句进行美化,可能会有以下伪代码:

  PrettyPrintBlock:
Print({}; PrintNewline();
调用PrettyPrint(Node.children [1]); //在块中打印语句
打印(}); PrintNewline();
return;


PrettyPrintStatements:
do i = 1,number_of_children
调用PrettyPrint(Node.children [i]); Print(;); PrintNewline一个语句
endo
return;

请注意,



您需要管理一些详细信息:




  • 对于表示文字的AST节点,您必须重新生成字面值,如果您想要的答案是准确的,这比看起来难。打印浮点数而不损失任何精度是一个比它看起来更难(科学家讨厌它,当你损害Pi的价值)。对于字符串文字,您必须重新生成引号和字符串文字内容;您必须小心为要转义的字符重新生成转义序列。 PHP双引号字符串文字可能有点困难,因为它们不是由AST中的单个令牌表示。 (我们的 PHP前端(重新设计解析器/ prettyprinter本质上将它们表示为连接字符串片段的表达式


  • 间距:某些语言需要在关键位置使用空格,令牌ABC17 42更好不打印作为ABC1742,但它是确定的令牌(ABC17)打印为(ABC17)。解决这个问题的一种方法是放在一个空间,无论它是合法的,但人们不喜欢的结果:太多的空格。


  • 换行符:允许任意空格的语言在技术上可以重新生成为一行文本。人们讨厌这个,即使你要编译结果;有时你有查看生成的代码,这使它不可能。所以你需要一种方法来引入换行的AST节点代表主要的语言元素(语句,块,方法,类等)。这通常不难;


  • 您会发现,如果您希望用户接受您的结果,你将不得不保留源文本的一些属性,你通常不会认为存储
    对于文字,你可能不得不重新生成文字的基数;输入数字作为十六进制字符的编码器在您重新生成十进制等值时不高兴,即使它意味着完全相同的事情。同样,字符串必须有原始引号;大多数语言允许或作为字符串引用字符,并且人们想要他们最初使用的内容。对于PHP,它引用你使用的事物,并确定字符串中的哪些字符必须被转义
    一些语言允许上或小写关键字(或者甚至缩写)和大写和小写变量名,意味着同一个变量;原始作者通常想要他们的原始外壳。PHP在不同类型的标识符(例如$)中有有趣的字符,会发现它并不总是存在(见字符串中的$变量)。通常人们想要他们的原始布局格式;为此,你必须存储在具体令牌的列号信息,并且有关于什么时候在可能的情况下,使用该列数数据将美丽文本定位在同一列中,以及如果在该列之上填充了超过漂亮的行,该如何处理。


  • 注释:大多数标准解析器(包括使用Zend解析器实现的解析器,我很确定)抛弃注释。再次,人们讨厌这个,并将拒绝一个美丽的答案,其中的评论丢失。这是一些破碎者尝试使用原始文本
    (另一个是复制原始代码布局进行保真打印,如果您没有捕获列号信息)来重新生成代码的主要原因。 IMHO,正确的诀窍是捕获AST中的注释,以便AST转换可以检查/生成注释,但每个人都自己设计选择。




所有这些额外信息由一个良好的重入解析器收集。



一个更有原则的方法可以区分美观的打印,其目的是很好的格式化,从保真打印的目的是重新生成文本以匹配原始源到最大程度。应该清楚的是,在终端的水平,你几乎想要保真打印。根据您的目的,你可以漂亮的打印与好的格式,或保真打印。我们使用的策略是在AST未更改时默认使用保真打印,并且在其中存在漂亮的打印(因为更改机制通常没有关于列号或数字基数的信息等)。转换将新生成的AST节点标记为没有保真数据存在。



一个有组织的漂亮打印方法是理解几乎所有基于文本的编程语言在矩形文本块中很好地呈现。 (Knuth的TeX文档生成器也有这个想法)。如果你有一些代表重新生成的代码片段的文本框(例如,直接为终端标记生成的原始框),那么你可以想象组合这些框的操作符:水平组合(将一个框堆叠在另一个框的右边)垂直(彼此堆叠的堆叠框;这实际上替换打印换行),缩进(具有空白框的水平组合)等。然后,可以通过构建和编写文本框来构造prettyprinter:

  PrettyPrintBlock:
Box1 = PrimitiveBox({); Box2 = PrimitiveBox(});
ChildBox = prettyPrint(Node.children [1]); //为块中的语句获取框
ResultBox = VerticalBox(Box1,Indent(3,ChildBox),Box2);
return ResultBox;

PrettyPrintStatements:
ResultBox = EmptyBox();
do i = 1,number_of_children
ResultBox = VerticalBox(ResultBox,Horizo​​ntalBox(PrettyPrint(Node.children [i]); PrimitiveBox(;)
endo
return;

这里的真正价值是任何节点可以任意顺序组成由其子任意产生的文本框你可以通过这种方式重新排列巨大的文本块(想象一下VBox在方法名顺序中的类的方法)。没有文本出现在遇到;只有当根到达,或者一些AST节点,它是已知所有子项都已正确生成。



我们的 DMS软件重构工具包使用这种方法来打印它可以解析的所有语言(包括PHP,Java,C#等),而不是通过访问者将框计算附加到AST节点,我们将框计算

  • V(....)用于垂直方框
  • I(...)用于缩进框)


  • $ b b

    直接到语法规则,允许我们在一个地方简洁地表达语法(解析器)和prettyprinter(反解析器)。 prettyprinter框规则由DMS自动编译成访问者。 prettyprinter机制必须足够聪明,以了解评论如何发挥到这,坦率地说,有点奥秘,但你只需要做一次。 DMS示例:

      block ='{'statements'}'; - 语法规则识别语句块
    << PrettyPrinter>> ;: {V('{',I(statements),'}'); };

    您可以看到一个更大的例子,说明如何对 Wirth的Oberon编程语言PrettyPrinter ,展示了如何组合语法规则和美感打印规则。



    一个更复杂的方式来做prettyprinting是建立一个语法导向的翻译(意味着,
    以树的顺序遍历树并构建文本或其他数据结构)以在特殊文本框AST中产生文本框。然后,文本框AST由另一个树行走着色,但它的动作基本上是微不足道的:打印文本框。
    请参阅此技术文档:美丽打印用于软件重新设计



    另外一点:你自己可以自己建立这个机器。但同样的原因,你选择使用解析器生成器(它的很多工作,使一个,这项工作不有助于你的目标有趣的方式)是同样的原因,你想选择一个off-the-书架prettyprinter发电机。有很多解析器生成器。没有很多prettyprinter发电机。 [DMS是已经内置的少数几个之一。]


    I'm currently in the process of building a PHP Parser written in PHP, as no existing parser came up in my previous question. The parser itself works fairly well.

    Now obviously a parser by itself does little good (apart from static analysis). I would like to apply transformations to the AST and then compile it back to source code. Applying the transformations isn't much of a problem, a normal Visitor pattern should do.

    What my problem currently is, is how to compile the AST back to source. There are basically two possibilities I see:

    1. Compile the code using some predefined scheme
    2. Keep the formatting of the original code and apply 1. only on Nodes that were changed.

    For now I would like to concentrate on 1. as 2. seems pretty hard to accomplish (but if you got tips concerning that, I would like to hear them).

    But I'm not really sure which design pattern can be used to compile the code. The easiest way I see to implement this, is to add a ->compile method to all Nodes. The drawback I see here, is that it would be pretty hard to change the formatting of the generated output. One would need to change the Nodes itself in order to do that. Thus I'm looking for a different solution.

    I have heard that the Visitor pattern can be used for this, too, but I can't really imagine how this is supposed to work. As I understand the visitor pattern you have some NodeTraverser that iterates recursively over all Nodes and calls a ->visit method of a Visitor. This sounds pretty promising for node manipulation, where the Visitor->visit method could simply change the Node it was passed, but I don't know how it can be used for compilation. An obvious idea would be to iterate the node tree from leaves to root and replace the visited nodes with source code. But this somehow doesn't seem a very clean solution?

    解决方案

    The problem of converting an AST back into source code is generally called "prettyprinting". There are two subtle variations: regenerating the text matching the original as much as possible (I call this "fidelity printing"), and (nice) prettyprinting, which generates nicely formatted text. And how you print matters depending on whether coders will be working on the regenerated code (they often want fidelity printing) or your only intention is to compile it (at which point any legal prettyprinting is fine).

    To do prettyprinting well requires usually more information than a classic parser collects, aggravated by the fact that most parser generators don't support this extra-information collection. I call parsers that collect enough information to do this well "re-engineering parsers". More details below.

    The fundamental way prettyprinting is accomplished is by walking the AST ("Visitor pattern" as you put it), and generating text based on the AST node content. The basic trick is: call children nodes left-to-right (assuming that's the order of the original text) to generate the text they represent, interspersing additional text as appropriate for this AST node type. To prettyprint a block of statements you might have the following psuedocode:

     PrettyPrintBlock:
         Print("{"}; PrintNewline();
         Call PrettyPrint(Node.children[1]); // prints out statements in block
         Print("}"); PrintNewline();
         return;
    
    
     PrettyPrintStatements:
         do i=1,number_of_children
             Call PrettyPrint(Node.children[i]); Print(";"); PrintNewline(); // print one statement
         endo
         return;
    

    Note that this spits out text on the fly as you visit the tree.

    There's a number of details you need to manage:

    • For AST nodes representing literals, you have to regenerate the literal value. This is harder than it looks if you want the answer to be accurate. Printing floating point numbers without losing any precision is a lot harder than it looks (scientists hate it when you damage the value of Pi). For string literals, you have to regenerate the quotes and the string literal content; you have to be careful to regenerate escape sequences for characters that have to be escaped. PHP doubly-quoted string literals may be a bit more difficult, as they are not represented by single tokens in the AST. (Our PHP Front End (a reengineering parser/prettyprinter represents them essentially as an expression that concatenates the string fragments, enabling transformations to be applied inside the string "literal").

    • Spacing: some languages require whitespace in critical places. The tokens ABC17 42 better not be printed as ABC1742, but it is ok for the tokens ( ABC17 ) to be printed as (ABC17). One way to solve this problem is to put a space wherever it is legal, but people won't like the result: too much whitespace. Not an issue if you are only compiling the result.

    • Newlines: languages that allow arbitrary whitespace can technically be regenerated as a single line of text. People hate this, even if you are going to compile the result; sometimes you have to look at the generated code and this makes it impossible. So you need a way to introduce newlines for AST nodes representing major language elements (statements, blocks, methods, classes, etc.). This isn't usually hard; when visiting a node representing such a construct, print out the construct and append a newline.

    • You will discover, if you want users to accept your result, that you will have to preserve some properties of the source text that you wouldn't normally think to store For literals, you may have to regenerate the radix of the literal; coders having entered a number as a hex literal are not happy when you regenerate the decimal equivalent even though it means exactly the same thing. Likewise strings have to have the "original" quotes; most languages allow either " or ' as string quote characters and people want what they used originally. For PHP, which quote you use matters, and determines which characters in the string literal has to be escaped. Some languages allow upper or lower case keywords (or even abbreviations), and upper and lower case variable names meaning the same variable; again the original authors typically want their original casing back. PHP has funny characters in different type of identifiers (e.g., "$") but you'll discover that it isn't always there (see $ variables in literal strings). Often people want their original layout formatting; to do this you have to store at column-number information for concrete tokens, and have prettyprinting rules about when to use that column-number data to position prettyprinted text where in the same column when possible, and what to do if the so-far-prettyprinted line is filled past that column.

    • Comments: Most standard parsers (including the one you implemented using the Zend parser, I'm pretty sure) throw comments away completely. Again, people hate this, and will reject a prettyprinted answer in which comments are lost. This is the principal reason that some prettyprinters attempt to regenerate code by using the original text (the other is to copy the original code layout for fidelity printing, if you didn't capture column-number information). IMHO, the right trick is to capture the comments in the AST, so that AST transformations can inspect/generate comments too, but everybody makes his own design choice.

    All of this "extra" information is collected by a good reenginering parser. Conventional parsers usually don't collect any of it, which makes printing acceptable ASTs difficult.

    A more principled approach distinguishes prettyprinting whose purpose is nice formatting, from fidelity printing whose purpose is to regenerate the text to match the original source to a maximal extent. It should be clear that at the level of the terminals, you pretty much want fidelity printing. Depending on your purpose, you can pretty print with nice formatting, or fidelity printing. A strategy we use is to default to fidelity printing when the AST hasn't been changed, and prettyprinting where it has (because often the change machinery doesn't have any information about column numbers or number radixes, etc.). The transformations stamp the AST nodes that are newly generated as "no fidelity data present".

    An organized approach to prettyprinting nicely is to understand that virtually all text-based programming language are rendered nicely in terms of rectangular blocks of text. (Knuth's TeX document generator has this idea, too). If you have some set of text boxes representing pieces of the regenerated code (e.g., primitive boxes generated directly for the terminal tokens), you can then imagine operators for composing those boxes: Horizontal composition (stack one box to the right of another), Vertical (stack boxes on top of each other; this in effect replaces printing newlines), Indent (Horizontal composition with a box of blanks), etc. Then you can construct your prettyprinter by building and composing text boxes:

     PrettyPrintBlock:
         Box1=PrimitiveBox("{"); Box2=PrimitiveBox("}");
         ChildBox=PrettyPrint(Node.children[1]); // gets box for statements in block
         ResultBox=VerticalBox(Box1,Indent(3,ChildBox),Box2);
         return ResultBox;
    
    PrettyPrintStatements:
         ResultBox=EmptyBox();
         do i=1,number_of_children
             ResultBox=VerticalBox(ResultBox,HorizontalBox(PrettyPrint(Node.children[i]); PrimitiveBox(";")
         endo
         return;
    

    The real value in this is any node can compose the text boxes produced by its children in arbitrary order with arbitrary intervening text. You can rearrange huge blocks of text this way (imagine VBox'ing the methods of class in method-name order). No text is spit out as encountered; only when the root is reached, or some AST node where it is known that all the children boxes have been generated correctly.

    Our DMS Software Reengineering Toolkit uses this approach to prettyprint all the languages it can parse (including PHP, Java, C#, etc.). Instead of attaching the box computations to AST nodes via visitors, we attach the box computations in a domain-specific text-box notation

    • H(...) for Horizontal boxes
    • V(....) for vertical boxes
    • I(...) for indented boxes)

    directly to the grammar rules, allowing us to succinctly express the grammar (parser) and the prettyprinter ("anti-parser") in one place. The prettyprinter box rules are compiled automatically by DMS into a visitor. The prettyprinter machinery has to be smart enough to understand how comments play into this, and that's frankly a bit arcane but you only have to do it once. An DMS example:

    block = '{' statements '}' ; -- grammar rule to recognize block of statements
    <<PrettyPrinter>>: { V('{',I(statements),'}'); };
    

    You can see a bigger example of how this is done for Wirth's Oberon programming language PrettyPrinter showing how grammar rules and prettyprinting rules are combined. The PHP Front End looks like this but its a lot bigger, obviously.

    A more complex way to do prettyprinting is to build a syntax-directed translator (means, walk the tree and build text or other data structures in tree-visted order) to produce text-boxes in a special text-box AST. The text-box AST is then prettyprinted by another tree walk, but the actions for it are basically trivial: print the text boxes. See this technical paper: Pretty-printing for software reengineering

    An additional point: you can of course go build all this machinery yourself. But the same reason that you choose to use a parser generator (its a lot of work to make one, and that work doesn't contribute to your goal in an interesting way) is the same reason you want to choose an off-the-shelf prettyprinter generator. There are lots of parser generators around. Not many prettyprinter generators. [DMS is one of the few that has both built in.]

    这篇关于将AST编译回源代码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆