ANTLR 4: Should all of these tokens be showing up in the AST?
Question
My ultimate goal is to parse a structured file as a tree of in-memory objects that I can then manipulate. The file format that I'm using is fairly sophisticated with about 200 keywords/tags, and this seemed like a good reason to learn about parser/lexer frameworks.
Unfortunately, there are so many concepts (and hundreds of tutorials and guides) that the learning process so far feels like trying to drink from a fire hose. So I'm taking some very meager baby steps, starting with this example.
I modified the grammar to create the following test, Nano.g4:
grammar Nano;
r : root ;
root : START ROOT ID END ROOT;
START : 'StartBlock' ;
END : 'EndBlock' ;
ROOT : 'RootItem' ;
ID : [a-z]+ ; // match lower-case identifiers
WS : [ \t\r\n]+ -> skip ; // skip spaces, tabs, newlines
Next, I created a simple input file, nano.txt:
StartBlock RootItem
foo
EndBlock RootItem
I then loaded the code using the following commands:
del *.class
del *.java
java org.antlr.v4.Tool Nano.g4
javac Nano*.java
java org.antlr.v4.runtime.misc.TestRig Nano r -gui < nano.txt
This gave me the following result:
The tree above is my first conceptual hangup about what to expect from a lexer and parser. The "StartBlock RootItem" and "EndBlock RootItem" tokens are necessary in terms of making the input file legal, but conceptually I don't need them after I've proven that the file is properly formatted. The only thing that I care about from that point on is that there's a RootItem that contains "foo", as shown here:
Again, I'm painfully new to parser/lexer concepts. Should I (or, is it even possible to) write the grammar so the output tree matches the image above? Or should I take care of that in some subsequent step that traverses the AST and only extracts the relevant data fields?
Recommended answer
ANTLR 4 produces parse trees, not ASTs. This is an important distinction from the behavior of ANTLR 3, and was chosen to help with long-term maintenance of grammars. In particular, situations may arise where users do want access to the tokens, e.g. as part of a semantic highlighting component in an IDE. Rather than force users to write application-specific modified grammars in such a scenario, we chose to always include all tokens in the parse tree.
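Since every token stays in the parse tree, the reduction to "a RootItem that contains foo" would happen in a later pass over that tree. The following is a minimal, dependency-free sketch of such a pass. The classes ParseNode, RootItem, and NanoAstBuilder are hypothetical stand-ins written for this illustration and are not part of the ANTLR runtime; the sample tree is built by hand to mirror what the Nano grammar would produce for nano.txt.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parse-tree node; NOT an ANTLR runtime class.
final class ParseNode {
    final String label;              // rule name ("root") or token type ("ID")
    final String text;               // matched text for tokens, null for rule nodes
    final List<ParseNode> children;  // empty for tokens

    ParseNode(String label, String text, ParseNode... children) {
        this.label = label;
        this.text = text;
        this.children = Arrays.asList(children);
    }
}

// Hypothetical application-level object: keeps only the data we care about.
final class RootItem {
    final String name;

    RootItem(String name) {
        this.name = name;
    }
}

final class NanoAstBuilder {
    // Hand-built tree mirroring r -> root -> START ROOT ID END ROOT for nano.txt.
    static ParseNode sampleParseTree() {
        ParseNode root = new ParseNode("root", null,
                new ParseNode("START", "StartBlock"),
                new ParseNode("ROOT", "RootItem"),
                new ParseNode("ID", "foo"),
                new ParseNode("END", "EndBlock"),
                new ParseNode("ROOT", "RootItem"));
        return new ParseNode("r", null, root);
    }

    // Depth-first walk: find the first "root" rule node and keep only its ID token.
    // Returns null when no such node (or no ID child) exists.
    static RootItem toAst(ParseNode node) {
        if (node.label.equals("root")) {
            for (ParseNode child : node.children) {
                if (child.label.equals("ID")) {
                    return new RootItem(child.text);
                }
            }
            return null;
        }
        for (ParseNode child : node.children) {
            RootItem found = toAst(child);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RootItem item = toAst(sampleParseTree());
        System.out.println("RootItem(" + item.name + ")"); // prints RootItem(foo)
    }
}
```

The structural tokens (StartBlock, EndBlock, RootItem) are still present in the input tree and are simply ignored by the walk. In a real ANTLR 4 project the same idea would typically be expressed with the listener or visitor classes that the tool generates from the grammar, rather than with a hand-rolled node type like the one assumed here.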