ANTLR4 lexer not resolving ambiguity in grammar order

Problem description

Using ANTLR 4.2, I'm trying a very simple parse of this test data:

RRV0#ABC

With this minimal grammar:

grammar Tiny;

thing : RRV N HASH ID ;

RRV : 'RRV' ;
N : [0-9]+ ;
HASH : '#' ;
ID : [a-zA-Z0-9]+ ;
WS : [ \t\r\n]+ -> skip ; // match 1-or-more whitespace but discard

I expect the lexer to match RRV before ID, based on the excerpt below from Terence Parr's The Definitive ANTLR 4 Reference:

BEGIN : 'begin' ; // match b-e-g-i-n sequence; ambiguity resolves to BEGIN
ID : [a-z]+ ; // match one or more of any lowercase letter

Running the ANTLR4 test rig with the test data above, the output is

[@0,0:3='RRV0',<4>,1:0]
[@1,4:4='#',<3>,1:4]
[@2,5:7='ABC',<4>,1:5]
[@3,10:9='<EOF>',<-1>,2:0]
line 1:0 mismatched input 'RRV0' expecting 'RRV'

I can see the first token is <4> for ID, with the value 'RRV0'

I have tried rearranging the order of the lexer rules. I have also tried using implicit lexer tokens by matching the literal directly in the parser rule (rather than through an explicit lexer rule). I tried making the matches non-greedy too. None of these worked.

If I change the lexed ID item to not match upper case then the RRV item does match and the parse will get further.

I started in ANTLR 4.1 with the same issue.

I checked in ANTLRWorks and from the command line, with the same result both ways.

How can I change the grammar to match lexer item RRV in preference to ID?

Recommended answer

The grammar order resolution policy only applies when two different lexer rules match the same length of token. When the length differs, the longest one always wins. In your case, the ID rule matches a token with length 4, which is longer than the RRV token that only matches 3 characters.
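
The rule can be illustrated with a small simulation. The sketch below is not ANTLR itself: it is a hypothetical Python helper (`tokenize` and the `RULES` table are names invented here) that mimics the policy described above using the `re` module, assuming each rule is a simple pattern for which a greedy regex match is also the longest match. The whitespace rule is left out because the test input contains none.

```python
import re

# Lexer rules in grammar order, mirroring the Tiny grammar from the question.
# (Hypothetical table for illustration; ANTLR builds its own automaton.)
RULES = [
    ("RRV", r"RRV"),
    ("N", r"[0-9]+"),
    ("HASH", r"#"),
    ("ID", r"[a-zA-Z0-9]+"),
]


def tokenize(text, rules=RULES):
    """Longest match wins; grammar order only breaks ties of equal length."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two lengths are equal.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("RRV0#ABC"))  # [('ID', 'RRV0'), ('HASH', '#'), ('ID', 'ABC')]
print(tokenize("RRV#ABC"))   # [('RRV', 'RRV'), ('HASH', '#'), ('ID', 'ABC')]
```

For `RRV0#ABC` the ID pattern covers four characters while RRV covers three, so ID wins regardless of rule order, which matches the token stream shown in the question. For `RRV#ABC` both patterns cover three characters, and only then does the earlier rule, RRV, take precedence.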

This strategy is especially important in languages like Java. Consider the following input:

String className = "";

Along with the following two grammar rules (slightly simplified):

CLASS : 'class';
ID : [a-zA-Z_] [a-zA-Z0-9_]*;

If we only considered grammar order, then the input className would produce a keyword followed by the identifier Name. Rearranging the rules wouldn't solve the problem because then there would be no way to ever create a CLASS token, even for the input class.
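
The counterfactual can be checked with a short simulation. This is a sketch under stated assumptions, not ANTLR's implementation: `lex` and its `longest_match` flag are hypothetical names, and the two patterns are Python `re` transcriptions of the CLASS and ID rules above.

```python
import re

CLASS = ("CLASS", r"class")
ID = ("ID", r"[a-zA-Z_][a-zA-Z0-9_]*")


def lex(text, rules, longest_match=True):
    """Tokenize with either longest-match or a pure grammar-order policy."""
    tokens, pos = [], 0
    while pos < len(text):
        # Collect every rule that matches at this position, in grammar order.
        candidates = []
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            if m:
                candidates.append((name, m.group()))
        if not candidates:
            raise ValueError(f"no rule matches at position {pos}")
        if longest_match:
            # max() returns the first maximal item, so earlier rules win ties.
            name, lexeme = max(candidates, key=lambda c: len(c[1]))
        else:
            # Hypothetical order-only policy: the first matching rule wins.
            name, lexeme = candidates[0]
        tokens.append((name, lexeme))
        pos += len(lexeme)
    return tokens


print(lex("className", [CLASS, ID]))                       # [('ID', 'className')]
print(lex("class", [CLASS, ID]))                           # [('CLASS', 'class')]
print(lex("className", [CLASS, ID], longest_match=False))  # [('CLASS', 'class'), ('ID', 'Name')]
print(lex("class", [ID, CLASS], longest_match=False))      # [('ID', 'class')]
```

The last two calls show both halves of the argument: under an order-only policy, listing CLASS first splits `className`, and listing ID first makes the CLASS token unreachable.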