How does lexer lookahead work with greedy and non-greedy matching in ANTLR 3 and ANTLR 4?
Question
I'd be more than glad if someone could clear up my confusion about how lookahead relates to tokenizing with greedy/non-greedy matching. Be warned: this is a fairly long post, because it follows my thought process.
I'm trying to write an ANTLR 3 grammar that allows me to match input such as:
"identifierkeyword"
I came up with a grammar like this in ANTLR 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
However, it complains that it can never match IDENTIFIER this way, which I don't really understand. (The following alternatives can never be matched: 1)
Basically, I was trying to tell the lexer to match (LOWCHAR|HIGHCHAR) in a non-greedy way so that it stops when KEYWORD appears in the lookahead. From what I've read so far about ANTLR lexers, there is supposed to be some kind of precedence among the lexer rules: if I specify the KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after it shouldn't be able to match the characters it consumed.
After some searching, I understand that the problem here is that it can't tokenize the input the right way. For the input "identifierkeyword", for example, the "identifier" part comes first, so the lexer decides to start matching the IDENTIFIER rule when no KEYWORD token has been matched yet.
Then I tried to write the same grammar in ANTLR 4, to test whether the new run-ahead capabilities can match what I want. It looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
For the input "identifierkeyword" it produces this error: line 1:1 mismatched input 'd' expecting 'keyword'
It matches the character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token, which it doesn't get this way.
Isn't non-greedy matching in the lexer supposed to keep matching until some other possibility becomes available in the lookahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
I'm really confused about this. I have watched the video where Terence Parr introduces the new capabilities of ANTLR 4, in which he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for lexer rules too, where a possible right solution for tokenizing the input "identifierkeyword" is matching IDENTIFIER: "identifier" and then KEYWORD: "keyword".
I think I have a lot of misconceptions about non-greedy/greedy matching. Could somebody please explain to me how it works?
After all this, I found a similar question here, "ANTLR trying to match token within longer token", and made a grammar corresponding to it:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
This does what I want now; however, I can't see why I can't change the identifier rule to a lexer rule and LOWCHAR and HIGHCHAR to fragments. Doesn't the lexer know that the letters in "keyword" can be matched as an identifier, or vice versa? Or is it that rules are only defined to have lookahead inside themselves, not across all possible matching syntaxes?
Recommended answer
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
This would cause the lexer to read the input identifier as 10 separate single-character IDENTIFIER tokens, and then read keyword as a single KEYWORD token.
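The tokenization described above can be mimicked with a small Python sketch. This is only a toy model and not ANTLR's actual algorithm: it assumes the lexer picks the longest match at each position and prefers the earlier rule on a tie, and the names RULES and tokenize are made up for illustration.

```python
import re

# Toy stand-ins for the two lexer rules (hypothetical model, not ANTLR code).
RULES = [
    ("KEYWORD", re.compile(r"keyword")),
    ("IDENTIFIER", re.compile(r"[A-Za-z]")),  # exactly one letter per token
]

def tokenize(text):
    """Split text into (rule name, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Keep the longest match; on a tie the earlier rule wins.
            if m and (best is None or len(m.group()) > len(best[1])):
                best = (name, m.group())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append(best)
        pos += len(best[1])
    return tokens

tokens = tokenize("identifierkeyword")
for name, lexeme in tokens:
    print(name, lexeme)
```

Under these assumptions the run yields ten one-letter IDENTIFIER entries followed by a single KEYWORD entry, which is the token stream the parser rule identifier : IDENTIFIER+ is meant to consume.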
The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number needed to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is that it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like the one shown above, the parser is able to treat a sequence of IDENTIFIER tokens (each of which is a single character) as a single identifier.
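The same "as few as possible" effect can be seen with lazy quantifiers in ordinary regular expressions. The sketch below uses Python's re module purely as an analogy; ANTLR's lexer is not a regex engine, but a non-greedy loop with nothing after it stops at the minimum in both cases.

```python
import re

text = "identifierkeyword"

# Greedy closure: consumes every letter it can, including "keyword".
greedy = re.match(r"[A-Za-z]+", text).group()

# Non-greedy closure with nothing following it: one letter already
# satisfies the pattern, so the match stops there.
non_greedy = re.match(r"[A-Za-z]+?", text).group()

print(greedy)      # identifierkeyword
print(non_greedy)  # i
```

Neither form stops just before "keyword" on its own: the greedy loop swallows the keyword, and the non-greedy loop stops after the first letter, which matches the 'i' token reported in the question.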
Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is that static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character, because the rule will always succeed with exactly 1 character.