Lexer 前瞻如何与 ANTLR3 和 ANTLR4 中的贪婪和非贪婪匹配一起使用? [英] How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?
问题描述
如果有人能让我从前瞻关系与涉及贪婪/非贪婪匹配的标记化背后的困惑中清醒过来,我会非常高兴.请注意,这是一篇略长的帖子,因为它遵循了我的思考过程.
If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.
我正在尝试编写允许我匹配输入的 antlr3 语法,例如:
I'm trying to write antlr3 grammar that allows me to match input such as:
标识符关键字"
我想出了一个类似 Antlr 3.4 的语法:
I came up with a grammar like so in Antlr 3.4:
KEYWORD: 'keyword' ;
IDENTIFIER
:
(options {greedy=false;}: (LOWCHAR|HIGHCHAR))+
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
parse: IDENTIFIER KEYWORD EOF;
但是它抱怨它永远无法以这种方式匹配 IDENTIFIER,我真的不明白.(以下替代方案永远无法匹配:1)
however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)
基本上,我试图为尝试匹配 (LOWCHAR|HIGHCHAR) 非贪婪方式的词法分析器指定,以便它在 KEYWORD 前瞻处停止.到目前为止,我所读到的关于 ANTLR 词法分析器的内容应该是词法分析器规则的某种优先级.如果我在词法分析器语法中首先指定 KEYWORD 词法分析器规则,那么后面的任何词法分析器规则都不能匹配使用的字符.
Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.
经过一番搜索,我明白这里的问题是它无法以正确的方式标记输入,因为例如对于输入:identifierkeyword",identifier"部分首先出现,因此它决定在出现时开始匹配 IDENTIFIER 规则还没有匹配的 KEYWORD 标记.
After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.
然后我尝试在 ANTLR 4 中编写相同的语法,以测试新的超前功能是否可以匹配我想要的,它看起来像这样:
Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:
KEYWORD: 'keyword' ;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
IDENTIFIER
:
(LOWCHAR|HIGHCHAR)+?
;
parse: IDENTIFIER KEYWORD EOF;
对于输入:identifierkeyword",它会产生以下错误:第 1:1 行不匹配的输入 'd' 需要 'keyword'
for the input: "identifierkeyword" it produces this error: line 1:1 mismatched input 'd' expecting 'keyword'
它匹配字符 'i'(第一个字符)作为 IDENTIFIER 标记,然后解析器需要一个他没有通过这种方式得到的 KEYWORD 标记.
it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.
词法分析器的非贪婪匹配是否应该匹配,直到有任何其他可能性在前瞻中可用?难道它不应该预见 IDENTIFIER 可以包含 KEYWORD 并以这种方式匹配的可能性吗?
Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?
我对此感到非常困惑,我观看了视频,其中 Terence Parr 介绍了 ANTLR4 的新功能,在该视频中,他谈到了在实际匹配规则的同时观察所有正确"解决方案直到最后的超前线程.我认为它也适用于词法分析器规则,其中标记输入identifierkeyword"的可能正确解决方案是匹配 IDENTIFIER:identifier"和匹配 KEYWORD:keyword"
I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"
我认为我对非贪婪/贪婪匹配有很多误解.有人可以解释一下它是如何工作的吗?
I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?
毕竟我在这里发现了一个类似的问题:ANTLR尝试在更长的标记中匹配标记并制作与之对应的语法:
After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:
parse
:
identifier 'keyword'
;
identifier
:
(HIGHCHAR | LOWCHAR)+
;
/** lowercase letters */
LOWCHAR
: 'a'..'z';
/** uppercase letters */
HIGHCHAR
: 'A'..'Z';
这就是我现在想要的,但是我不明白为什么我不能将标识符规则更改为 Lexer 规则并将 LOWCHAR 和 HIGHCHAR 更改为片段.词法分析器不知道关键字"中的字母可以作为标识符匹配吗?或相反亦然?或者可能是规则仅被定义为在其内部具有前瞻性,而不是所有可能的匹配语法?
This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments. A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?
推荐答案
在 ANTLR 3 和 ANTLR 4 中解决此问题的最简单方法是只允许 IDENTIFIER
匹配单个输入字符,并且然后创建一个解析器规则来处理这些字符的序列.
The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER
to match a single input character, and then create a parser rule to handle sequences of these characters.
identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;
这会导致词法分析器将输入 identifier
作为 10 个单独的字符跳过,然后将 keyword
作为单个 KEYWORD
标记读取.
This would cause the lexer to skip the input identifier
as 10 separate characters, and then read keyword
as a single KEYWORD
token.
您在 ANTLR 4 中使用非贪婪运算符 +?
观察到的行为与此类似.这个操作符说匹配尽可能少的 (HIGHCHAR|LOWCHAR)
块,同时仍然创建一个 IDENTIFIER
令牌".显然,创建令牌的最少数字是一个,因此这实际上是编写 IDENTIFIER
以匹配单个字符的一种非常低效的方式.parse
规则未能处理这个问题的原因是它只允许一个 IDENTIFIER
标记出现在 KEYWORD
标记之前.通过像我上面展示的那样创建解析器规则 identifier
,解析器将能够将 IDENTIFIER
标记序列(每个标记为单个字符)视为单个标识符.
The behavior you observed in ANTLR 4 using the non-greedy operator +?
is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR)
blocks as possible while still creating an IDENTIFIER
token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER
to match a single character. The reason the parse
rule failed to handle this is it only allows a single IDENTIFIER
token to appear before the KEYWORD
token. By creating a parser rule identifier
like I showed above, the parser would be able to treat sequences of IDENTIFIER
tokens (which are each a single character), as a single identifier.
您在 ANTLR 3 中收到消息以下替代方案永远无法匹配..."的原因是静态分析已确定规则 IDENTIFIER 中的正闭包
永远不会匹配多于 1 个字符,因为规则总是在 恰好 1 个字符时成功.
The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER
will never match more than 1 character because the rule will always be successful with exactly 1 character.
这篇关于Lexer 前瞻如何与 ANTLR3 和 ANTLR4 中的贪婪和非贪婪匹配一起使用?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!