How does lexer lookahead work with greedy and non-greedy matching in ANTLR3 and ANTLR4?


Question

I'd be more than glad if someone could clear up my confusion about how lookahead relates to tokenizing when greedy/non-greedy matching is involved. Be warned, this is a slightly long post because it follows my thought process.

I'm trying to write an ANTLR 3 grammar that lets me match input such as:

"identifierkeyword"

I came up with the following grammar in ANTLR 3.4:

KEYWORD: 'keyword' ;

IDENTIFIER
: 
  (options {greedy=false;}: (LOWCHAR|HIGHCHAR))+ 
;

/** lowercase letters */
fragment LOWCHAR
:   'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
:   'A'..'Z';

parse: IDENTIFIER KEYWORD EOF;

However, ANTLR complains that IDENTIFIER can never be matched this way, which I don't really understand. (The following alternatives can never be matched: 1)

Basically, I was trying to tell the lexer to match (LOWCHAR|HIGHCHAR) non-greedily so that it stops when KEYWORD appears in the lookahead. From what I've read so far about ANTLR lexers, there is supposed to be some kind of precedence among lexer rules: if I specify the KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after it shouldn't be able to match the characters it consumes.

After some searching, I understand that the problem is that the lexer can't tokenize the input the way I want. For the input "identifierkeyword", the "identifier" part comes first, so the lexer decides to start matching the IDENTIFIER rule before any KEYWORD token has been matched.

Then I tried writing the same grammar in ANTLR 4, to test whether the new run-ahead capabilities can match what I want. It looks like this:

KEYWORD: 'keyword' ;

/** lowercase letters */
fragment LOWCHAR
:   'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
:   'A'..'Z';

IDENTIFIER
: 
  (LOWCHAR|HIGHCHAR)+?
;

parse: IDENTIFIER KEYWORD EOF;

For the input "identifierkeyword" it produces this error: line 1:1 mismatched input 'd' expecting 'keyword'

It matches the character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token, which it doesn't get.

Isn't non-greedy matching in the lexer supposed to keep matching until some other possibility becomes available in the lookahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?

I'm really confused about this. I watched the video in which Terence Parr introduces the new capabilities of ANTLR 4, where he talks about run-ahead threads that track all "right" solutions to the end while actually matching a rule. I thought this would work for lexer rules too, where a possible right solution for tokenizing the input "identifierkeyword" is matching IDENTIFIER: "identifier" and then KEYWORD: "keyword".

I think I have a lot of misconceptions about greedy/non-greedy matching. Could somebody please explain how it works?

After all this, I found a similar question here: ANTLR trying to match token within longer token, and wrote a grammar along the same lines:

parse
:   
  identifier 'keyword'
;

identifier
:   
  (HIGHCHAR | LOWCHAR)+
;

/** lowercase letters */
LOWCHAR
:   'a'..'z';
/** uppercase letters */
HIGHCHAR
:   'A'..'Z';

This now does what I want. However, I can't see why I can't change the identifier rule to a lexer rule and LOWCHAR and HIGHCHAR to fragments. Doesn't the lexer know that the letters in "keyword" can also be matched as an identifier, or vice versa? Or is it that rules are only defined to have lookahead inside themselves, not across all possible matching syntaxes?

Recommended answer

The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to allow IDENTIFIER to match only a single input character, and then create a parser rule to handle sequences of these characters.

identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;

This causes the lexer to tokenize the input identifier as 10 separate single-character IDENTIFIER tokens, and then read keyword as a single KEYWORD token.
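The resulting token stream can be illustrated with a small Python model. This is only a sketch, not ANTLR itself: it assumes ANTLR 4's selection rule (the longest match wins, and on a tie the rule listed first wins), and the names `RULES`, `match_keyword`, `match_identifier`, and `tokenize` are hypothetical helpers invented for the illustration. ANTLR 3's lexer makes its decision differently in detail.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the two lexer rules in the answer.
# Each matcher returns how many characters it can consume at pos (0 = no match).
def match_keyword(text: str, pos: int) -> int:
    # KEYWORD : 'keyword' ;
    return len("keyword") if text.startswith("keyword", pos) else 0

def match_identifier(text: str, pos: int) -> int:
    # IDENTIFIER : HIGHCHAR | LOWCHAR ;  (exactly one letter)
    ch = text[pos]
    return 1 if ("a" <= ch <= "z" or "A" <= ch <= "Z") else 0

# Rules in grammar order; order only matters when two matches have equal length.
RULES: List[Tuple[str, Callable[[str, int], int]]] = [
    ("KEYWORD", match_keyword),
    ("IDENTIFIER", match_identifier),
]

def tokenize(text: str) -> List[Tuple[str, str]]:
    """Assumed model: at each position pick the longest match, first rule on ties."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, matcher in RULES:
            length = matcher(text, pos)
            if length > best_len:  # strict '>' keeps the earlier rule on a tie
                best_name, best_len = name, length
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

tokens = tokenize("identifierkeyword")
for name, lexeme in tokens:
    print(name, lexeme)
```

In this model, every letter of "identifier" becomes its own IDENTIFIER token because KEYWORD cannot match at those positions. At the 'k', KEYWORD matches seven characters against IDENTIFIER's one, so the longer match is chosen.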

The behavior you observed in ANTLR 4 with the non-greedy operator +? is similar to the single-character IDENTIFIER rule above. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest needed to create the token is one, so this was effectively a highly inefficient way of writing an IDENTIFIER rule that matches a single character. The parse rule failed to handle this because it allows only a single IDENTIFIER token to appear before the KEYWORD token. With a parser rule identifier like the one shown above, the parser can treat a sequence of IDENTIFIER tokens (each a single character) as a single identifier.
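The same effect can be reproduced with ordinary regular expressions. The sketch below uses Python's `re` module purely as an analogy, not as a description of ANTLR's internals: a lazy quantifier with nothing after it in the same pattern stops at the minimum, while a lazy quantifier followed by more pattern extends only as far as the rest of that pattern requires. The variable names are hypothetical.

```python
import re

LETTER = r"[a-zA-Z]"
text = "identifierkeyword"

# Greedy: consumes every letter, including the ones that spell "keyword".
greedy = re.match(LETTER + r"+", text).group()

# Lazy with nothing following it: one letter already satisfies the pattern,
# so matching stops there, like IDENTIFIER : (LOWCHAR|HIGHCHAR)+? ;
lazy_alone = re.match(LETTER + r"+?", text).group()

# Lazy followed by more pattern: it grows until the remainder can match.
# This only works because "keyword" is part of the SAME pattern.
lazy_with_tail = re.match(r"(" + LETTER + r"+?)keyword$", text).group(1)

print(greedy)
print(lazy_alone)
print(lazy_with_tail)
```

The third case is what the question was hoping for, but it needs the terminator inside the same pattern. A separate KEYWORD rule does not act as a terminator for the non-greedy loop in IDENTIFIER.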

Edit: The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is that static analysis has determined that the positive closure in the IDENTIFIER rule will never match more than 1 character, because the rule always succeeds with exactly 1 character.
