Lexer 前瞻如何与 ANTLR3 和 ANTLR4 中的贪婪和非贪婪匹配一起使用? [英] How Lexer lookahead works with greedy and non-greedy matching in ANTLR3 and ANTLR4?

查看:29
本文介绍了Lexer 前瞻如何与 ANTLR3 和 ANTLR4 中的贪婪和非贪婪匹配一起使用?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

如果有人能让我从前瞻关系与涉及贪婪/非贪婪匹配的标记化背后的困惑中清醒过来,我会非常高兴.请注意,这是一篇略长的帖子,因为它遵循了我的思考过程.

If someone would clear my mind from the confusion behind look-ahead relation to tokenizing involving greery/non-greedy matching i'd be more than glad. Be ware this is a slightly long post because it's following my thought process behind.

我正在尝试编写允许我匹配输入的 antlr3 语法,例如:

I'm trying to write antlr3 grammar that allows me to match input such as:

标识符关键字"

我想出了一个类似 Antlr 3.4 的语法:

I came up with a grammar like so in Antlr 3.4:

KEYWORD: 'keyword' ;

IDENTIFIER
: 
  (options {greedy=false;}: (LOWCHAR|HIGHCHAR))+ 
;

/** lowercase letters */
fragment LOWCHAR
:   'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
:   'A'..'Z';

parse: IDENTIFIER KEYWORD EOF;

但是它抱怨它永远无法以这种方式匹配 IDENTIFIER,我真的不明白.(以下替代方案永远无法匹配:1)

however it complains about it can never match IDENTIFIER this way, which i don't really understand. (The following alternatives can never be matched: 1)

基本上,我试图为尝试匹配 (LOWCHAR|HIGHCHAR) 非贪婪方式的词法分析器指定,以便它在 KEYWORD 前瞻处停止.到目前为止,我所读到的关于 ANTLR 词法分析器的内容应该是词法分析器规则的某种优先级.如果我在词法分析器语法中首先指定 KEYWORD 词法分析器规则,那么后面的任何词法分析器规则都不能匹配使用的字符.

Basically I was trying to specify for the lexer that try to match (LOWCHAR|HIGHCHAR) non-greedy way so it stops at KEYWORD lookahead. What i've read so far about ANTLR lexers that there supposed to be some kind of precedence of the lexer rules. If i specify KEYWORD lexer rule first in the lexer grammar, any lexer rules that come after shouldn't be able to match the consumed characters.

经过一番搜索,我明白这里的问题是它无法以正确的方式标记输入,因为例如对于输入:identifierkeyword",identifier"部分首先出现,因此它决定在出现时开始匹配 IDENTIFIER 规则还没有匹配的 KEYWORD 标记.

After some searching I understand that problem here is that it can't tokenize the input the right way because for example for input: "identifierkeyword" the "identifier" part comes first so it decides to start matching the IDENTIFIER rule when there is no KEYWORD tokens matched yet.

然后我尝试在 ANTLR 4 中编写相同的语法,以测试新的超前功能是否可以匹配我想要的,它看起来像这样:

Then I tried to write the same grammar in ANTLR 4, to test if the new run-ahead capabilities can match what i want, it looks like this:

KEYWORD: 'keyword' ;

/** lowercase letters */
fragment LOWCHAR
:   'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
:   'A'..'Z';

IDENTIFIER
: 
  (LOWCHAR|HIGHCHAR)+?
;

parse: IDENTIFIER KEYWORD EOF;

对于输入:identifierkeyword",它会产生以下错误:第 1:1 行不匹配的输入 'd' 需要 'keyword'

for the input: "identifierkeyword" it produces this error: line 1:1 mismatched input 'd' expecting 'keyword'

它匹配字符 'i'(第一个字符)作为 IDENTIFIER 标记,然后解析器需要一个他没有通过这种方式得到的 KEYWORD 标记.

it matches character 'i' (the very first character) as an IDENTIFIER token, and then the parser expects a KEYWORD token which he doesn't get this way.

词法分析器的非贪婪匹配是否应该匹配,直到有任何其他可能性在前瞻中可用?难道它不应该预见 IDENTIFIER 可以包含 KEYWORD 并以这种方式匹配的可能性吗?

Isn't the non-greedy matching for the lexer supposed to match till any other possibility is available in the look ahead? Shouldn't it look ahead for the possibility that an IDENTIFIER can contain a KEYWORD and match it that way?

我对此感到非常困惑,我观看了视频,其中 Terence Parr 介绍了 ANTLR4 的新功能,在该视频中,他谈到了在实际匹配规则的同时观察所有正确"解决方案直到最后的超前线程.我认为它也适用于词法分析器规则,其中标记输入identifierkeyword"的可能正确解决方案是匹配 IDENTIFIER:identifier"和匹配 KEYWORD:keyword"

I'm really confused about this, I have watched the video where Terence Parr introduces the new capabilities of ANTLR4 where he talks about run-ahead threads that watch for all "right" solutions till the end while actually matching a rule. I thought it would work for Lexer rules too, where a possible right solution for tokenizing input "identifierkeyword" is matching IDENTIFIER: "identifier" and matching KEYWORD: "keyword"

我认为我对非贪婪/贪婪匹配有很多误解.有人可以解释一下它是如何工作的吗?

I think I have lots of wrongs in my head about non-greedy/greedy matching. Could somebody please explain me how it works?

毕竟我在这里发现了一个类似的问题:ANTLR尝试在更长的标记中匹配标记并制作与之对应的语法:

After all this I've found a similar question here: ANTLR trying to match token within longer token and made a grammar corresponding to that:

parse
:   
  identifier 'keyword'
;

identifier
:   
  (HIGHCHAR | LOWCHAR)+
;

/** lowercase letters */
LOWCHAR
:   'a'..'z';
/** uppercase letters */
HIGHCHAR
:   'A'..'Z';

这就是我现在想要的,但是我不明白为什么我不能将标识符规则更改为 Lexer 规则并将 LOWCHAR 和 HIGHCHAR 更改为片段.词法分析器不知道关键字"中的字母可以作为标识符匹配吗?或相反亦然?或者可能是规则仅被定义为在其内部具有前瞻性,而不是所有可能的匹配语法?

This does what I want now, however I can't see why I can't change the identifier rule to a Lexer rule and LOWCHAR and HIGHCHAR to fragments. A Lexer doesn't know that letters in "keyword" can be matched as an identifier? or vice versa? Or maybe it is that rules are only defined to have a lookahead inside themselves, not all possible matching syntaxes?

推荐答案

在 ANTLR 3 和 ANTLR 4 中解决此问题的最简单方法是只允许 IDENTIFIER 匹配单个输入字符,并且然后创建一个解析器规则来处理这些字符的序列.

The easiest way to resolve this in both ANTLR 3 and ANTLR 4 is to only allow IDENTIFIER to match a single input character, and then create a parser rule to handle sequences of these characters.

identifier : IDENTIFIER+;
IDENTIFIER : HIGHCHAR | LOWCHAR;

这会导致词法分析器将输入 identifier 作为 10 个单独的字符跳过,然后将 keyword 作为单个 KEYWORD 标记读取.

This would cause the lexer to skip the input identifier as 10 separate characters, and then read keyword as a single KEYWORD token.

您在 ANTLR 4 中使用非贪婪运算符 +? 观察到的行为与此类似.这个操作符说匹配尽可能少的 (HIGHCHAR|LOWCHAR) 块,同时仍然创建一个 IDENTIFIER 令牌".显然,创建令牌的最少数字是一个,因此这实际上是编写 IDENTIFIER 以匹配单个字符的一种非常低效的方式.parse 规则未能处理这个问题的原因是它只允许一个 IDENTIFIER 标记出现在 KEYWORD 标记之前.通过像我上面展示的那样创建解析器规则 identifier,解析器将能够将 IDENTIFIER 标记序列(每个标记为单个字符)视为单个标识符.

The behavior you observed in ANTLR 4 using the non-greedy operator +? is similar to this. This operator says "match as few (HIGHCHAR|LOWCHAR) blocks as possible while still creating an IDENTIFIER token". Clearly the fewest number to create the token is one, so this was effectively a highly inefficient way of writing IDENTIFIER to match a single character. The reason the parse rule failed to handle this is it only allows a single IDENTIFIER token to appear before the KEYWORD token. By creating a parser rule identifier like I showed above, the parser would be able to treat sequences of IDENTIFIER tokens (which are each a single character), as a single identifier.

您在 ANTLR 3 中收到消息以下替代方案永远无法匹配..."的原因是静态分析已确定规则 IDENTIFIER 中的正闭包 永远不会匹配多于 1 个字符,因为规则总是在 恰好 1 个字符时成功.

The reason you get the message "The following alternatives can never be matched..." in ANTLR 3 is the static analysis has determined that the positive closure in the rule IDENTIFIER will never match more than 1 character because the rule will always be successful with exactly 1 character.

这篇关于Lexer 前瞻如何与 ANTLR3 和 ANTLR4 中的贪婪和非贪婪匹配一起使用?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆