"IDENTIFIER"规则还使用ANTLR Lexer语法中的关键字 [英] 'IDENTIFIER' rule also consumes keyword in ANTLR Lexer grammar
问题描述
While working on an ANTLR 3.5 grammar for parsing Java, I noticed that the 'IDENTIFIER' rule consumes some keywords in the lexer grammar. The lexer grammar is:
lexer grammar JavaLexer;
options {
//k=8;
language=Java;
filter=true;
//backtrack=true;
}
@lexer::header {
package java;
}
@lexer::members {
public ArrayList<String> keywordsList = new ArrayList<String>();
}
V_DECLARATION
:
( ((MODIFIERS)=>tok1=MODIFIERS WS+)? tok2=TYPE WS+ var=V_DECLARATOR WS* )
{...};
fragment
V_DECLARATOR
:
(
tok=IDENTIFIER WS* ( ',' | ';' | ASSIGN WS* V_VALUE )
)
{...};
fragment
V_VALUE
: (IDENTIFIER (DOT WS* IDENTIFIER WS* '(' | ',' | ';'))
;
MODIFIERS
:
(PUBLIC | PRIVATE | FINAL)+
;
PRIVATE
: tok = 'private'
{ keywordsList.add($tok.getText()); }
;
PUBLIC
: tok = 'public'
{ keywordsList.add($tok.getText()); }
;
DOT
: '.'
{ keywordsList.add("."); }
;
THIS
: tok = 'this'
{ keywordsList.add($tok.getText()); }
;
ASSIGN
: '='
{ keywordsList.add("="); }
;
IDENTIFIER:
tok =Identifier
{
//System.out.println("Identifier: " + $tok.text);
}
;
fragment
Identifier
: (Letter (Letter|JavaIDDigit)*);
fragment
Letter
: '\u0024' |
'\u0041'..'\u005a' |
'\u005f' |
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff'
;
fragment
JavaIDDigit
: '\u0030'..'\u0039' |
'\u0660'..'\u0669' |
'\u06f0'..'\u06f9' |
'\u0966'..'\u096f' |
'\u09e6'..'\u09ef' |
'\u0a66'..'\u0a6f' |
'\u0ae6'..'\u0aef' |
'\u0b66'..'\u0b6f' |
'\u0be7'..'\u0bef' |
'\u0c66'..'\u0c6f' |
'\u0ce6'..'\u0cef' |
'\u0d66'..'\u0d6f' |
'\u0e50'..'\u0e59' |
'\u0ed0'..'\u0ed9' |
'\u1040'..'\u1049'
;
WS : (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN; skip();}
;
When I try to parse the line:
public final int inch = this.getValue();
the rule 'VAR_VALUE -> IDENTIFIER' (V_VALUE in the grammar above) also consumes the "this" keyword. This is undesirable, because keywords should also be collected into a separate list.
Is there any trick or provision in ANTLR grammar to match each keyword by its own rule without affecting other functionality such as "IDENTIFIER"?
Recommended answer
Your problem is indeed caused by a misunderstanding of what belongs in the lexer and what belongs in the parser:
- The lexer's job is to determine which words the stream of characters represents
  - e.g. that 'this' is a THIS, '0' is a NUMBER and 'that' is an IDENTIFIER
- The parser's job is to determine what those words (tokens) mean when put together
  - e.g. that the declaration consists of possible modifiers, a type and a list of identifiers
Since the lexer's job is to determine which words are in the input, it processes the input and looks for the longest valid match (in ANTLR, if two or more rules accept the same input, the topmost one in the grammar source wins). It does not look for the "most specific" match, simply the longest one.
Example:
- Input 't'
  - Can be THIS or IDENTIFIER
- Input 'h'
  - Can still be THIS or IDENTIFIER
- Input 'a'
  - Can no longer be THIS, only IDENTIFIER is possible
- Input 't'
  - IDENTIFIER for sure
- Input '.'
  - No longer matches IDENTIFIER, so 'that' will be matched as IDENTIFIER and the last input '.' will be matched as the start of the next token
Another example:
- Input 't', 'h', 'i', 's'
  - Can be matched as either THIS or IDENTIFIER the whole time
- Input '.'
  - Can no longer be matched by anything, so 'this' will be matched as THIS (topmost matching rule) rather than IDENTIFIER, and '.' will start a new token
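The two walkthroughs above can be sketched in plain Java. This is only a model of the selection policy described here (longest match, earliest rule wins on a tie), not ANTLR's actual implementation. The class name `LongestMatchDemo`, the `nextToken` helper and the ASCII-only identifier pattern are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative model only: longest match wins, earliest rule wins on a tie.
final class LongestMatchDemo {
    // Insertion order stands in for rule order in the grammar file.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("THIS", Pattern.compile("this"));
        RULES.put("DOT", Pattern.compile("\\."));
        // Simplified ASCII-only identifier (assumption; the real grammar is wider).
        RULES.put("IDENTIFIER", Pattern.compile("[A-Za-z_$][A-Za-z_$0-9]*"));
    }

    /** Returns "RULE:text" for the token starting at pos, or null if no rule matches. */
    static String nextToken(String input, int pos) {
        String bestRule = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            // lookingAt() anchors at pos; the strict '>' keeps the earlier rule on a tie.
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestRule = rule.getKey();
                bestLen = m.end() - pos;
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(pos, pos + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(nextToken("that.getValue", 0)); // IDENTIFIER:that
        System.out.println(nextToken("this.getValue", 0)); // THIS:this
        System.out.println(nextToken("this.getValue", 4)); // DOT:.
    }
}
```

Moving `IDENTIFIER` above `THIS` in the map would make `this` come out as an identifier, which mirrors why keyword rules are usually placed before the identifier rule.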
And now to the important part: as long as a lexer rule is referenced from another lexer rule, it is considered to be merely a fragment of the referencing lexer rule. This means that matching it won't emit a new token, and it won't trigger any decision between multiple matching tokens at the end of the fragment's match. Since 'this' can indeed be matched by the IDENTIFIER rule, the whole declaration conforms to the V_DECLARATION lexer rule. So unless another lexer rule can match at least the same length of input and appears earlier in the grammar, this rule will apply.

You didn't provide any rule referencing THIS, so we don't know exactly how this plays out in your grammar, but the obvious cause is that the lexer can match longer input, or match with an earlier rule, than anything that uses the THIS rule.
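As a rough illustration of the fragment effect, the sketch below inlines an identifier pattern into a larger composite rule, the way V_VALUE pulls IDENTIFIER into V_DECLARATION. It uses the same simplified longest-match model as the answer's description, not ANTLR itself. The rule `V_ASSIGNMENT`, the class `CompositeRuleDemo` and the ASCII-only identifier pattern are hypothetical stand-ins.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative model only: a composite rule swallows 'this' as a plain identifier.
final class CompositeRuleDemo {
    // Simplified ASCII-only identifier (assumption).
    static final String IDENT = "[A-Za-z_$][A-Za-z_$0-9]*";

    static Map<String, Pattern> rules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        // Hypothetical stand-in for "IDENTIFIER WS* ASSIGN WS* IDENTIFIER DOT":
        // the identifier pattern is inlined, so no THIS/IDENTIFIER decision happens inside.
        rules.put("V_ASSIGNMENT", Pattern.compile(IDENT + "\\s*=\\s*" + IDENT + "\\."));
        rules.put("THIS", Pattern.compile("this"));
        rules.put("IDENTIFIER", Pattern.compile(IDENT));
        return rules;
    }

    /** Longest match wins; on a tie the rule listed first wins. */
    static String nextToken(Map<String, Pattern> rules, String input, int pos) {
        String bestRule = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestRule = rule.getKey();
                bestLen = m.end() - pos;
            }
        }
        return bestRule == null ? null : bestRule + ":" + input.substring(pos, pos + bestLen);
    }

    public static void main(String[] args) {
        // The composite rule matches more input, so 'this' never becomes a THIS token.
        System.out.println(nextToken(rules(), "inch = this.getValue();", 0)); // V_ASSIGNMENT:inch = this.
        // On its own, 'this' still resolves to THIS.
        System.out.println(nextToken(rules(), "this.getValue();", 0)); // THIS:this
    }
}
```

In this model the keyword is only recognized when it starts a token of its own, which is why declaration structure is usually left to parser rules that consume THIS and IDENTIFIER tokens.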