"IDENTIFIER"规则还使用ANTLR Lexer语法中的关键字 [英] 'IDENTIFIER' rule also consumes keyword in ANTLR Lexer grammar


Problem description

While working on an ANTLR 3.5 grammar for parsing Java, I noticed that the 'IDENTIFIER' rule also consumes some keywords in the lexer grammar. The lexer grammar is:

lexer grammar JavaLexer;

options {
   //k=8;
   language=Java;
   filter=true;
   //backtrack=true;
}

@lexer::header {
package java;
}

@lexer::members {
public ArrayList<String> keywordsList = new ArrayList<String>();
}

V_DECLARATION
:
( ((MODIFIERS)=>tok1=MODIFIERS WS+)? tok2=TYPE WS+ var=V_DECLARATOR WS* )
{...};

fragment
V_DECLARATOR
  :
  (
    tok=IDENTIFIER WS* ( ',' | ';' | ASSIGN WS* V_VALUE )
  )
  {...};

fragment
V_VALUE
: (IDENTIFIER (DOT WS* IDENTIFIER WS* '(' | ',' | ';'))
;

MODIFIERS
  :
  (PUBLIC | PRIVATE | FINAL)+
;

PRIVATE
    :    tok = 'private'
    { keywordsList.add($tok.getText());  }
    ;

PUBLIC
    :    tok = 'public'
    { keywordsList.add($tok.getText()); }
    ;

DOT
    :    '.'
    { keywordsList.add("."); }
    ;

THIS
    :    tok = 'this'
    { keywordsList.add($tok.getText()); }
    ;

ASSIGN
    :    '='
        { keywordsList.add("="); }
    ;    

IDENTIFIER:
  tok =Identifier
  {  
   //System.out.println("Identifier: " + $tok.text);
  }
  ;  

fragment
Identifier 
    :   (Letter (Letter|JavaIDDigit)*);

fragment
Letter
    :  '\u0024' |
       '\u0041'..'\u005a' |
       '\u005f' |
       '\u0061'..'\u007a' |
       '\u00c0'..'\u00d6' |
       '\u00d8'..'\u00f6' |
       '\u00f8'..'\u00ff' |
       '\u0100'..'\u1fff' |
       '\u3040'..'\u318f' |
       '\u3300'..'\u337f' |
       '\u3400'..'\u3d2d' |
       '\u4e00'..'\u9fff' |
       '\uf900'..'\ufaff'
    ;

fragment
JavaIDDigit
    :  '\u0030'..'\u0039' |
       '\u0660'..'\u0669' |
       '\u06f0'..'\u06f9' |
       '\u0966'..'\u096f' |
       '\u09e6'..'\u09ef' |
       '\u0a66'..'\u0a6f' |
       '\u0ae6'..'\u0aef' |
       '\u0b66'..'\u0b6f' |
       '\u0be7'..'\u0bef' |
       '\u0c66'..'\u0c6f' |
       '\u0ce6'..'\u0cef' |
       '\u0d66'..'\u0d6f' |
       '\u0e50'..'\u0e59' |
       '\u0ed0'..'\u0ed9' |
       '\u1040'..'\u1049'
   ;

WS  :  (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN; skip();}
    ;

When I try to parse the line:

public final int inch = this.getValue();

the rule 'V_VALUE -> IDENTIFIER' also consumes the "this" keyword. This is undesirable, because keywords are supposed to be collected into a separate list.

Is there any trick or provision in the ANTLR grammar to match keywords with their own rules without affecting other functionality such as 'IDENTIFIER'?

Recommended answer

Your problem is caused by a misunderstanding of what belongs in the lexer and what belongs in the parser:

• The lexer's job is to determine which words the stream of characters represents
  • e.g. that 'this' is a THIS, '0' is a NUMBER and 'that' is an IDENTIFIER
• The parser's job is to determine how those words are structured
  • e.g. that a declaration consists of possible modifiers, a type and a list of identifiers

Since the lexer's job is to determine which words are in the input, it processes the input and looks for the longest valid match (in ANTLR, if two or more rules accept the same input, the topmost one in the source grammar wins). It does not look for the "most specific" match, only the longest one.

Example:

• Input 't'
  • Can be THIS or IDENTIFIER
• Input 'h'
  • Can still be THIS or IDENTIFIER
• Input 'a'
  • Can no longer be THIS, only IDENTIFIER is possible
• Input 't'
  • IDENTIFIER for sure
• Input '.'
  • No longer matches IDENTIFIER, so 'that' will be matched as IDENTIFIER, and the last input '.' will be matched as the start of the next token

Another example:

• Input 't', 'h', 'i', 's'
  • Can be matched as either THIS or IDENTIFIER the whole time
• Input '.'
  • Can no longer be matched by anything, so 'this' will be matched as THIS (the topmost matching rule) rather than IDENTIFIER, and '.' will start a new token
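The two walkthroughs above can be reproduced with a small toy model. This is only a sketch of the "longest match, then topmost rule" policy described in this answer, not ANTLR's actual implementation: the class name LongestMatchDemo and the method tokenize are hypothetical, the rules are written as regular expressions, and IDENTIFIER is reduced to ASCII letters, digits, '_' and '$'.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {
    // Rules in "grammar order": on equal match length the earlier rule wins.
    // Assumption: simplified ASCII-only patterns, not the rules from the question.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("THIS", Pattern.compile("this"));
        RULES.put("DOT", Pattern.compile("\\."));
        RULES.put("IDENTIFIER", Pattern.compile("[A-Za-z_$][A-Za-z_$0-9]*"));
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors the match at the start of the region.
                // Strictly greater: a later rule only wins with a longer match.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) {
                pos++; // no rule matches here: skip one character
                continue;
            }
            tokens.add(bestRule + "(" + input.substring(pos, pos + bestLen) + ")");
            pos += bestLen;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("that."));     // [IDENTIFIER(that), DOT(.)]
        System.out.println(tokenize("this."));     // [THIS(this), DOT(.)]
        System.out.println(tokenize("thisValue")); // [IDENTIFIER(thisValue)]
    }
}
```

The third call shows why "longest" beats "most specific": 'thisValue' starts with the keyword, but IDENTIFIER matches more characters, so THIS never gets a chance.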

Now to the important part: whenever a lexer rule is referenced from another lexer rule, it is treated as merely a fragment of the referencing rule. This means that matching it does not emit a new token, and it does not trigger any decision between multiple matching tokens at the end of the fragment's match. Since 'this' can indeed be matched by the IDENTIFIER rule, the whole declaration conforms to the V_DECLARATION lexer rule. So unless another lexer rule can match input of at least the same length and appears earlier in the grammar, V_DECLARATION will apply.
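To make the "fragment" effect concrete, here is a rough sketch that inlines the sub-rules into one regular expression, roughly the way a composite lexer rule absorbs the rules it references. Everything here is an assumption for illustration: the class CompositeRuleDemo and the method firstToken are hypothetical, TYPE is assumed to be the literal 'int' (the question does not show it), the optional MODIFIERS prefix is left out, and IDENTIFIER is ASCII-only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompositeRuleDemo {
    // Sub-rules as regex fragments; they never produce tokens on their own here.
    static final String WS = "[ \\t\\r\\n\\f]";
    static final String IDENTIFIER = "[A-Za-z_$][A-Za-z_$0-9]*";
    static final String TYPE = "int"; // assumption: TYPE is not shown in the question

    // V_VALUE: IDENTIFIER (DOT WS* IDENTIFIER WS* '(' | ',' | ';')
    static final String V_VALUE =
            IDENTIFIER + "(?:\\." + WS + "*" + IDENTIFIER + WS + "*\\(|,|;)";

    // V_DECLARATOR: IDENTIFIER WS* (',' | ';' | ASSIGN WS* V_VALUE)
    static final String V_DECLARATOR =
            IDENTIFIER + WS + "*(?:,|;|=" + WS + "*" + V_VALUE + ")";

    // V_DECLARATION (without modifiers): TYPE WS+ V_DECLARATOR WS*
    static final Pattern V_DECLARATION =
            Pattern.compile(TYPE + WS + "+" + V_DECLARATOR + WS + "*");

    // Returns the text the composite rule would swallow as ONE token, or null.
    static String firstToken(String input) {
        Matcher m = V_DECLARATION.matcher(input);
        return m.lookingAt() ? input.substring(0, m.end()) : null;
    }

    public static void main(String[] args) {
        // Prints: int inch = this.getValue(
        System.out.println(firstToken("int inch = this.getValue();"));
    }
}
```

In this model the word 'this' is consumed by the inlined IDENTIFIER pattern in the middle of a single long match. No separate THIS token is ever produced, so an action attached to a THIS rule would never run for that input.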

You didn't provide any rule that references THIS, so we don't know exactly how this plays out in your grammar. The obvious cause, though, is that the lexer can match a longer input, or match with an earlier rule, than anything that uses the THIS rule.
