'IDENTIFIER' rule also consumes keywords in ANTLR lexer grammar

Question

While working on an ANTLR 3.5 grammar for parsing Java, I noticed that the 'IDENTIFIER' rule consumes some keywords in the ANTLR lexer grammar. The lexer grammar is:

lexer grammar JavaLexer;

options {
   //k=8;
   language=Java;
   filter=true;
   //backtrack=true;
}

@lexer::header {
package java;
}

@lexer::members {
public ArrayList<String> keywordsList = new ArrayList<String>();
}

V_DECLARATION
:
( ((MODIFIERS)=>tok1=MODIFIERS WS+)? tok2=TYPE WS+ var=V_DECLARATOR WS* )
{...};

fragment
V_DECLARATOR
  :
  (
    tok=IDENTIFIER WS* ( ',' | ';' | ASSIGN WS* V_VALUE )
  )
  {...};

fragment
V_VALUE
: (IDENTIFIER (DOT WS* IDENTIFIER WS* '(' | ',' | ';'))
;

MODIFIERS
  :
  (PUBLIC | PRIVATE | FINAL)+
;

PRIVATE
    :    tok = 'private'
    { keywordsList.add($tok.getText());  }
    ;

PUBLIC
    :    tok = 'public'
    { keywordsList.add($tok.getText()); }
    ;

DOT
    :    '.'
    { keywordsList.add("."); }
    ;

THIS
    :    tok = 'this'
    { keywordsList.add($tok.getText()); }
    ;

ASSIGN
    :    '='
        { keywordsList.add("="); }
    ;    

IDENTIFIER:
  tok =Identifier
  {  
   //System.out.println("Identifier: " + $tok.text);
  }
  ;  

fragment
Identifier 
    :   (Letter (Letter|JavaIDDigit)*);

fragment
Letter
    :  '\u0024' |
       '\u0041'..'\u005a' |
       '\u005f' |
       '\u0061'..'\u007a' |
       '\u00c0'..'\u00d6' |
       '\u00d8'..'\u00f6' |
       '\u00f8'..'\u00ff' |
       '\u0100'..'\u1fff' |
       '\u3040'..'\u318f' |
       '\u3300'..'\u337f' |
       '\u3400'..'\u3d2d' |
       '\u4e00'..'\u9fff' |
       '\uf900'..'\ufaff'
    ;

fragment
JavaIDDigit
    :  '\u0030'..'\u0039' |
       '\u0660'..'\u0669' |
       '\u06f0'..'\u06f9' |
       '\u0966'..'\u096f' |
       '\u09e6'..'\u09ef' |
       '\u0a66'..'\u0a6f' |
       '\u0ae6'..'\u0aef' |
       '\u0b66'..'\u0b6f' |
       '\u0be7'..'\u0bef' |
       '\u0c66'..'\u0c6f' |
       '\u0ce6'..'\u0cef' |
       '\u0d66'..'\u0d6f' |
       '\u0e50'..'\u0e59' |
       '\u0ed0'..'\u0ed9' |
       '\u1040'..'\u1049'
   ;

WS  :  (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN; skip();}
    ;

When I try to parse the line:

public final int inch = this.getValue();

Then the rule 'V_VALUE -> IDENTIFIER' also consumes the "this" keyword. That is undesirable, since keywords are also collected into a separate list (keywordsList).

Is there any trick or provision in an ANTLR grammar to match keywords with their own rules without affecting other rules such as IDENTIFIER?

Recommended answer

Your problem is caused by a misunderstanding of what belongs in the lexer and what belongs in the parser:

  • The lexer's job is to determine which words the stream of characters represents
    • e.g. that this is a THIS, 0 is a NUMBER and that is an IDENTIFIER
  • The parser's job is to determine how those words combine into larger structures
    • e.g. that a declaration consists of possible modifiers, a type and a list of identifiers
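As a rough illustration of this division of labor, here is a minimal Java sketch. It is not ANTLR code: the `Token` class, the token type names `MODIFIER` and `TYPE`, and both helper methods are hypothetical names chosen for the example. It assumes the lexer has already produced a flat list of tokens, so keyword collection and structure recognition can be done as two independent steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; none of these names come from ANTLR or the grammar.
final class LexerParserSplitDemo {
    // A token as a lexer would emit it: a type plus the matched text.
    static final class Token {
        final String type;
        final String text;
        Token(String type, String text) { this.type = type; this.text = text; }
    }

    // "Parser" step: checks for the shape MODIFIER* TYPE IDENTIFIER.
    static boolean startsWithDeclaration(List<Token> tokens) {
        int i = 0;
        while (i < tokens.size() && tokens.get(i).type.equals("MODIFIER")) i++;
        return i + 1 < tokens.size()
            && tokens.get(i).type.equals("TYPE")
            && tokens.get(i + 1).type.equals("IDENTIFIER");
    }

    // Keyword collection looks only at token types, not at sentence structure.
    static List<String> keywords(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (t.type.equals("MODIFIER") || t.type.equals("THIS")) out.add(t.text);
        }
        return out;
    }

    // Hand-written token list for: public final int inch = this.getValue();
    static List<Token> sampleTokens() {
        return Arrays.asList(
            new Token("MODIFIER", "public"), new Token("MODIFIER", "final"),
            new Token("TYPE", "int"), new Token("IDENTIFIER", "inch"),
            new Token("ASSIGN", "="), new Token("THIS", "this"),
            new Token("DOT", "."), new Token("IDENTIFIER", "getValue"));
    }

    public static void main(String[] args) {
        List<Token> tokens = sampleTokens();
        System.out.println(startsWithDeclaration(tokens));
        System.out.println(keywords(tokens));
    }
}
```

Because every keyword is its own token here, the keyword list does not depend on which larger structure the keyword appears in.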

Since the lexer's job is to determine which words are in the input, it processes the input and looks for the longest valid match (in ANTLR, if two or more rules accept the same input, the topmost one in the source grammar wins). It does not look for the "most specific" match, simply the longest one.

Example, for the input "that.":

  • Input t
    • Can be either THIS or IDENTIFIER
  • Input h
    • Can still be either THIS or IDENTIFIER
  • Input a
    • Can no longer be THIS, only IDENTIFIER is possible
  • Input t
    • IDENTIFIER for sure
  • Input .
    • No longer matches IDENTIFIER, so that is matched as IDENTIFIER, and the last input . becomes the start of the next token

Another example, for the input "this.":

  • Input t, h, i, s
    • Can be matched as either THIS or IDENTIFIER the whole time
  • Input .
    • Can no longer be matched by anything, so this is matched as THIS (the topmost matching rule) rather than IDENTIFIER, and . starts a new token
Now to the important part: whenever a lexer rule is referenced from another lexer rule, it is treated as merely a fragment of the referencing rule. This means that matching it does not emit a new token, and that no decision between multiple candidate tokens is made at the end of the fragment's match. Since this can indeed be matched by the IDENTIFIER rule, the whole declaration conforms to the V_DECLARATION lexer rule. So unless another lexer rule can match at least the same length of input and appears earlier in the grammar, V_DECLARATION will apply.
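The effect can be sketched in Java by reusing the same "longest match, earliest rule wins" policy and adding one composite rule. This is a simulation under stated assumptions, not ANTLR itself: the regular expressions only approximate V_DECLARATION and IDENTIFIER, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo; the regexes only approximate the grammar's rules.
final class CompositeRuleDemo {
    private static final String ID = "[A-Za-z_$][A-Za-z_$0-9]*";

    // Returns the keywords recorded by the stand-alone THIS rule.
    static List<String> collectKeywords(String input, boolean withDeclarationRule) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        if (withDeclarationRule) {
            // Stand-in for V_DECLARATION: modifiers, type, name, '=', "<id>.<id>(".
            rules.put("V_DECLARATION", Pattern.compile(
                "(?:(?:public|private|final)\\s+)*" + ID + "\\s+" + ID
                + "\\s*=\\s*" + ID + "\\.\\s*" + ID + "\\s*\\("));
        }
        rules.put("THIS", Pattern.compile("this"));
        rules.put("IDENTIFIER", Pattern.compile(ID));

        List<String> keywords = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestRule = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input).region(pos, input.length());
                // Longest match wins; strict '>' favors the earlier rule on ties.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestEnd = m.end();
                    bestRule = rule.getKey();
                }
            }
            if (bestRule == null) {
                pos++; // whitespace and punctuation are skipped in this sketch
                continue;
            }
            // Only a token emitted by the THIS rule itself records a keyword.
            if (bestRule.equals("THIS")) keywords.add(input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return keywords;
    }

    public static void main(String[] args) {
        String line = "public final int inch = this.getValue();";
        System.out.println(collectKeywords(line, true));
        System.out.println(collectKeywords(line, false));
    }
}
```

With the composite rule enabled, the whole declaration up to the opening parenthesis is consumed as a single token, so the THIS rule never fires and the first list is empty. Without it, this is emitted as its own token and the second list contains it.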

You didn't provide any rule that references THIS, so we don't know exactly how this plays out in your grammar. The likely cause is that the lexer can match a longer input, or match with an earlier rule, than anything that uses the THIS rule.
