Lexer, overlapping rule, but want the shorter match


Problem description


I want to read an input stream and divide the input into 2 types: PATTERN & WORD_WEIGHT, which are defined below.


The problem arises from the fact that every character valid in a WORD_WEIGHT is also valid in a PATTERN. When multiple WORD_WEIGHTs appear without spaces between them, the lexer matches a single PATTERN rather than delivering multiple WORD_WEIGHT tokens.


I need to be able to handle the following cases and get the indicated result:

  • [20] => WORD_WEIGHT
  • cat => PATTERN
  • [dog] => PATTERN


And this is the problematic case: it matches PATTERN because the lexer selects the longer of the two possibilities. Note: there is no space between them.

  • [20][30] => WORD_WEIGHT WORD_WEIGHT


Also need to handle this case (which imposes some limits on the possible solutions). Note that the brackets may not be matching for a PATTERN...

  • [[[cat] => PATTERN

The grammar:

grammar Brackets;

fragment
DIGIT
    : ('0'..'9')
    ;

fragment
WORD_WEIGHT_VALUE           
    : ('-' | '+')? DIGIT+ ('.' DIGIT+)? 
    | ('-' | '+')? '.' DIGIT+
    ;

WORD_WEIGHT 
    : '[' WORD_WEIGHT_VALUE ']' 
    ;

PATTERN   
    : ~(' ' | '\t' | '\r' | '\n' )+  
    ;

WS 
    : (' ' | '\t' | '\r' | '\n' )+ -> skip
    ;


start : (PATTERN | WORD_WEIGHT)* EOF;


The question is, what Lexer rules would give the desired result?


I'm wishing for a feature, a special directive that one can specify for a lexer rule that affects the matching process. It would instruct the lexer, upon a match of the rule, to stop the matching process and use this matched token.


FOLLOW-UP - THE SOLUTION WE CHOSE TO PURSUE:


Replace WORD_WEIGHT above with:

fragment
WORD_WEIGHT 
    : '[' WORD_WEIGHT_VALUE ']'
    ;

WORD_WEIGHTS
    : WORD_WEIGHT (INNER_WS? WORD_WEIGHT)*
    ;

fragment
INNER_WS
    : (' ' | '\t' )+
    ;

Also, the grammar rule becomes:

start : (PATTERN | WORD_WEIGHTS)* EOF;


Now, any sequence of word weights (whether space-separated or not) becomes the value of a single WORD_WEIGHTS token. This happens to match our usage too: our grammar (not in the snippet above) always defines word weights as "one or more". The multiplicity is now "captured" by the lexer instead of the parser. If/when we need to process each word weight separately, we can split the value in the application (e.g. in a parse tree listener).
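Such a split could be sketched as follows (a minimal standalone helper; the class and method names are hypothetical, not part of the grammar or any ANTLR API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordWeightSplitter {
    // Matches one bracketed weight, e.g. "[20]" or "[-3.5]".
    // WORD_WEIGHT_VALUE cannot contain ']', so [^\]]* is safe here.
    private static final Pattern WEIGHT = Pattern.compile("\\[[^\\]]*\\]");

    // Splits a WORD_WEIGHTS token value such as "[20][30]" or "[20] [30]"
    // into its individual WORD_WEIGHT strings, ignoring inner whitespace.
    public static List<String> split(String tokenText) {
        List<String> weights = new ArrayList<>();
        Matcher m = WEIGHT.matcher(tokenText);
        while (m.find()) {
            weights.add(m.group());
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(split("[20][30]"));  // [[20], [30]]
        System.out.println(split("[20] [30]"));
    }
}
```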

Accepted answer

You can implement WORD_WEIGHT as follows:

WORD_WEIGHT
  : '[' WORD_WEIGHT_VALUE ']'
    PATTERN?
  ;


Then, in your lexer, you can override the emit method to correct the lexer's position, removing the PATTERN (if any) that was appended to the end of the WORD_WEIGHT token. ANTLRWorks 2 contains examples of this technique.

The modification involves the following steps.

  1. Override LexerATNSimulator to add the resetAcceptPosition method.
  2. Set the _interp field to an instance of your custom LexerATNSimulator in the constructor for your lexer class.
  3. Calculate the desired end position for your token, and call resetAcceptPosition. For fixed-width tokens like those in the ST4 examples, the calculation is simply the length of the fixed operator or keyword that appears at the beginning of the token. For your case, you will need to call getText() and examine the result to determine the correct length of the WORD_WEIGHT token. Since the WORD_WEIGHT_VALUE rule cannot match ], the easiest analysis is probably to find the index of the first ] character in the result of getText() (the syntax of WORD_WEIGHT ensures that character will always exist).
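The length calculation in step 3 might look like the sketch below. This is plain Java showing only the index arithmetic; the class and method names are hypothetical, and the surrounding emit() override and the call into the custom LexerATNSimulator's resetAcceptPosition are omitted:

```java
public class WordWeightLength {
    // Given the text matched so far (e.g. "[20][30]" or "[20]cat"),
    // return the length of the leading WORD_WEIGHT, i.e. up to and
    // including the first ']'. Because WORD_WEIGHT_VALUE cannot match
    // ']', the first ']' always closes the leading weight.
    public static int leadingWeightLength(String matchedText) {
        int close = matchedText.indexOf(']');
        if (close < 0) {
            // Should not happen for a token matched by WORD_WEIGHT.
            throw new IllegalArgumentException("no ']' in WORD_WEIGHT text");
        }
        return close + 1;
    }

    public static void main(String[] args) {
        System.out.println(leadingWeightLength("[20][30]")); // 4
    }
}
```

This computed length is what you would pass (as an offset from the token's start) when resetting the accept position.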
