Antlr lexer tokens that match similar strings, what if the greedy lexer makes a mistake?

Question

It seems that the Antlr lexer sometimes makes a bad choice about which rule to use when tokenizing a stream of characters. I'm trying to figure out how to help Antlr make the choice that is obviously right to a human. I want to parse text like this:

d/dt(x)=a
a=d/dt
d=3
dt=4

This is an unfortunate syntax used by an existing language that I'm trying to write a parser for. The "d/dt(x)" represents the left-hand side of a differential equation. Ignore the lingo if you must; just know that it is not "d" divided by "dt". However, the second occurrence of "d/dt" really is "d" divided by "dt".

Here is my grammar:

grammar diffeq_grammar;

program :   (statement? NEWLINE)*;

statement
    :   diffeq
    |   assignment;

diffeq  :   DDT ID ')' '=' ID;

assignment
    :   ID '=' NUMBER
    |   ID '=' ID '/' ID
    ;

DDT :   'd/dt(';
ID  :   'a'..'z'+;
NUMBER  :   '0'..'9'+;
NEWLINE :   '\r\n'|'\r'|'\n';

With this grammar, the lexer grabs the first "d/dt(" and turns it into the token DDT. Perfect! Later, the lexer sees the second "d" followed by a "/" and says "hmmm, I can match this as an ID and a '/', or I can be greedy and match DDT". The lexer chooses to be greedy, but little does it know that there is no "(" a few characters later in the input stream. When the lexer looks for the missing "(", it throws a MismatchedTokenException!
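
The failure described above can be imitated with a small toy model. This is only a sketch of the behaviour as the question describes it, not Antlr's generated code: the class `GreedyLexerSketch`, the methods `predict` and `matchDdt`, and the assumption that the lexer commits to DDT after seeing just "d/" are all hypothetical and exist for illustration only.

```java
// Toy model (hypothetical, not Antlr's generated lexer): commit to a rule
// after a short lookahead, then fail when the full literal is not there.
final class GreedyLexerSketch {
    static final String DDT_LITERAL = "d/dt(";

    // Assumption of this sketch: "d/" is enough to rule out ID,
    // so the lexer commits to DDT at that point.
    static String predict(String input, int pos) {
        return input.startsWith("d/", pos) ? "DDT" : "ID";
    }

    // Matches the whole literal starting at pos and returns the index just
    // past it; throws as soon as a character is missing or different.
    static int matchDdt(String input, int pos) {
        for (int i = 0; i < DDT_LITERAL.length(); i++) {
            int at = pos + i;
            if (at >= input.length() || input.charAt(at) != DDT_LITERAL.charAt(i)) {
                throw new IllegalStateException(
                    "mismatch at index " + at + ", expected '" + DDT_LITERAL.charAt(i) + "'");
            }
        }
        return pos + DDT_LITERAL.length();
    }

    public static void main(String[] args) {
        String second = "a=d/dt";
        System.out.println(predict(second, 2)); // the toy lexer commits to DDT
        try {
            matchDdt(second, 2);                // no "(" follows, so this throws
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the commitment is made before the "(" is ever inspected, which is why there is no chance to fall back to ID once the literal fails to match.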

The only solution I've found so far is to move all the rules into the parser, with a grammar like:

grammar diffeq_grammar;

program :   (statement? NEWLINE)*;

statement
    :   diffeq
    |   assignment;

diffeq  :   ddt id ')' '=' id;

assignment
    :   id '=' number
    |   id '=' id '/' id
    ;

ddt :   'd' '/' 'd' 't' '(';
id  :   CHAR+;
number  :   DIGIT+;
CHAR    :   'a'..'z';
DIGIT   :   '0'..'9';
NEWLINE :   '\r\n'|'\r'|'\n';

This would be a fine solution if I didn't already have thousands of lines of working code that depend on the first grammar. After spending 2 days researching this problem, I have come to the conclusion that a lexer really ought to be able to distinguish the two cases. At some point the Antlr lexer is deciding between two rules: DDT and ID. It chooses DDT because the lexer is greedy. But when matching DDT fails, I'd like the lexer to go back to using ID.

I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer stay in the lexer, and most rules are left untouched).

Ideally, I can modify the lexer rule for DDT with any valid Antlr code and be done.

My target language is Java.

Thanks!

Update

Thank you all for some great answers! I accepted the answer that best fit my question. The actual solution I used is in my own answer (not the accepted answer), and there are more answers that could have worked. Readers, check out all the answers; some of them may suit your case better than mine.

Recommended answer

"I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer stay in the lexer, and most rules are left untouched)."

In that case, force the lexer to look ahead in the char-stream to make sure there really is "d/dt(", using a gated semantic predicate (the {...}?=> form placed in front of the DDT rule).

A demo:

grammar diffeq_grammar;

@parser::members {
  public static void main(String[] args) throws Exception {
    String src = 
        "d/dt(x)=a\n" +
        "a=d/dt\n" +
        "d=3\n" +
        "dt=4\n";
    diffeq_grammarLexer lexer = new diffeq_grammarLexer(new ANTLRStringStream(src));
    diffeq_grammarParser parser = new diffeq_grammarParser(new CommonTokenStream(lexer));
    parser.program();
  }
}

@lexer::members {
  private boolean ahead(String text) {
    for(int i = 0; i < text.length(); i++) {
      if(input.LA(i + 1) != text.charAt(i)) {
        return false;
      }
    }
    return true;
  }
}

program
 : (statement? NEWLINE)* EOF
 ;

statement
 : diffeq     {System.out.println("diffeq     : " + $text);}
 | assignment {System.out.println("assignment : " + $text);}
 ;

diffeq
 : DDT ID ')' '=' ID
 ;

assignment
 : ID '=' NUMBER
 | ID '=' ID '/' ID
 ;

DDT     : {ahead("d/dt(")}?=> 'd/dt(';
ID      : 'a'..'z'+;
NUMBER  : '0'..'9'+;
NEWLINE : '\r\n' | '\r' | '\n';

If you now run the demo:

java -cp antlr-3.3.jar org.antlr.Tool diffeq_grammar.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar diffeq_grammarParser

(when using Windows, replace the : with ; in the last command)

you will see the following output:

diffeq     : d/dt(x)=a
assignment : a=d/dt
assignment : d=3
assignment : dt=4
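
To see which tokens the gated DDT rule lets through, the same lookahead idea can be tried without Antlr. The following is a minimal hand-written sketch, not the generated lexer: the class `GatedTokenizerSketch`, its `tokenize` method, and the token labels such as "ID:x" are hypothetical names made up for this illustration, and `ahead` here takes the input and position explicitly instead of reading from the lexer's char-stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written sketch (hypothetical, not Antlr output) of the token rules
// DDT, ID, NUMBER and NEWLINE, with DDT guarded by a lookahead check.
final class GatedTokenizerSketch {
    // Same idea as the ahead(...) helper in the demo: is `text` next in the input?
    static boolean ahead(String input, int pos, String text) {
        return input.startsWith(text, pos);
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (ahead(input, pos, "d/dt(")) {
                // DDT is only chosen when the complete literal is present.
                tokens.add("DDT");
                pos += "d/dt(".length();
            } else if (c >= 'a' && c <= 'z') {
                int start = pos;
                while (pos < input.length() && input.charAt(pos) >= 'a' && input.charAt(pos) <= 'z') {
                    pos++;
                }
                tokens.add("ID:" + input.substring(start, pos));
            } else if (c >= '0' && c <= '9') {
                int start = pos;
                while (pos < input.length() && input.charAt(pos) >= '0' && input.charAt(pos) <= '9') {
                    pos++;
                }
                tokens.add("NUMBER:" + input.substring(start, pos));
            } else if (c == '\r' || c == '\n') {
                // Treat "\r\n" as a single NEWLINE, like the grammar does.
                if (c == '\r' && pos + 1 < input.length() && input.charAt(pos + 1) == '\n') {
                    pos++;
                }
                pos++;
                tokens.add("NEWLINE");
            } else {
                // Literal characters such as ')', '=' and '/'.
                tokens.add("'" + c + "'");
                pos++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("d/dt(x)=a"));
        System.out.println(tokenize("a=d/dt"));
    }
}
```

Because the check runs before DDT is chosen, the "d" in "a=d/dt" falls through to the ID branch, which matches the output shown above.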
