Token type depends on following token


Question

I am stuck on a pretty simple grammar. Googling and reading books did not help. I started using ANTLR quite recently, so this is probably a very simple question.

I am trying to write a very simple lexer using ANTLR v3.

lexer grammar TestLexer;

options {
  language = Java;
}

TEST_COMMENT
    :   '/*' WS? TEST WS? '*/'
    ;

ML_COMMENT
    :   '/*' ( options {greedy=false;} : .)* '*/' {$channel=HIDDEN;}
    ;

TEST    :   'TEST'
    ;

WS  :   (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;}
    ;

Test class:

import org.antlr.runtime.*;

public class TestParserInvoker {
    private static void extractCommandsTokens(final String script) throws RecognitionException {

        final ANTLRStringStream input = new ANTLRStringStream(script);
        final Lexer lexer = new TestLexer(input);

        final TokenStream tokenStream = new CommonTokenStream(lexer);
        Token t;
        do {
            t = lexer.nextToken();
            if (t != null) {
                System.out.println(t);
            }
        } while (t == null || t.getType() != Token.EOF);
    }


    public static void main(final String[] args) throws RecognitionException {
        final String script = "/* TEST */";
        extractCommandsTokens(script);
    }
}

So when the test string is "/* TEST */", the lexer produces two tokens, as expected: one of type TEST_COMMENT and one of type EOF. Everything is OK.

But if the test string contains one extra space at the end, "/* TEST */ ", the lexer produces three tokens: ML_COMMENT, WS and EOF.

Why does the first token get the type ML_COMMENT? I thought the way a token is detected depends only on the precedence of the lexer rules in the grammar. And of course it should not depend on the tokens that follow.

Thanks for your help!

P.S. I can use the lexer option filter=true, and the token then gets the correct type, but this approach requires extra work in the token definitions. To be honest, I do not want to use this type of lexer.

Recommended answer

ANTLR tokenizes the character stream starting from the top rule downwards and tries to match as much as possible. So, yes, I would also have expected a TEST_COMMENT to be created for both "/* TEST */" and "/* TEST */ ". You can always have a look at the generated source code of the lexer to see why it chooses to create an ML_COMMENT for the second input.
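To make the "top rule first, match as much as possible" model concrete, here is a minimal sketch of it in plain Java. It is not the code ANTLR generates: the class name MaximalMunchSketch, the tokenize helper, and the java.util.regex patterns are hypothetical hand translations of the four lexer rules from the question, used only to show what this simple model would predict.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MaximalMunchSketch {
    // Rules in grammar order; an earlier rule wins when two matches have equal length.
    // The regexes are hand-written approximations of the question's lexer rules.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("TEST_COMMENT", Pattern.compile("/\\*[ \\t\\n\\r\\f]*TEST[ \\t\\n\\r\\f]*\\*/"));
        RULES.put("ML_COMMENT", Pattern.compile("(?s)/\\*.*?\\*/"));
        RULES.put("TEST", Pattern.compile("TEST"));
        RULES.put("WS", Pattern.compile("[ \\t\\n\\r\\f]+"));
    }

    /** Returns the token type names produced under the longest-match model. */
    public static List<String> tokenize(String input) {
        List<String> types = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() anchors at the region start; only a strictly
                // longer match may replace an earlier rule's match.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("No rule matches at offset " + pos);
            }
            types.add(bestType);
            pos = bestEnd;
        }
        types.add("EOF");
        return types;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/* TEST */"));   // [TEST_COMMENT, EOF]
        System.out.println(tokenize("/* TEST */ "));  // [TEST_COMMENT, WS, EOF]
    }
}
```

Under this model the trailing space has no influence on the first token, which is the behavior the question expected. The generated ANTLR v3 lexer evidently decides differently for the second input, so the sketch is useful mainly as a baseline to compare against the generated source.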

Whether this is a bug or expected behavior, I would not use separate lexer rules that look so much alike. Could you explain what you're really trying to solve here?


user776872 wrote:

I can use lexer option filter=true - token will get the correct type, but this approach requires extra work in tokens definitions. To be honest, I do not want to use this type of lexer.

I don't quite understand this remark. Are you only interested in a part of the input source? In that case, filter=true is surely a good option. If you want to tokenize the entire input source, then you shouldn't use filter=true.

To distinguish between multi-line comments and Javadoc comments, it's best to keep them in the same rule and change the type of the token if it starts with /**, like this:

grammar T;

// options

tokens {
  DOC_COMMENT;
}

// rules

COMMENT
  :  '/*' (~'*' .*)? '*/'
  |  '/**' ~'/' .* '*/' {$type=DOC_COMMENT;}
  ;
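The same "one rule, then pick the type from the prefix" idea can be sketched outside ANTLR in plain Java. This is only an illustration of the decision the two alternatives encode; the class CommentTypeSketch, its TokenType enum, and the classify method are hypothetical names, and the input is assumed to be a complete block comment already.

```java
public class CommentTypeSketch {
    public enum TokenType { COMMENT, DOC_COMMENT }

    /**
     * Classifies text that is already known to be a complete block comment.
     * Mirrors the second alternative of the COMMENT rule: '/**' followed by
     * a character other than '/' and later the closing star-slash.
     */
    public static TokenType classify(String comment) {
        if (comment.length() < 4 || !comment.startsWith("/*") || !comment.endsWith("*/")) {
            throw new IllegalArgumentException("Not a block comment: " + comment);
        }
        // '/**' + one non-'/' character + '*/' needs at least 6 characters,
        // which keeps the empty comment "/**/" an ordinary COMMENT.
        boolean isDoc = comment.length() >= 6
                && comment.startsWith("/**")
                && comment.charAt(3) != '/';
        return isDoc ? TokenType.DOC_COMMENT : TokenType.COMMENT;
    }

    public static void main(String[] args) {
        for (String c : new String[] {"/**/", "/*foo*/", "/**bar*/"}) {
            System.out.println(classify(c) + " :: " + c);
        }
    }
}
```

Keeping the check in one place means there is only one way to recognize a block comment, so the two token types cannot compete with each other during tokenization.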

Note that both .* and .+ are non-greedy by default in ANTLR (contrary to popular belief).
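For readers more familiar with regular expressions, the difference between greedy and non-greedy matching can be shown with java.util.regex. This is an analogy only, not ANTLR code: in Java regexes the default is greedy, and the reluctant form `.*?` plays the role described for `.*` here. The class GreedyVsReluctant and the helper firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    /** Returns the first substring matched by the regex, or null if none. */
    public static String firstMatch(String regex, String input) {
        // DOTALL lets '.' also match line terminators, as in a multi-line comment.
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "/*a*/ x /*b*/";
        // Greedy: runs to the LAST "*/" and swallows both comments.
        System.out.println(firstMatch("/\\*.*\\*/", input));   // /*a*/ x /*b*/
        // Reluctant (non-greedy): stops at the FIRST "*/".
        System.out.println(firstMatch("/\\*.*?\\*/", input));  // /*a*/
    }
}
```

A non-greedy loop is what makes a comment rule end at the first closing delimiter instead of at the last one in the input.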

grammar T;

tokens {
  DOC_COMMENT;
}

@parser::members {
  public static void main(String[] args) throws Exception {
    TLexer lexer = new TLexer(new ANTLRStringStream("/**/ /*foo*/ /**bar*/"));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}

parse
  :  (t=. {System.out.println(tokenNames[$t.type] + " :: " + $t.text);})* EOF
  ;

COMMENT
  :  '/*' (~'*' .*)? '*/'
  |  '/**' ~'/' .* '*/' {$type=DOC_COMMENT;}
  ;

SPACE
  :  ' ' {$channel=HIDDEN;}
  ;

Compiling and running the test grammar above produces:

bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar TParser 
COMMENT :: /**/
COMMENT :: /*foo*/
DOC_COMMENT :: /**bar*/
