Understanding ANTLR4 Tokens

Question

I'm fairly new to ANTLR and am trying to understand what exactly a Token is in ANTLR4. Consider the following rather nonsensical grammar:

grammar Tst;

init: A token=('+'|'-') B;

A: .+?;
B: .+?;
ADD: '+';
SUB: '-';

ANTLR4 generates the following TstParser.InitContext for it:

public static class InitContext extends ParserRuleContext {
        public Token token;       //<---------------------------- HERE
        public TerminalNode A() { return getToken(TstParser.A, 0); }
        public TerminalNode B() { return getToken(TstParser.B, 0); }
        public InitContext(ParserRuleContext parent, int invokingState) {
            super(parent, invokingState);
        }
        @Override public int getRuleIndex() { return RULE_init; }
        @Override
        public void enterRule(ParseTreeListener listener) {
            if ( listener instanceof TstListener ) ((TstListener)listener).enterInit(this);
        }
        @Override
        public void exitRule(ParseTreeListener listener) {
            if ( listener instanceof TstListener ) ((TstListener)listener).exitInit(this);
        }
    }

All lexer rules are also available as static constants in the parser class:

public static final int A=1, B=2, ADD=3, SUB=4;

How can we use them to identify lexer rules? The A, B, and ADD rules can all match '+'. So which type should I test against?

What I mean is:

TstParser.InitContext ctx;
//...
ctx.token.getType() == //What type?
                       //TstParser.A
                       //TstParser.B
                       //or
                       //TstParser.ADD?

More generally, how does ANTLR4 determine the type of a Token?

Recommended answer

I will try to introduce you to the process of parsing. The process has two stages: the lexer stage, where tokens are created, and the parser stage. (The word "parsing" comes from the second stage, so it is not a very precise name for the process as a whole.) All you are trying to do is understand the input and, possibly, build a model of it along the way. To make this easier, the job is generally divided into smaller steps. It is much easier to make sense of tokens, which are elements of the input somewhat bigger than characters and mostly correspond to "words" (keywords, variables, and literals, to be precise).

Because of this, the first step is to pre-process the input, which arrives as a character stream, into tokens. All you can say about a token is what value is associated with it and what kind of token it is. For instance, in the very simple calculator input "2+3*9", '2' is a number token with value 2, '+' is an operator token with value '+', and so on. The result of the lexer stage is a stream of tokens. As you can imagine, lexer and parser rules are very similar: the first stage works with characters, the second with tokens.
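The calculator example can be sketched by hand in plain Java. This is only an illustration of what "a stream of tokens with a type and a value" means; the class `CalcLexerSketch`, the `Tok` class, and the `NUMBER`/`OPERATOR` constants are hypothetical names, not something ANTLR generates.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written sketch of a lexer for calculator input (not ANTLR-generated code). */
final class CalcLexerSketch {
    // Hypothetical token types, analogous to the int constants ANTLR puts in the parser class.
    static final int NUMBER = 1, OPERATOR = 2;

    /** A token pairs a type with the text it was built from. */
    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    /** Turns a character stream into a list of tokens. */
    static List<Tok> tokenize(String input) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isDigit(c)) {
                // Consume the whole run of digits as one NUMBER token.
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(new Tok(NUMBER, input.substring(start, i)));
            } else if ("+-*/".indexOf(c) >= 0) {
                tokens.add(new Tok(OPERATOR, String.valueOf(c)));
                i++;
            } else if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens but produces none
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Tok t : tokenize("2+3*9")) {
            System.out.println(t.type + " '" + t.text + "'");
        }
    }
}
```

For "2+3*9" the sketch yields five tokens: NUMBER, OPERATOR, NUMBER, OPERATOR, NUMBER. A parser would then work on that list and never look at individual characters again.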

Regarding ANTLR (many other generators work the same way), there is one important rule about the lexer: you cannot have the same rule for different tokens. So the grammar you posted won't work, because the lexer cannot distinguish between A and B. You can simply use the same token name on both sides of the operator and deal with the difference later.
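One way to picture "the same token name on both sides" is a hand-written check for a rule shaped like `init: ID op=(ADD|SUB) ID;`. The sketch below is an assumption-laden illustration, not ANTLR output: `InitRuleSketch`, `Tok`, `Init`, and the `ID`/`ADD`/`SUB` constants are hypothetical names. The point is that the operands share one type and differ only by position, while the operator's type is unambiguous.

```java
import java.util.Arrays;
import java.util.List;

/** Hand-written sketch of the rule `init: ID op=(ADD|SUB) ID;` (not ANTLR-generated code). */
final class InitRuleSketch {
    // Hypothetical token types; a single ID type replaces the overlapping A and B.
    static final int ID = 1, ADD = 2, SUB = 3;

    static final class Tok {
        final int type;
        final String text;
        Tok(int type, String text) { this.type = type; this.text = text; }
    }

    /** Result of matching the rule: the two operands are told apart by position only. */
    static final class Init {
        final Tok left, op, right;
        Init(Tok left, Tok op, Tok right) { this.left = left; this.op = op; this.right = right; }
    }

    static Init parseInit(List<Tok> tokens) {
        if (tokens.size() != 3) {
            throw new IllegalArgumentException("expected exactly three tokens");
        }
        Tok left = tokens.get(0), op = tokens.get(1), right = tokens.get(2);
        if (left.type != ID || right.type != ID) {
            throw new IllegalArgumentException("operands must be ID tokens");
        }
        if (op.type != ADD && op.type != SUB) {
            throw new IllegalArgumentException("operator must be ADD or SUB");
        }
        return new Init(left, op, right);
    }

    public static void main(String[] args) {
        Init init = parseInit(Arrays.asList(new Tok(ID, "a"), new Tok(ADD, "+"), new Tok(ID, "b")));
        System.out.println(init.op.type == ADD); // prints true
    }
}
```

With non-overlapping lexer rules, the test from the question has an obvious answer: the labeled operator token is compared against the ADD or SUB constant, never against the operand type.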

Why can't lexer rules be the same? As the lexer processes the input, it walks the character stream and commits to a single rule for each token. In ANTLR4 the rule that matches the most characters wins, and when several rules match the same text, the one defined first in the grammar is chosen. So if another rule would have matched as well, well, what a pity: it never gets a chance. The parser in ANTLR is much more generous than the lexer.
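The effect on the grammar from the question can be imitated with regular expressions. The sketch below assumes ANTLR4's documented tie-breaking (longest match first, then the rule defined earliest) and assumes that the non-greedy `.+?` at the end of a lexer rule ends up consuming a single character; `RulePrioritySketch` and its methods are hypothetical names, and this is not how ANTLR is actually implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex-based imitation of lexer rule selection (not ANTLR's real algorithm). */
final class RulePrioritySketch {
    /** Returns the winning rule name for each token: longest match, earliest rule on ties. */
    static List<String> ruleNames(Map<String, Pattern> rules, String input) {
        List<String> result = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String best = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Strictly greater: on equal length the earlier rule keeps the win.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    bestLen = m.end() - pos;
                    best = rule.getKey();
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            result.add(best);
            pos += bestLen;
        }
        return result;
    }

    /** Rules in the question's order; A and B are modeled as one-character matches (assumption). */
    static Map<String, Pattern> questionRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>(); // keeps definition order
        rules.put("A", Pattern.compile(".", Pattern.DOTALL));
        rules.put("B", Pattern.compile(".", Pattern.DOTALL));
        rules.put("ADD", Pattern.compile("\\+"));
        rules.put("SUB", Pattern.compile("-"));
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(ruleNames(questionRules(), "x+y")); // prints [A, A, A]
    }
}
```

Under these assumptions every character, including '+', comes out as an A token, so B, ADD, and SUB are never produced and the `init` rule can never be satisfied.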

To sum up: tokens are the products of the lexer. They are groups of one or more characters that should be presented to the next stage as a single unit. We are talking about variable names, operators, function names, and so on.
