Can JavaCC distinguish a token by its context?

Question

The basic requirement is to use keywords as identifiers, so I want to distinguish a token by its context (e.g. class is a keyword, but we also allow a variable named class).

In JavaCC this is possible, but it is hard. Here is how I do it:

TOKEN :
{
    <I_CAL:     "CAL">  : DO_CAL
    | <I_CALL:  "CALL">
    | <I_CMP:   "CMP">
    | <I_EXIT:  "EXIT">
    | <I_IN:    "IN">
    | <I_JMP:   "JMP">
    | <I_JPC:   "JPC">  : NEED_CMP_OP
    | <I_LD:    "LD">   : NEED_DATA_TYPE
    | <I_NOP:   "NOP">
    | <I_OUT:   "OUT">
    | <I_POP:   "POP">
    | <I_PUSH:  "PUSH">
    | <I_RET:   "RET">
    | <I_DATA:  "DATA"> : DO_DATA
    | <I_BLOCK:  ".BLOCK">
}

// T prefix for Token
TOKEN :
{
    <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" |"RS" | "RB">
// We need the tokens below only in special contexts; otherwise they are just IDENTIFIER
//    | <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT">
//    | <PSEUDO_DATA_TYPE: "CHAR" >
//    | <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD">
//    | <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
    | <T_LABEL: <IDENTIFIER> ([" "])* <COLON>>
}

// Now we need a CMP OP
<NEED_CMP_OP> TOKEN:
{
    <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ"> : DEFAULT
}
// Now we need a DATA TYPE
<NEED_DATA_TYPE,DO_CAL> TOKEN:
{
    // EXTENSION Add char to data type
    <DATA_TYPE: "DWORD" | "WORD" | "BYTE" | "FLOAT" | "INT" | "CHAR"> {
        if(curLexState == DO_CAL){
            SwitchTo(NEED_CAL_OP);
        }else{
            SwitchTo(DEFAULT);
        }
    }
}
// We need a CAL OP
<NEED_CAL_OP> TOKEN:
{
    <CAL_OP: "ADD" | "SUB" | "MUL" | "DIV" | "MOD"> : DEFAULT
}
// Also need to skip whitespace in these states
<NEED_DATA_TYPE,NEED_CAL_OP,NEED_CMP_OP,DO_CAL,DO_DATA> SKIP:
{
    " "
|   "\t"
|   "\r"
|   "\f"
}

The source is here. I can distinguish tokens by context using curLexState.

It works, but it is fussy: I need to add and maintain a lot of extra lexical states. Is there an easier way to achieve this?

Recommended answer

There are three ways to do this outlined in the JavaCC FAQ.


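  • One is to use lexical states, as you have done. This approach can be tricky, but it is the only way to deal with situations where the length of the longest match depends on context or where the skip rules depend on context. For your problem it is probably more complicated than you need.

  • The second is to use one token kind and use semantic lookahead based on the image of the token to have the parser treat some tokens specially in some circumstances. See the FAQ for details.

  • The third (and generally simplest) approach is to make the distinction at the lexical level and then ignore the distinction at the syntactic level. This is usually the best way to deal with keywords that can double as identifiers.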

Below I'll give three examples of the third approach.

If all you want to do is to allow the keyword class to be used as a variable name, there is a very simple way to do this. In the lexer put in the usual rules.

TOKEN: { <CLASS: "class"> }
TOKEN: { < VARNAME: ["a"-"z","A"-"Z"] (["a"-"z","A"-"Z"])* > } // Or what you will

In the parser write a production

Token varName() : { Token t ; }
{
    (t = <CLASS> | t = <VARNAME>)
    {return t ;}
}

Then use varName() elsewhere in the parser.
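As a rough illustration of what this production achieves, here is a hand-written Java sketch rather than JavaCC output. The class name KeywordAsVarName, the whitespace-based lexer, and the extra NUMBER kind (added only so that something can be rejected) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class KeywordAsVarName {
    // NUMBER is a hypothetical extra kind, present only to show a rejection.
    enum Kind { CLASS, VARNAME, NUMBER }

    static final class Token {
        final Kind kind;
        final String image;
        Token(Kind kind, String image) { this.kind = kind; this.image = image; }
    }

    // Lexer with "the usual rules": the keyword is checked before VARNAME,
    // so the word "class" always comes out as a CLASS token.
    static List<Token> lex(String input) {
        List<Token> tokens = new ArrayList<>();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) continue;
            if (word.equals("class")) tokens.add(new Token(Kind.CLASS, word));
            else if (word.matches("[a-zA-Z]+")) tokens.add(new Token(Kind.VARNAME, word));
            else if (word.matches("[0-9]+")) tokens.add(new Token(Kind.NUMBER, word));
            else throw new IllegalArgumentException("Unexpected input: " + word);
        }
        return tokens;
    }

    // Counterpart of the varName() production: the lexical distinction
    // between CLASS and VARNAME is simply ignored at this point.
    static Token varName(Token t) {
        if (t.kind == Kind.CLASS || t.kind == Kind.VARNAME) return t;
        throw new IllegalStateException("Expected a variable name but found " + t.kind);
    }

    public static void main(String[] args) {
        for (Token t : lex("class total")) {
            System.out.println(t.kind + " accepted as variable name: " + varName(t).image);
        }
    }
}
```

The lexer stays context-free; only the parser decides that a CLASS token is acceptable wherever a variable name is expected.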

Turning to the assembler example in the original question, let's look at the JPC instruction as an example. The JPC (Jump conditional) instruction is followed by a comparison operator such as Z, B, etc and then an operand that can be a number of things including identifiers. E.g. we could have

JPC Z fred

But we could also have an identifier named JPC or Z, so

JPC Z JPC

JPC Z Z

are also valid JPC instructions.

In the lexical part we have

TOKEN : // Opcodes
{
    <I_CAL: "CAL"> 
|   <I_JPC: "JPC"> 
|   ... // other op codes
|   <CMP_OP: "Z" | "B" | "BE" | "A" | "AE" | "NZ">
|   <T_REGISTER : "R0" | "R1" | "R2" | "R3" | "RP" | "RF" |"RS" | "RB">
}
... // Other lexical rules.

TOKEN : // Be sure this rule comes after all keywords.
{
    < IDENTIFIER: <LETTER> (<LETTER>|<DIGIT>)* >
}

In the parser we have

Instruction Instruction():{
    Instruction inst = new Instruction();
    Token o = null,dataType = null,calType = null,cmpType = null;
    Operand a = null,b = null; }
{
    ...
    o = <I_JPC> cmpType = <CMP_OP> a = Operand()
    ...
}

Operand Operand():{
    Token t ; ... }
{
     t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}

Token Identifier() : {
    Token t ; }
{
    t = <IDENTIFIER> {return t ;}
|   t = <I_CAL>      {return t ;}
|   t = <I_JPC>      {return t ;}
|   t = <CMP_OP>     {return t ;}
| ... // All other keywords
}

I would suggest excluding register names from the list of other keywords that could be used as identifiers.

If you do include <T_REGISTER> in that list, then there will be an ambiguity in operand because Operand looks like this

Operand Operand():{
    Token t ; ... }
{
     t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}

Now there is an ambiguity because

JPC Z R0

has two parses. In the context of being an operand, we want tokens like "R0" to be parsed as registers, not identifiers. Luckily JavaCC will prefer earlier choices, so this is exactly what will happen. You will get a warning from JavaCC. You can ignore the warning. (I add a comment to my source code so that other programmers don't worry.) Or you can suppress the warning with a lookahead specification.

Operand Operand():{
    Token t ; ... }
{
     LOOKAHEAD(1) t = <T_REGISTER> ...
|    t = Identifier()  ...
    ...
}
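
The JPC handling described above can be imitated in plain Java to see which interpretation each operand gets. This is only a sketch under stated assumptions: the class JpcOperandSketch, the whitespace-based lexer, and the result strings are invented for the example. Register names are left out of identifier(), following the suggestion above, and operand() tries the register alternative first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class JpcOperandSketch {
    enum Kind { I_CAL, I_JPC, CMP_OP, T_REGISTER, IDENTIFIER }

    static final class Token {
        final Kind kind;
        final String image;
        Token(Kind kind, String image) { this.kind = kind; this.image = image; }
    }

    private static final Set<String> CMP_OPS = Set.of("Z", "B", "BE", "A", "AE", "NZ");
    private static final Set<String> REGISTERS =
            Set.of("R0", "R1", "R2", "R3", "RP", "RF", "RS", "RB");

    // Keywords are tested before the catch-all IDENTIFIER rule, mirroring
    // "be sure this rule comes after all keywords".
    static Token classify(String word) {
        if (word.equals("CAL")) return new Token(Kind.I_CAL, word);
        if (word.equals("JPC")) return new Token(Kind.I_JPC, word);
        if (CMP_OPS.contains(word)) return new Token(Kind.CMP_OP, word);
        if (REGISTERS.contains(word)) return new Token(Kind.T_REGISTER, word);
        if (word.matches("[A-Za-z][A-Za-z0-9]*")) return new Token(Kind.IDENTIFIER, word);
        throw new IllegalArgumentException("Unexpected input: " + word);
    }

    static List<Token> lex(String line) {
        List<Token> tokens = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) tokens.add(classify(word));
        }
        return tokens;
    }

    // Identifier production: plain identifiers plus keywords, but no registers.
    static Token identifier(Token t) {
        switch (t.kind) {
            case IDENTIFIER: case I_CAL: case I_JPC: case CMP_OP:
                return t;
            default:
                throw new IllegalStateException("Not usable as identifier: " + t.image);
        }
    }

    // Operand production: the register alternative is tried first.
    static String operand(Token t) {
        if (t.kind == Kind.T_REGISTER) return "register " + t.image;
        return "identifier " + identifier(t).image;
    }

    // Instruction production, restricted to: <I_JPC> <CMP_OP> Operand()
    static String parseJpc(String line) {
        List<Token> ts = lex(line);
        if (ts.size() != 3 || ts.get(0).kind != Kind.I_JPC || ts.get(1).kind != Kind.CMP_OP) {
            throw new IllegalArgumentException("Not a JPC instruction: " + line);
        }
        return "JPC " + ts.get(1).image + " -> " + operand(ts.get(2));
    }

    public static void main(String[] args) {
        for (String line : new String[] {"JPC Z fred", "JPC Z JPC", "JPC Z Z", "JPC Z R0"}) {
            System.out.println(parseJpc(line));
        }
    }
}
```

Because the position in the instruction decides how a token is used, the lexer needs no extra states: the third token of "JPC Z Z" is lexed as CMP_OP and still accepted as an identifier.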






Using right context

So far all the examples have used left context. I.e. we can tell how to treat a token based solely on the sequence of tokens to its left. Let's look at a case where the interpretation of a keyword is based on the tokens to the right.

Consider this simple imperative language in which all the keywords can be used as variable names.

P -> Block <EOF>
Block -> [S Block]
S -> Assignment | IfElse
Assignment -> LHS ":=" Exp
LHS -> VarName
IfElse -> "if" Exp Block ["else" Block] "end"
Exp -> VarName
VarName -> <ID> | "if" | "else" | "end"

This grammar is unambiguous. You can make the grammar more complicated by adding new kinds of statements, expressions and left-hand sides; as long as the grammar stays unambiguous, such complications probably won't make much difference to what I'm going to say next. Feel free to experiment.

The grammar is not LL(1). There are two places where a choice must be made based on more than one future token. One is the choice between Assignment and IfElse when the next token is "if". Consider the block

a := b
if := a

vs

a := b
if q
    b := c
end

We can look ahead for a ":=" like this

void S() : {} {
    LOOKAHEAD( LHS() ":=" ) Assignment()
|
    IfElse() 
}

The other place we need to look ahead is when an "else" or an "end" is encountered at the start of a Block. Consider

if x
    end := y
    else := z
end

We can use

void Block() : {} {
    LOOKAHEAD( LHS() ":=" | "if" ) S() Block()
|
    {}
}
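
To try both lookahead decisions without running JavaCC, here is a small hand-written Java sketch of a recursive-descent parser for the grammar above. It is an approximation, not generated code: the class RightContextParser, the trace strings, and the whitespace-based tokenizer (which assumes spaces around ":=") are assumptions for the example. Block is written as a loop, which is equivalent to the right-recursive rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RightContextParser {
    private final List<String> tokens = new ArrayList<>();
    private final List<String> trace = new ArrayList<>();
    private int pos = 0;

    private RightContextParser(String source) {
        String trimmed = source.trim();
        if (!trimmed.isEmpty()) tokens.addAll(Arrays.asList(trimmed.split("\\s+")));
        tokens.add("<EOF>");
    }

    // Token at the given offset from the current position, clamped to <EOF>.
    private String peek(int offset) {
        return tokens.get(Math.min(pos + offset, tokens.size() - 1));
    }

    // VarName -> <ID> | "if" | "else" | "end": any alphabetic word qualifies.
    private static boolean isVarName(String t) {
        return t.matches("[A-Za-z]+");
    }

    // Counterpart of LOOKAHEAD( LHS() ":=" ): inspect two tokens, consume none.
    private boolean startsAssignment() {
        return isVarName(peek(0)) && peek(1).equals(":=");
    }

    private void expect(String image) {
        if (!peek(0).equals(image)) {
            throw new IllegalStateException("Expected " + image + " but found " + peek(0));
        }
        pos++;
    }

    private String varName() {
        String t = peek(0);
        if (!isVarName(t)) throw new IllegalStateException("Expected a name but found " + t);
        pos++;
        return t;
    }

    // Block -> [S Block], guarded like LOOKAHEAD( LHS() ":=" | "if" ).
    private void block() {
        while (startsAssignment() || peek(0).equals("if")) statement();
    }

    // S -> Assignment | IfElse, choosing Assignment when ":=" follows the name.
    private void statement() {
        if (startsAssignment()) {
            String lhs = varName();
            expect(":=");
            trace.add("assign " + lhs + " := " + varName());
        } else {
            expect("if");
            trace.add("if " + varName());
            block();
            if (peek(0).equals("else")) {
                pos++;
                trace.add("else");
                block();
            }
            expect("end");
            trace.add("end");
        }
    }

    // P -> Block <EOF>; returns one trace entry per recognized construct.
    static List<String> parse(String source) {
        RightContextParser p = new RightContextParser(source);
        p.block();
        p.expect("<EOF>");
        return p.trace;
    }

    public static void main(String[] args) {
        System.out.println(parse("a := b if := a"));
        System.out.println(parse("if x end := y else := z end"));
    }
}
```

The keyword-looking tokens are treated as variable names exactly when a ":=" follows them, which is the right-context decision the two LOOKAHEAD specifications express.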
