Switching to island mode on a multi-character token

Question

I am working on a grammar that is basically an island grammar.

Let's say the "island" is everything between braces and the "sea" is everything that is not. Like this:

{(island content)}

Then this simple grammar works:

IslandStart
:
    '{' -> pushMode(Island)
;

Fluff
:
    ~[\{\}]+
;

....

But I'm having trouble coming up with a similar solution for the case where I want a complex (multi-character) opening for my "island" block, like this:

{#(island content)}

In this case I don't know how to write a rule for "Fluff" (everything but my opening sequence).

IslandStart
    :
        '{#' -> pushMode(Island)
    ;

Fluff
    :
        ~[\{\}]+ /* Should now include opening braces as well
                    if they are not immediately followed by a # sign */
    ;

How do I make this work?

GRosenberg came up with a solution, but with it I get a lot of tokens (one per character). Here is an example that demonstrates the behaviour:

My lexer grammar:

lexer grammar Demolex;

IslandStart
    :
        '{$' -> pushMode(Island)
    ;


Fluff
    : 
          '{' ~'$' .* // any 2+ char seq that starts with '{', but not '{$'
        | '{' '$$' .* // starts with hypothetical not IslandStart marker
        | '{'         // just the 1 char 
        | .*? ~'{'    // minimum sequence that ends before an '{'
    ;

mode Island;

IslandEnd
    :
        '}' -> popMode
    ;

The simplest parser grammar:

grammar Demo;
options { tokenVocab = Demolex; }

template
    :
        Fluff+
    ;

When I debug it in the antlr4 plugin for Eclipse, this generates a tree with a lot of tokens from the input "somanytokens":

It's not likely to be a plugin problem. I can easily come up with a token definition that will result in one big fat token in the tree.

Actually, even the simplest form of the grammar gives this result:

grammar Demo2;

template4
    :
        Fluff+
    ;

Fluff
    : 
         .*? ~'{'    // minimum sequence that ends before an '{'
    ;

Recommended answer

You just need to specify the complement of the sequence difference:

IslandStart : '{#' -> pushMode(Island) ;

Fluff       : '{' ~'#' .* // any 2+ char seq that starts with '{', but not '{#'
            | '{' '##' .* // starts with hypothetical not IslandStart marker
            | '{'         // just the 1 char 
            | .*? ~'{'    // minimum sequence that ends before an '{'
            ;

Fluff alt2 works when it is the longer match relative to IslandStart. Fluff alt3 works only when neither IslandStart nor Fluff alt1 matches a character sequence starting with '{'. Fluff alt4 is the catch-all for content up to but not including a '{', which lets the lexer consider sequences aligned on a '{'.

Update

Let's make it a more reasonably complete example grammar:

parser grammar TestParser;

options{
    tokenVocab=TestLexer;
}

template : ( Fluff | Stuff )+ EOF ;

lexer grammar TestLexer;

IslandStart : '{' '$' -> pushMode(Island),more ;

Fluff : '{' ~'$' ~'{'*? '}'     // any 2+ char seq that starts with '{', but not '{$'
      | '{' '$' '$' ~'{'*? '}'  // or starts with hypothetical not IslandStart marker
      | '{' '}'                 // just the empty pair
      | ~'{'+                   // minimum sequence that ends before an '{'
      ;

mode Island;

Stuff : '}' -> popMode ;
Char  : .   -> more    ;

Input: so{$Island}many{}tokens{$$notIsland}and{inner}end

Token dump:

Fluff: [@0,0:1='so',<1>,1:0]
Stuff: [@1,2:10='{$Island}',<2>,1:2]
Fluff: [@2,11:14='many',<1>,1:11]
Fluff: [@3,15:16='{}',<1>,1:15]
Fluff: [@4,17:22='tokens',<1>,1:17]
Fluff: [@5,23:35='{$$notIsland}',<1>,1:23]
Fluff: [@6,36:38='and',<1>,1:36]
Fluff: [@7,39:45='{inner}',<1>,1:39]
Fluff: [@8,46:48='end',<1>,1:46]

Parse tree:

(template so {$Island} many {} tokens {$$notIsland} and {inner} end <EOF>)
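
To make the rule selection above easier to follow, here is a rough Python sketch that emulates the example lexer with regular expressions. It is not ANTLR and makes no claim to reproduce ANTLR's internals: the `tokenize` function, the tuple layout, and the error handling are all assumptions made for illustration. It only mirrors two ideas from the grammar: the longest match wins between IslandStart and Fluff, and the `more` command folds the island opener and body into the closing Stuff token.

```python
import re

# Each Fluff alternative from TestLexer, rewritten as a Python regex (an approximation).
FLUFF_ALTS = [
    re.compile(r"\{[^$][^{]*?\}"),  # '{' ~'$' ~'{'*? '}'
    re.compile(r"\{\$\$[^{]*?\}"),  # '{' '$' '$' ~'{'*? '}'
    re.compile(r"\{\}"),            # '{' '}'
    re.compile(r"[^{]+"),           # ~'{'+
]
ISLAND_START = "{$"


def tokenize(text):
    """Return (rule, text, start, stop) tuples; stop is inclusive, as in the token dump."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Length of the longest Fluff alternative that matches at pos (0 if none does).
        fluff_len = 0
        for rx in FLUFF_ALTS:
            m = rx.match(text, pos)
            if m:
                fluff_len = max(fluff_len, m.end() - pos)
        island_len = len(ISLAND_START) if text.startswith(ISLAND_START, pos) else 0

        if island_len > 0 and island_len >= fluff_len:
            # IslandStart wins: emulate pushMode(Island) plus `more` by
            # extending the pending text up to and including the closing '}'.
            end = text.find("}", pos + island_len)
            if end == -1:
                raise ValueError("unterminated island starting at %d" % pos)
            tokens.append(("Stuff", text[pos:end + 1], pos, end))
            pos = end + 1
        elif fluff_len > 0:
            tokens.append(("Fluff", text[pos:pos + fluff_len], pos, pos + fluff_len - 1))
            pos += fluff_len
        else:
            raise ValueError("no rule matches at %d" % pos)
    return tokens


for tok in tokenize("so{$Island}many{}tokens{$$notIsland}and{inner}end"):
    print(tok)
```

On the sample input this sketch yields the same nine tokens, with the same start and stop offsets, as the dump above. Note that `{$$notIsland}` goes to Fluff because the second alternative matches 13 characters while IslandStart matches only 2.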

Operation of the lexer rules remains the same. Changes were made to accommodate the closing '}' match terminals. Alt4, as simplified, works as originally intended. I'm not entirely sure why it was a problem for ANTLR to begin with, but simpler is better in any case.
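
One plausible reading of the one-token-per-character result is that a non-greedy `.*?` stops as soon as the rest of the rule can match, so `.*? ~'{'` is already satisfied after a single non-brace character. The Python snippet below illustrates the same effect with the `re` module. It is only an analogy, on the assumption that Python's lazy quantifier behaves like ANTLR's non-greedy subrule in this simple case.

```python
import re

text = "somanytokens"

# Analogue of the original alt4:  .*? ~'{'
# The lazy .*? matches nothing, then [^{] consumes exactly one character.
lazy = re.findall(r".*?[^{]", text)

# Analogue of the simplified alt4:  ~'{'+
# The greedy + consumes every character up to the next '{' (or end of input).
greedy = re.findall(r"[^{]+", text)

print(lazy)
print(greedy)
```

Here `lazy` holds twelve one-character matches while `greedy` holds the single match `somanytokens`, which lines up with the behaviour reported in the question and with the simplified alternative in the update.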
