Switching to island mode on multi-character token

Question

I am working on a grammar that is basically an island grammar.

Let's say the "island" is everything between braces and the "sea" is everything that is not. Like this:

{ (island content) }

Then this simple grammar works:

IslandStart
:
    '{' -> pushMode(Island)
;

Fluff
:
    ~[\{\}]+
;

....

But I'm having trouble coming up with a similar solution for the case where I want a complex (multi-character) opening for my "island" block, like this:

{# (island content) }

In this case I don't know how to write a rule for "Fluff" (everything except my opening sequence).

IslandStart
    :
        '{#' -> pushMode(Island)
    ;

Fluff
    :
        ~[\{\}]+ /* Should now include opening braces as well
                    if they are not immediately followed by a # sign */
    ;

How do I make it work?

GRosenberg came up with a solution, but with it I get a lot of tokens (one per character). Here is an example that demonstrates this behaviour:

My lexer grammar:

lexer grammar Demolex;

IslandStart
    :
        '{$' -> pushMode(Island)
    ;


Fluff
    : 
          '{' ~'$' .* // any 2+ char seq that starts with '{', but not '{$'
        | '{' '$$' .* // starts with hypothetical not IslandStart marker
        | '{'         // just the 1 char 
        | .*? ~'{'    // minimum sequence that ends before an '{'
    ;

mode Island;

IslandEnd
    :
        '}' -> popMode
    ;

The simplest parser grammar:

grammar Demo;
options { tokenVocab = Demolex; }

template
    :
        Fluff+
    ;

When I debug it in the antlr4 plugin for Eclipse, this generates a tree with a lot of tokens from the input "somanytokens".

It's not likely to be a plugin problem. I can easily come up with a token definition that results in one big fat token in the tree.

Actually, even the simplest form of the grammar gives this result:

grammar Demo2;

template4
    :
        Fluff+
    ;

Fluff
    : 
         .*? ~'{'    // minimum sequence that ends before an '{'
    ;

Answer

You just need to specify the complement of the sequence difference:

IslandStart : '{#' -> pushMode(Island) ;

Fluff       : '{' ~'#' .* // any 2+ char seq that starts with '{', but not '{#'
            | '{' '##' .* // starts with hypothetical not IslandStart marker
            | '{'         // just the 1 char 
            | .*? ~'{'    // minimum sequence that ends before an '{'
            ;

Fluff alt2 works when it is the longer match relative to IslandStart. Fluff alt3 works only when IslandStart and Fluff alt1 do not match a character sequence starting with '{'. Fluff alt4 is the catchall for content up to but not including a '{', allowing the lexer to consider sequences aligned on a '{'.

Update

Let's make it a more reasonably complete example grammar:

parser grammar TestParser;

options{
    tokenVocab=TestLexer;
}

template : ( Fluff | Stuff )+ EOF ;

lexer grammar TestLexer;

IslandStart : '{' '$' -> pushMode(Island),more ;

Fluff : '{' ~'$' ~'{'*? '}'     // any 2+ char seq that starts with '{', but not '{$'
      | '{' '$' '$' ~'{'*? '}'  // or starts with hypothetical not IslandStart marker
      | '{' '}'                 // just the empty pair
      | ~'{'+                   // minimum sequence that ends before an '{'
      ;

mode Island;

Stuff : '}' -> popMode ;
Char  : .   -> more    ;

Input: so{$Island}many{}tokens{$$notIsland}and{inner}end

Token dump:

Fluff: [@0,0:1='so',<1>,1:0]
Stuff: [@1,2:10='{$Island}',<2>,1:2]
Fluff: [@2,11:14='many',<1>,1:11]
Fluff: [@3,15:16='{}',<1>,1:15]
Fluff: [@4,17:22='tokens',<1>,1:17]
Fluff: [@5,23:35='{$$notIsland}',<1>,1:23]
Fluff: [@6,36:38='and',<1>,1:36]
Fluff: [@7,39:45='{inner}',<1>,1:39]
Fluff: [@8,46:48='end',<1>,1:46]

Parse tree:

(template so {$Island} many {} tokens {$$notIsland} and {inner} end <EOF>)
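For readers who want to experiment without an ANTLR toolchain, the rule set above can be approximated in plain Python. This is only a rough sketch, not ANTLR: the regular expressions, the `tokenize` function, and the two rule tables are hypothetical stand-ins that assume "longest match wins, earlier rule wins ties" plus a single level of pushMode/popMode and `more`.

```python
import re

# Hypothetical regex stand-ins for the lexer rules above (not ANTLR itself).
DEFAULT_RULES = [
    ("IslandStart", re.compile(r"\{\$")),           # '{' '$' -> pushMode(Island), more
    ("Fluff", re.compile(r"\{[^$][^{]*?\}")),       # alt1
    ("Fluff", re.compile(r"\{\$\$[^{]*?\}")),       # alt2
    ("Fluff", re.compile(r"\{\}")),                 # alt3
    ("Fluff", re.compile(r"[^{]+")),                # alt4
]
ISLAND_RULES = [
    ("Stuff", re.compile(r"\}")),                   # '}' -> popMode
    ("Char", re.compile(r".", re.DOTALL)),          # .   -> more
]


def tokenize(text):
    """Return (type, text) pairs using longest-match rule selection."""
    tokens = []
    mode = DEFAULT_RULES
    pos = 0
    start = 0  # where the token currently being accumulated began
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in mode:
            m = pattern.match(text, pos)
            # Strict '>' keeps the earlier rule when two matches tie.
            if m and m.end() > best_end:
                best_name, best_end = name, m.end()
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        pos = best_end
        if best_name == "IslandStart":
            mode = ISLAND_RULES          # enter the island, keep accumulating
        elif best_name == "Char":
            continue                     # 'more': keep accumulating
        else:
            if best_name == "Stuff":
                mode = DEFAULT_RULES     # leave the island (one level only)
            tokens.append((best_name, text[start:pos]))
            start = pos
    return tokens


if __name__ == "__main__":
    sample = "so{$Island}many{}tokens{$$notIsland}and{inner}end"
    for kind, value in tokenize(sample):
        print(f"{kind}: {value!r}")
```

On the sample input this sketch yields the same nine token texts and types as the dump above. It ignores token indices, line and column information, and nested modes.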

Operation of the lexer rules remains the same. Changes were made to accommodate the closing-brace ('}') match terminals. Alt4, as simplified, works as originally intended. Not entirely sure why it was a problem for Antlr to begin with, but simpler is better in any case.
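One plausible reading of the one-token-per-character result: a non-greedy subrule such as `.*?` consumes as few characters as it can, so `.*? ~'{'` is already satisfied by a single non-brace character. The sketch below shows the same effect with Python's `re` module. It is an analogy only, since Python regexes are not the ANTLR lexer, and `split_with` is a hypothetical helper written for this illustration.

```python
import re

text = "somanytokens"

# Lazy prefix followed by one non-'{' character: the shortest match wins.
lazy = re.compile(r".*?[^{]", re.DOTALL)
# Greedy run of non-'{' characters, like the simplified alt4 (~'{'+).
greedy = re.compile(r"[^{]+")


def split_with(pattern, s):
    """Repeatedly match `pattern` at the current offset and collect the pieces."""
    pieces, pos = [], 0
    while pos < len(s):
        m = pattern.match(s, pos)
        if m is None:
            break
        pieces.append(m.group())
        pos = m.end()
    return pieces


lazy_tokens = split_with(lazy, text)
greedy_tokens = split_with(greedy, text)

print(len(lazy_tokens))   # 12, one piece per character
print(greedy_tokens)      # ['somanytokens']
```

The greedy form is the counterpart of the simplified `~'{'+` alternative, which is why the updated grammar produces a single Fluff token for each run of sea text.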
