Parsing a templating language


Question


    I'm trying to parse a templating language, but I'm having trouble correctly parsing the arbitrary HTML that can appear between tags. What I have so far is below; any suggestions? An example of valid input would be:

    {foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
    

    And the grammar is:

    grammar g;
    
    options {
      language=Java;
      output=AST;
      ASTLabelType=CommonTree;
    }
    
    /* LEXER RULES */
    tokens {
    
    }
    
    LD  :    '{';
    RD  :    '}';
    LOOP    :    '#';  
    END_LOOP:   '/';
    PARTIAL :   '>';
    fragment DIGIT  : '0'..'9';
    fragment LETTER : ('a'..'z' | 'A'..'Z');
    IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
    BUFFER options {greedy=false;} : ~(LD | RD)+ ;
    
    /* PARSER RULES */
    start   : body EOF
    ;
    
    body    : (tag | loop | partial | BUFFER)*
    ;
    
    tag     : LD! IDENT^ RD!
    ;
    
    loop    : LD! LOOP^ IDENT RD!
      body
      LD! END_LOOP! IDENT RD!
    ;
    
    partial : LD! PARTIAL^ IDENT RD!
    ;
    
    buffer  : BUFFER 
    ;
    

    Solution

    Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this info into account. In your case with input like: "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
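To make the problem concrete, here is a minimal hand-written sketch of a context-free tokenizer. It is only an analogy: it uses a first-match regex rather than ANTLR's own prediction algorithm, the class name `NaiveTokenizerDemo` is made up for illustration, and only the LD, RD, IDENT and BUFFER rules are modelled (with BUFFER non-greedy, as in the question). Because it has no idea whether it is inside a tag, plain text gets chopped into IDENT tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not ANTLR-generated code.
class NaiveTokenizerDemo {
    // Alternatives are tried in order at every position; nothing here knows
    // whether the current position is inside or outside a tag.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<LD>\\{)|(?<RD>\\})|(?<IDENT>[A-Za-z_][A-Za-z_0-9]*)|(?<BUFFER>[^{}]+?)");

    private static final String[] NAMES = {"LD", "RD", "IDENT", "BUFFER"};

    static List<String> tokenize(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            // Report whichever named alternative produced this match.
            for (String name : NAMES) {
                if (m.group(name) != null) {
                    out.add(name + "(" + m.group(name) + ")");
                    break;
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three IDENT tokens separated by one-character BUFFER tokens,
        // instead of a single BUFFER token.
        System.out.println(tokenize("blah blah blah"));
    }
}
```

In this sketch the text never comes out as one BUFFER token, no matter what a parser sitting on top of it expects next.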

    What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered an LD token), an IDENT token should be created, and when you're outside a tag (i.e. you encountered an RD token), a BUFFER token should be created instead of an IDENT token.

    In order to implement this, you need to:

    1. create a boolean flag inside the lexer that keeps track of whether you're inside or outside a tag. This can be done inside the @lexer::members { ... } section of your grammar;
    2. after the lexer creates either an LD or an RD token, update the boolean flag from (1). This can be done in the @after{ ... } section of the lexer rules;
    3. before creating a BUFFER token inside the lexer, check if you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.
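The three steps can be sketched in plain Java, independent of ANTLR. This is a hand-written approximation, not the code ANTLR generates: the class name `TagAwareLexerDemo` and the string-based token format are invented for illustration, and unlike a real grammar it skips whitespace only inside tags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of a lexer driven by an inside/outside-tag flag.
class TagAwareLexerDemo {
    private static boolean isIdentStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    private static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        boolean insideTag = false;          // step 1: the flag
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '{') {                 // step 2: LD switches the flag on
                tokens.add("LD");
                insideTag = true;
                i++;
            } else if (c == '}') {          // step 2: RD switches it off
                tokens.add("RD");
                insideTag = false;
                i++;
            } else if (!insideTag) {        // step 3: BUFFER only outside tags
                int start = i;
                while (i < src.length() && src.charAt(i) != '{' && src.charAt(i) != '}') {
                    i++;
                }
                tokens.add("BUFFER(" + src.substring(start, i) + ")");
            } else if (c == '#') {
                tokens.add("LOOP");
                i++;
            } else if (c == '/') {
                tokens.add("END_LOOP");
                i++;
            } else if (c == '>') {
                tokens.add("PARTIAL");
                i++;
            } else if (Character.isWhitespace(c)) {
                i++;                        // ignore whitespace inside a tag
            } else if (isIdentStart(c)) {
                int start = i;
                while (i < src.length() && isIdentPart(src.charAt(i))) {
                    i++;
                }
                tokens.add("IDENT(" + src.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException("unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("{#bar}blah blah blah{zed}{/bar}"));
    }
}
```

With the flag in place, the text between `{#bar}` and `{zed}` comes out as a single `BUFFER(blah blah blah)` token, while names inside the braces are still reported as IDENT tokens.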

    A short demo:

    grammar g;
    
    options { 
      output=AST;
      ASTLabelType=CommonTree; 
    }
    
    @lexer::members {
      // true while the lexer is between an LD ('{') and the matching RD ('}')
      private boolean insideTag = false;
    }
    
    start   
      :  body EOF -> body
      ;
    
    body
      :  (tag | loop | partial | BUFFER)*
      ;
    
    tag
      :  LD IDENT RD -> IDENT
      ;
    
    loop    
      :  LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
      ;
    
    partial 
      :  LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
      ;
    
    // entering or leaving a tag updates the flag
    LD @after{insideTag=true;}  : '{';
    RD @after{insideTag=false;} : '}';
    
    LOOP     : '#';  
    END_LOOP : '/';
    PARTIAL  : '>';
    SPACE    : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
    IDENT    :  (LETTER | '_') (LETTER | '_' | DIGIT)*;
    // gated semantic predicate: BUFFER is only a candidate outside a tag
    BUFFER   : {!insideTag}?=> ~(LD | RD)+;
    
    fragment DIGIT  : '0'..'9';
    fragment LETTER : ('a'..'z' | 'A'..'Z');
    

    (Note that you probably want to discard the spaces between tags, so I added a SPACE rule that puts them on the hidden channel.)

    Test it with the following class:

    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;
    import org.antlr.stringtemplate.*;
    
    public class Main {
      public static void main(String[] args) throws Exception {
        String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" + 
                     "This Should Be Parsed as a Buffer.{/bar2}";
        gLexer lexer = new gLexer(new ANTLRStringStream(src));
        gParser parser = new gParser(new CommonTokenStream(lexer));
        CommonTree tree = (CommonTree)parser.start().getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
      }
    }
    

    Then generate the lexer and parser, compile everything, and run the main class:

    *nix/MacOS

    java -cp antlr-3.3.jar org.antlr.Tool g.g 
    javac -cp antlr-3.3.jar *.java
    java -cp .:antlr-3.3.jar Main
    

    Windows

    java -cp antlr-3.3.jar org.antlr.Tool g.g 
    javac -cp antlr-3.3.jar *.java
    java -cp .;antlr-3.3.jar Main
    

    You'll see some DOT source printed to the console, which describes the resulting AST. (The rendered image in the original answer was created using graphviz-dev.appspot.com.)
