Parsing a templating language
Problem description
I'm trying to parse a templating language, and I'm having trouble correctly parsing the arbitrary HTML that can appear between tags. What I have so far is below; any suggestions? An example of a valid input would be
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
    grammar g;

    options {
      language=Java;
      output=AST;
      ASTLabelType=CommonTree;
    }

    /* LEXER RULES */
    tokens {
    }

    LD       : '{';
    RD       : '}';
    LOOP     : '#';
    END_LOOP : '/';
    PARTIAL  : '>';

    fragment DIGIT  : '0'..'9';
    fragment LETTER : ('a'..'z' | 'A'..'Z');

    IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;

    BUFFER options {greedy=false;} : ~(LD | RD)+ ;

    /* PARSER RULES */
    start   : body EOF ;
    body    : (tag | loop | partial | BUFFER)* ;
    tag     : LD! IDENT^ RD! ;
    loop    : LD! LOOP^ IDENT RD! body LD! END_LOOP! IDENT RD! ;
    partial : LD! PARTIAL^ IDENT RD! ;
    buffer  : BUFFER ;
Solution
Your lexer tokenizes independently of your parser. If your parser tries to match a BUFFER token, the lexer does not take this into account. In your case, with input like "blah blah blah", the lexer creates three IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered an LD token), an IDENT token should be created, and when you're outside a tag (i.e. you encountered an RD token), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
- create a boolean flag inside the lexer that keeps track of whether you're inside or outside a tag. This can be done inside the @lexer::members { ... } section of your grammar;
- after the lexer creates either an LD or an RD token, flip the boolean flag from the first step. This can be done in the @after{ ... } section of the lexer rules;
- before creating a BUFFER token inside the lexer, check that you're currently outside a tag. This can be done by using a semantic predicate at the start of your lexer rule.
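To see the three steps in isolation, here is a minimal hand-written sketch in plain Java. It is not what ANTLR generates, and the class name TagAwareLexer, the tokenize method, and the "TYPE(text)" string format are all made up for illustration. It only shows how a single insideTag flag lets the same characters become IDENT tokens inside a tag and one BUFFER token outside. Unlike the ANTLR grammar, this sketch keeps whitespace outside tags as part of the BUFFER text.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: a tiny lexer driven by an inside-tag flag. */
final class TagAwareLexer {

    /** Splits the input into tokens rendered as "TYPE" or "TYPE(text)" strings. */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        boolean insideTag = false; // step 1: flag tracking whether we are between '{' and '}'
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '{') {
                tokens.add("LD");
                insideTag = true;  // step 2: set the flag after an LD token
                i++;
            } else if (c == '}') {
                tokens.add("RD");
                insideTag = false; // step 2: clear the flag after an RD token
                i++;
            } else if (!insideTag) {
                // step 3: BUFFER is only produced outside a tag and runs up to the next brace
                int start = i;
                while (i < src.length() && src.charAt(i) != '{' && src.charAt(i) != '}') {
                    i++;
                }
                tokens.add("BUFFER(" + src.substring(start, i) + ")");
            } else if (c == '#') {
                tokens.add("LOOP");
                i++;
            } else if (c == '/') {
                tokens.add("END_LOOP");
                i++;
            } else if (c == '>') {
                tokens.add("PARTIAL");
                i++;
            } else if (Character.isWhitespace(c)) {
                i++; // whitespace inside a tag is skipped
            } else if (isIdentStart(c)) {
                int start = i;
                while (i < src.length() && isIdentPart(src.charAt(i))) {
                    i++;
                }
                tokens.add("IDENT(" + src.substring(start, i) + ")");
            } else {
                throw new IllegalArgumentException(
                        "Unexpected character '" + c + "' at index " + i);
            }
        }
        return tokens;
    }

    private static boolean isIdentStart(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
    }

    private static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        // "blah blah blah" sits outside any tag, so it comes out as a single BUFFER token
        System.out.println(tokenize("{#bar}blah blah blah{zed}{/bar}"));
    }
}
```

Without the flag, the text between the tags would be lexed by the identifier branch and split into separate IDENT tokens, which is the behaviour described above.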
A short demo:
    grammar g;

    options {
      output=AST;
      ASTLabelType=CommonTree;
    }

    @lexer::members {
      // true while the lexer is between an LD and the next RD
      private boolean insideTag = false;
    }

    start
      :  body EOF -> body
      ;

    body
      :  (tag | loop | partial | BUFFER)*
      ;

    tag
      :  LD IDENT RD -> IDENT
      ;

    loop
      :  LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
      ;

    partial
      :  LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
      ;

    // the @after actions update the flag once the brace token has been matched
    LD @after{insideTag=true;}  : '{';
    RD @after{insideTag=false;} : '}';

    LOOP     : '#';
    END_LOOP : '/';
    PARTIAL  : '>';
    SPACE    : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
    IDENT    : (LETTER | '_') (LETTER | '_' | DIGIT)*;

    // gated semantic predicate: BUFFER can only match outside a tag
    BUFFER   : {!insideTag}?=> ~(LD | RD)+;

    fragment DIGIT  : '0'..'9';
    fragment LETTER : ('a'..'z' | 'A'..'Z');
Note that you probably want to discard spaces between tags, so I added a SPACE rule that puts them on the hidden channel.
Test it with the following class:
    import org.antlr.runtime.*;
    import org.antlr.runtime.tree.*;
    import org.antlr.stringtemplate.*;

    public class Main {
      public static void main(String[] args) throws Exception {
        String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" +
            "This Should Be Parsed as a Buffer.{/bar2}";
        gLexer lexer = new gLexer(new ANTLRStringStream(src));
        gParser parser = new gParser(new CommonTokenStream(lexer));
        CommonTree tree = (CommonTree)parser.start().getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
      }
    }
Then generate the lexer and parser, compile everything, and run the main class:
*nix/MacOS
    java -cp antlr-3.3.jar org.antlr.Tool g.g
    javac -cp antlr-3.3.jar *.java
    java -cp .:antlr-3.3.jar Main
Windows
    java -cp antlr-3.3.jar org.antlr.Tool g.g
    javac -cp antlr-3.3.jar *.java
    java -cp .;antlr-3.3.jar Main
You'll see some DOT source printed to the console, which describes the AST of the parsed input.
(The image of the AST in the original answer was created using graphviz-dev.appspot.com.)