Different lexer rules in different states


Problem description


I've been working on a parser for a template language embedded in HTML (FreeMarker). Here is an example:

${abc}
<html> 
<head> 
  <title>Welcome!</title> 
</head> 
<body> 
  <h1> 
    Welcome ${user}<#if user == "Big Joe">, our beloved 
leader</#if>! 
  </h1> 
  <p>Our latest product: 
  <a href="${latestProduct}">${latestProduct}</a>! 
</body> 
</html>

The template language appears between specific delimiters, e.g. '${' '}', '<#' '>'. All other raw text in between can be treated as one kind of token (RAW).

The key point is that the same text, e.g. an integer, means different things to the parser depending on whether it is between those tags, and thus needs to be treated as different tokens.

I've tried the following ugly implementation, with a self-defined state to indicate whether the lexer is inside those tags. As you can see, I have to check the state in almost every rule, which drives me crazy...

I also thought about the following two solutions:

  1. Use multiple lexers. I can switch between two lexers when inside/outside those tags. However, the documentation for this is poor for ANTLR3. I don't know how to let one parser share two different lexers and switch between them.

  2. Move the RAW rule up, after the NUMERICAL_ESCAPE rule. Check the state there and, if it's inside a tag, put back the token and continue trying the remaining rules. This would save lots of state checking. However, I can't find any 'put back' function, and ANTLR complains that some rules can never be matched...

Is there an elegant solution for this?

grammar freemarker_simple;

@lexer::members {
int freemarker_type = 0;
}

expression
    :   primary_expression ;

primary_expression
    :   number_literal | identifier | parenthesis | builtin_variable
    ;

parenthesis
    :   OPEN_PAREN expression CLOSE_PAREN ;

number_literal
    :   INTEGER | DECIMAL
    ;

identifier
    :   ID
    ;

builtin_variable
    :   DOT ID
    ;

string_output
    :   OUTPUT_ESCAPE expression CLOSE_BRACE
    ;

numerical_output
    :   NUMERICAL_ESCAPE expression  CLOSE_BRACE
    ;

if_expression
    :   START_TAG IF expression DIRECTIVE_END optional_block
        ( START_TAG ELSE_IF expression loose_directive_end optional_block )*
        ( END_TAG ELSE optional_block )?
        END_TAG END_IF
    ;

list    :   START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;

for_each
    :   START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;

loose_directive_end
    :   ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;

freemarker_directive
    :   ( if_expression | list | for_each  ) ;
content :   ( RAW |  string_output | numerical_output | freemarker_directive ) + ;
optional_block
    :   ( content )? ;

root    :   optional_block EOF  ;

START_TAG
    :   '<#'
        { freemarker_type = 1; }
    ;

END_TAG :   '</#'
        { freemarker_type = 1; }
    ;

DIRECTIVE_END
    :   '>'
        {
        if(freemarker_type == 0) $type=RAW;
        freemarker_type = 0;
        }
    ;
EMPTY_DIRECTIVE_END
    :   '/>'
        {
        if(freemarker_type == 0) $type=RAW;
        freemarker_type = 0;
        }
    ;

OUTPUT_ESCAPE
    :   '${'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
NUMERICAL_ESCAPE
    :   '#{'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;

IF  :   'if'
        { if(freemarker_type == 0) $type=RAW; }
    ;
ELSE    :   'else' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
ELSE_IF :   'elseif'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
LIST    :   'list'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
FOREACH :   'foreach'
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_IF  :   'if' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_LIST
    :   'list' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ; 
END_FOREACH
    :   'foreach' DIRECTIVE_END
        { if(freemarker_type == 0) $type=RAW; }
    ;


FALSE: 'false' { if(freemarker_type == 0) $type=RAW; };
TRUE: 'true' { if(freemarker_type == 0) $type=RAW; };
INTEGER: ('0'..'9')+ { if(freemarker_type == 0) $type=RAW; };
DECIMAL: INTEGER '.' INTEGER { if(freemarker_type == 0) $type=RAW; };
DOT: '.' { if(freemarker_type == 0) $type=RAW; };
DOT_DOT: '..' { if(freemarker_type == 0) $type=RAW; };
PLUS: '+' { if(freemarker_type == 0) $type=RAW; };
MINUS: '-' { if(freemarker_type == 0) $type=RAW; };
TIMES: '*' { if(freemarker_type == 0) $type=RAW; };
DIVIDE: '/' { if(freemarker_type == 0) $type=RAW; };
PERCENT: '%' { if(freemarker_type == 0) $type=RAW; };
AND: '&' | '&&' { if(freemarker_type == 0) $type=RAW; };
OR: '|' | '||' { if(freemarker_type == 0) $type=RAW; };
EXCLAM: '!' { if(freemarker_type == 0) $type=RAW; };
OPEN_PAREN: '(' { if(freemarker_type == 0) $type=RAW; };
CLOSE_PAREN: ')' { if(freemarker_type == 0) $type=RAW; };
OPEN_BRACE
    :   '{'
    { if(freemarker_type == 0) $type=RAW; }
    ;
CLOSE_BRACE
    :   '}'
    {
        if(freemarker_type == 0) $type=RAW;
        if(freemarker_type == 2) freemarker_type = 0;
    }
    ;
IN: 'in' { if(freemarker_type == 0) $type=RAW; };
AS: 'as' { if(freemarker_type == 0) $type=RAW; };
ID  :   ('A'..'Z'|'a'..'z')+
    //{ if(freemarker_type == 0) $type=RAW; }
    ;

BLANK   :   ( '\r' | ' ' | '\n' | '\t' )+
    {
        if(freemarker_type == 0) $type=RAW;
        else $channel = HIDDEN;
    }
    ;

RAW
    :   .
    ;

EDIT

I found that this problem is similar to "How do I lex this input?", where a "start condition" is needed. Unfortunately, the answer there uses a lot of predicates as well, just like my states.

Now I've tried to move the RAW rule higher with a predicate, hoping to eliminate all the state checks after the RAW rule. However, my example input fails: the end of the first line is recognized as BLANK instead of RAW, as it should be.

I guess something is wrong with the rule priority: after CLOSE_BRACE is matched, the next token is matched from the rules after the CLOSE_BRACE rule, rather than starting from the beginning again.

Is there any way to resolve this?

The new grammar, with some debug output, is below:

grammar freemarker_simple;

@lexer::members {
int freemarker_type = 0;
}

expression
    :   primary_expression ;

primary_expression
    :   number_literal | identifier | parenthesis | builtin_variable
    ;

parenthesis
    :   OPEN_PAREN expression CLOSE_PAREN ;

number_literal
    :   INTEGER | DECIMAL
    ;

identifier
    :   ID
    ;

builtin_variable
    :   DOT ID
    ;

string_output
    :   OUTPUT_ESCAPE expression CLOSE_BRACE
    ;

numerical_output
    :   NUMERICAL_ESCAPE expression  CLOSE_BRACE
    ;

if_expression
    :   START_TAG IF expression DIRECTIVE_END optional_block
        ( START_TAG ELSE_IF expression loose_directive_end optional_block )*
        ( END_TAG ELSE optional_block )?
        END_TAG END_IF
    ;

list    :   START_TAG LIST expression AS ID DIRECTIVE_END optional_block END_TAG END_LIST ;

for_each
    :   START_TAG FOREACH ID IN expression DIRECTIVE_END optional_block END_TAG END_FOREACH ;

loose_directive_end
    :   ( DIRECTIVE_END | EMPTY_DIRECTIVE_END ) ;

freemarker_directive
    :   ( if_expression | list | for_each  ) ;
content :   ( RAW |  string_output | numerical_output | freemarker_directive ) + ;
optional_block
    :   ( content )? ;

root    :   optional_block EOF  ;

START_TAG
    :   '<#'
        { freemarker_type = 1; }
    ;

END_TAG :   '</#'
        { freemarker_type = 1; }
    ;

OUTPUT_ESCAPE
    :   '${'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
NUMERICAL_ESCAPE
    :   '#{'
        { if(freemarker_type == 0) freemarker_type = 2; }
    ;
RAW
    :
        { freemarker_type == 0 }?=> .
        {System.out.printf("RAW \%s \%d\n",getText(),freemarker_type);}
    ;

DIRECTIVE_END
    :   '>'
        { if(freemarker_type == 1) freemarker_type = 0; }
    ;
EMPTY_DIRECTIVE_END
    :   '/>'
        { if(freemarker_type == 1) freemarker_type = 0; }
    ;

IF  :   'if'

    ;
ELSE    :   'else' DIRECTIVE_END

    ; 
ELSE_IF :   'elseif'

    ; 
LIST    :   'list'

    ; 
FOREACH :   'foreach'

    ; 
END_IF  :   'if' DIRECTIVE_END
    ; 
END_LIST
    :   'list' DIRECTIVE_END
    ; 
END_FOREACH
    :   'foreach' DIRECTIVE_END
    ;


FALSE: 'false' ;
TRUE: 'true' ;
INTEGER: ('0'..'9')+ ;
DECIMAL: INTEGER '.' INTEGER ;
DOT: '.' ;
DOT_DOT: '..' ;
PLUS: '+' ;
MINUS: '-' ;
TIMES: '*' ;
DIVIDE: '/' ;
PERCENT: '%' ;
AND: '&' | '&&' ;
OR: '|' | '||' ;
EXCLAM: '!' ;
OPEN_PAREN: '(' ;
CLOSE_PAREN: ')' ;
OPEN_BRACE
    :   '{'
    ;
CLOSE_BRACE
    :   '}'
    { if(freemarker_type == 2) {freemarker_type = 0;} }
    ;
IN: 'in' ;
AS: 'as' ;
ID  :   ('A'..'Z'|'a'..'z')+
    { System.out.printf("ID \%s \%d\n",getText(),freemarker_type);}
    ;

BLANK   :   ( '\r' | ' ' | '\n' | '\t' )+
    {
        System.out.printf("BLANK \%d\n",freemarker_type);
        $channel = HIDDEN;
    }
    ;

My input produces this output:

ID abc 2
BLANK 0  <<< incorrect, should be RAW when state==0
RAW < 0  <<< correct
ID html 0 <<< incorrect, should be RAW RAW RAW RAW
RAW > 0

EDIT 2

I also tried the second approach with Bart's grammar, and it still didn't work: 'html' is recognized as an ID, when it should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or does the lexer still choose the longest match here?

grammar freemarker_bart;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@parser::members {

  // merge a given list of tokens into a single AST
  private CommonTree merge(List tokenList) {
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < tokenList.size(); i++) {
      Token token = (Token)tokenList.get(i);
      b.append(token.getText());
    }
    return new CommonTree(new CommonToken(RAW, b.toString()));
  }
}

@lexer::members {
  private boolean mmode = false;
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

raw_block
  :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)* 
  ;

atom
  :  STRING
  |  ID
  ;

// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

RAW           : {!mmode}?=> . ;

// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END    : '}' {mmode=false;};
TAG_END       : '>' {mmode=false;};

// valid tokens only when in "markup mode"
EQUALS        : '==';
IF            : 'if';
STRING        : '"' ~'"'* '"';
ID            : ('a'..'z' | 'A'..'Z')+;
SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};

Solution

You could let lexer rules match using gated semantic predicates where you test for a certain boolean expression.
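To make the idea concrete outside of ANTLR, here is a minimal hand-written sketch in plain Java. It is not generated code and does not use the ANTLR runtime; the class name GatedLexerSketch and the "TYPE:text" token strings are assumptions made for illustration only. It shows a rule being tried only when a boolean flag (the counterpart of mmode in the grammar) is true, with a catch-all RAW rule tried last:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it only imitates gated lexer rules.
final class GatedLexerSketch {

    // Same character set as the ID rule: 'a'..'z' | 'A'..'Z'
    private static boolean isAsciiLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Returns tokens as "TYPE:text" strings.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        boolean mmode = false; // true while inside ${ ... }
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (input.startsWith("${", i)) {
                // not gated: opens markup mode
                tokens.add("OUTPUT_START:${");
                mmode = true;
                i += 2;
            } else if (mmode && c == '}') {
                // gated: only closes markup mode when we are in it
                tokens.add("OUTPUT_END:}");
                mmode = false;
                i++;
            } else if (mmode && isAsciiLetter(c)) {
                // gated: letters form an ID only in markup mode
                int start = i;
                while (i < input.length() && isAsciiLetter(input.charAt(i))) {
                    i++;
                }
                tokens.add("ID:" + input.substring(start, i));
            } else if (mmode && Character.isWhitespace(c)) {
                i++; // gated: whitespace is skipped only in markup mode
            } else {
                // catch-all, tried last: one character of raw text
                tokens.add("RAW:" + c);
                i++;
            }
        }
        return tokens;
    }
}
```

For example, GatedLexerSketch.tokenize("ab${ab}") yields RAW:a, RAW:b, OUTPUT_START:${, ID:ab, OUTPUT_END:}, so the same letters are RAW outside the delimiters and an ID inside them.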

A little demo:

freemarker_simple.g

grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@parser::members {

  // merge a given list of tokens into a single AST
  private CommonTree merge(List tokenList) {
    StringBuilder b = new StringBuilder();
    for(int i = 0; i < tokenList.size(); i++) {
      Token token = (Token)tokenList.get(i);
      b.append(token.getText());
    }
    return new CommonTree(new CommonToken(RAW, b.toString()));
  }
}

@lexer::members {
  private boolean mmode = false;
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  (options {greedy=true;}: t+=RAW)+ -> ^(RAW_BLOCK {merge($t)})
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END raw_block TAG_END_START IF TAG_END -> ^(IF expression raw_block)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

raw_block
  :  (t+=RAW)* -> ^(RAW_BLOCK {merge($t)})
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)* 
  ;

atom
  :  STRING
  |  ID
  ;

// these tokens denote the start of markup code (sets mmode to true)
OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

// these tokens denote the end of markup code (sets mmode to false)
OUTPUT_END    : {mmode}?=> '}' {mmode=false;};
TAG_END       : {mmode}?=> '>' {mmode=false;};

// valid tokens only when in "markup mode"
EQUALS        : {mmode}?=> '==';
IF            : {mmode}?=> 'if';
STRING        : {mmode}?=> '"' ~'"'* '"';
ID            : {mmode}?=> ('a'..'z' | 'A'..'Z')+;
SPACE         : {mmode}?=> (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : . ;

which parses your input:

test.html

${abc}
<html> 
<head> 
  <title>Welcome!</title> 
</head> 
<body> 
  <h1> 
    Welcome ${user}<#if user == "Big Joe">, our beloved leader</#if>! 
  </h1> 
  <p>Our latest product: <a href="${latestProduct}">${latestProduct}</a>!</p>
</body> 
</html>

into an AST, which you can inspect yourself with the class:

Main.java

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
  public static void main(String[] args) throws Exception {
    freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRFileStream("test.html"));
    freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
    CommonTree tree = (CommonTree)parser.parse().getTree();
    DOTTreeGenerator gen = new DOTTreeGenerator();
    StringTemplate st = gen.toDOT(tree);
    System.out.println(st);
  }
}


EDIT 1

When I run your example input with a parser generated from the second grammar you posted, the following are the first 5 lines printed to the console (not counting the many warnings that are generated):

ID abc 2
RAW 
 0
RAW < 0
ID html 0
...


EDIT 2

Bood wrote:

Also tried the 2nd approach with Bart's grammar, still didn't work the 'html' is recognized as an ID, which should be 4 RAWs. When mmode=false, shouldn't RAW get matched first? Or the lexer still chooses the longest match here?

Yes, that is correct: ANTLR chooses the longer match in that case.
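The effect can be illustrated with a simplified model in plain Java. This is only a sketch of longest-match selection, not ANTLR's actual prediction algorithm; the class LongestMatchSketch and its methods are made up for this illustration. Every rule is tried at the current position, and the rule that consumes the most characters wins, even if a shorter rule is listed first:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simplified model of "the longer match wins".
final class LongestMatchSketch {

    // Tries every rule at 'pos' and returns "RULE:text" for the longest match.
    // On a tie, the rule listed first is kept.
    static String pick(Map<String, Pattern> rules, String input, int pos) {
        String bestRule = null;
        int bestLen = 0;
        for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
            Matcher m = rule.getValue().matcher(input);
            m.region(pos, input.length());
            if (m.lookingAt() && m.end() - pos > bestLen) {
                bestLen = m.end() - pos;
                bestRule = rule.getKey();
            }
        }
        if (bestRule == null) {
            throw new IllegalArgumentException("no rule matches at " + pos);
        }
        return bestRule + ":" + input.substring(pos, pos + bestLen);
    }

    // RAW is listed before ID, mirroring the grammar in the question.
    static Map<String, Pattern> demoRules() {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("RAW", Pattern.compile(".", Pattern.DOTALL));
        rules.put("ID", Pattern.compile("[A-Za-z]+"));
        return rules;
    }
}
```

With these two rules, pick(demoRules(), "<html>", 1) returns ID:html rather than RAW:h, because ID consumes four characters and RAW only one. That matches the "ID html 0" line in the debug output.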

But now that I (finally :)) see what you're trying to do, here's a last proposal: you could let the RAW rule match characters as long as the rule can't see one of the following character sequences ahead: "<#", "</#" or "${". Note that the rule must still stay at the end in the grammar. This check is performed inside the lexer. Also, in that case you don't need the merge(...) method in the parser:

grammar freemarker_simple;

options {
  output=AST;
  ASTLabelType=CommonTree;
}

tokens {
  FILE;
  OUTPUT;
  RAW_BLOCK;
}

@lexer::members {
  
  private boolean mmode = false;
  
  private boolean rawAhead() {
    if(mmode) return false;
    int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
    return !(
        (ch1 == '<' && ch2 == '#') ||
        (ch1 == '<' && ch2 == '/' && ch3 == '#') ||
        (ch1 == '$' && ch2 == '{')
    );
  }
}

parse
  :  content* EOF -> ^(FILE content*)
  ;

content
  :  RAW
  |  if_stat
  |  output
  ;

if_stat
  :  TAG_START IF expression TAG_END RAW TAG_END_START IF TAG_END -> ^(IF expression RAW)
  ;

output
  :  OUTPUT_START expression OUTPUT_END -> ^(OUTPUT expression)
  ;

expression
  :  eq_expression
  ;

eq_expression
  :  atom (EQUALS^ atom)*
  ;

atom
  :  STRING
  |  ID
  ;

OUTPUT_START  : '${'  {mmode=true;};
TAG_START     : '<#'  {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});

OUTPUT_END    : '}' {mmode=false;};
TAG_END       : '>' {mmode=false;};

EQUALS        : '==';
IF            : 'if';
STRING        : '"' ~'"'* '"';
ID            : ('a'..'z' | 'A'..'Z')+;
SPACE         : (' ' | '\t' | '\r' | '\n')+ {skip();};

RAW           : ({rawAhead()}?=> . )+;
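The lookahead test behind rawAhead() can also be tried on its own, without ANTLR. The following is a minimal sketch in plain Java under a few assumptions: the class RawAheadSketch and its method names are hypothetical, it works on a String instead of ANTLR's character stream, and it leaves out the mmode flag, so it only models the raw-text side:

```java
// Hypothetical stand-alone sketch of the lookahead check used by the RAW rule.
final class RawAheadSketch {

    // True if one of the markup openers "<#", "</#" or "${" starts at index i.
    static boolean markupAhead(String s, int i) {
        return s.startsWith("<#", i)
            || s.startsWith("</#", i)
            || s.startsWith("${", i);
    }

    // Returns the end index (exclusive) of the run of raw text that starts
    // at 'start': characters are consumed until a markup opener is ahead
    // or the input ends.
    static int rawEnd(String s, int start) {
        int i = start;
        while (i < s.length() && !markupAhead(s, i)) {
            i++;
        }
        return i;
    }
}
```

For the input "<h1>Welcome ${user}", rawEnd(input, 0) returns 12, so the whole prefix "<h1>Welcome " becomes a single run of raw text instead of one token per character. That is why the merge(...) helper is no longer needed in this version.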

The grammar above will produce an AST from the input posted at the start of this answer; the AST image is not reproduced here.
