Working example of wikitext-to-HTML in ANTLR 3


Question

I'm trying to flesh out a wikitext-to-HTML translator in ANTLR 3, but I keep getting stuck.

Do you know of a working example that I can inspect? I tried the MediaWiki ANTLR grammar and the Wiki Creole grammar, but I can't get them to generate the lexer & parser in ANTLR 3.

Here are the links to two grammars I've tried using:

I can't get either of these to generate my Java lexer and parser. (I'm using ANTLR 3 as an Eclipse plugin.) The MediaWiki grammar takes a looong time to build and then at some point it throws an OutOfMemory exception. The other one has errors in it which I don't know how to debug.

Okay, I have a very basic grammar:

grammar wikitext;

options {
  //output = AST;
  //ASTLabelType = CommonTree;
  output = template;
  language = Java;
}

document: line (NL line?)*;

line: horizontal_line | list | heading | paragraph;

/* horizontal line */
horizontal_line: HRLINE;

/* lists */
list: unordered_list | ordered_list;

unordered_list: '*'+ content;
ordered_list: '#'+ content;

/* Headings */
heading: heading1 | heading2 | heading3 | heading4 | heading5 | heading6;
heading1: H1 plain H1;
heading2: H2 plain H2;
heading3: H3 plain H3;
heading4: H4 plain H4;
heading5: H5 plain H5;
heading6: H6 plain H6;

/* Paragraph */
paragraph: content;

content: (formatted | link)+;

/* links */
link: external_link | internal_link;

external_link: '[' external_link_uri ('|' external_link_title)? ']';
internal_link: '[[' internal_link_ref ('|' internal_link_title)? ']]' ;

external_link_uri: CHARACTER+;
external_link_title: plain;
internal_link_ref: plain;
internal_link_title: plain;

/* bold & italic */
formatted: bold_italic | bold | italic | plain;

bold_italic: BOLD_ITALIC plain BOLD_ITALIC;
bold: BOLD plain BOLD;
italic: ITALIC plain ITALIC;

/* Plain text */
plain: (CHARACTER | SPACE)+;


/**
 * LEXER RULES
 * --------------------------------------------------------------------------
 */

HRLINE: '---' '-'+;

H1: '=';
H2: '==';
H3: '===';
H4: '====';
H5: '=====';
H6: '======';

BOLD_ITALIC: '\'\'\'\'\'';
BOLD: '\'\'\'';
ITALIC: '\'\'';

NL: '\r'?'\n';

CHARACTER       :       '!' | '"' | '#' | '$' | '%' | '&'
                |       '*' | '+' | ',' | '-' | '.' | '/'
                |       ':' | ';' | '?' | '@' | '\\' | '^' | '_' | '`' | '~'
                |       '0'..'9' | 'A'..'Z' |'a'..'z' 
                |       '\u0080'..'\u7fff'
                |       '(' | ')'
                |       '\'' | '<' | '>' | '=' | '[' | ']' | '|' 
                ;

SPACE: ' ' | '\t';

It's not clear to me, though, how one would go about outputting HTML. I've been looking into StringTemplate, but I don't understand how to structure my templates. Specifically, which template goes where in the grammar? Can you help me with a short example?

Recommended answer

Okay, after your EDIT, I have a couple of recommendations.

Like I said in the comments, writing a grammar for such a language is nearly impossible. At least, trying to do so in one go, that is. The only way I see this working would be to do this with multiple parsers where the first "parsing-stage" would parse the wiki-source very "coarsely". For example: a table would be tokenized as: TABLE : '{|' .* '|}' and then you'd create another parser that parses this table properly. Doing it in one parser will result in quite a few ambiguities in your parser rules IMO.
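To make the "coarse first stage" idea concrete, here is a minimal sketch in plain Java (no ANTLR involved). It only illustrates the splitting step: the class name CoarsePass, the Segment type and the split method are hypothetical names invented for this example, and the regular expression stands in for the TABLE : '{|' .* '|}' lexer rule. It assumes tables are not nested; a nested '{|' would be cut at the first '|}'.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoarsePass {

    // Hypothetical segment kinds: a raw table block, or everything else.
    enum Kind { TABLE, TEXT }

    static final class Segment {
        final Kind kind;
        final String text;

        Segment(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Non-greedy match of '{|' ... '|}' across newlines,
    // in the spirit of the lexer rule TABLE : '{|' .* '|}'.
    private static final Pattern TABLE =
            Pattern.compile("\\{\\|.*?\\|\\}", Pattern.DOTALL);

    // Stage 1: cut the source into coarse segments without
    // looking inside the tables at all.
    static List<Segment> split(String source) {
        List<Segment> out = new ArrayList<>();
        Matcher m = TABLE.matcher(source);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                out.add(new Segment(Kind.TEXT, source.substring(pos, m.start())));
            }
            out.add(new Segment(Kind.TABLE, m.group()));
            pos = m.end();
        }
        if (pos < source.length()) {
            out.add(new Segment(Kind.TEXT, source.substring(pos)));
        }
        return out;
    }

    public static void main(String[] args) {
        String source = "intro\n{|\n| cell\n|}\noutro";
        for (Segment s : split(source)) {
            // Stage 2 would hand each TABLE segment to a dedicated table parser.
            System.out.println(s.kind + ": " + s.text.replace("\n", "\\n"));
        }
    }
}
```

Each TABLE segment could then be fed to a second, table-specific parser, while the TEXT segments go to the parser that handles headings, lists and inline formatting.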

About emitting HTML code, the "proper" way to do this is indeed with StringTemplate, but given the fact that you're rather new to ANTLR itself, I'd keep things simple. You could create a StringBuilder attribute in your parser class that would collect all your HTML code as you parse your source file. You can embed code in ANTLR rules by wrapping it with { and }.

Here's a quick demo:

grammar T;

@parser::members {

  // an attribute that is only available in your 
  // parser (so only in parser rules!)
  protected StringBuilder htmlBuilder = new StringBuilder();
}

// Parser rules
parse
  :  atom+ EOF
  ;

atom
  :  header
  |  Any    {htmlBuilder.append($Any.text);} // append the text from 'Any' token
  ;

header
  :  H3 h3Content H3 {htmlBuilder.append("<h3>" + $h3Content.text + "</h3>");}
  |  H2 h2Content H2 {htmlBuilder.append("<h2>" + $h2Content.text + "</h2>");}
  |  H1 h1Content H1 {htmlBuilder.append("<h1>" + $h1Content.text + "</h1>");}
  ;

h3Content : ~H3*; // match any token except H3, zero or more times
h2Content : ~H2*; //        "               H2          "
h1Content : ~H1*; //        "               H1          "

// Lexer rules    
H3 : '===';
H2 : '==';
H1 : '=';

// Fall-through rule: if none of the above 
// lexer rules matched, this one will.
Any
  :  .
  ;

From that grammar, you generate a parser and lexer:

java -cp antlr-3.2.jar org.antlr.Tool T.g

and then create a little class to test your parser:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {

        // the source to be parsed
        String source = 
                "= header 1 =             \n"+
                "                         \n"+
                "some text here           \n"+
                "                         \n"+
                "=== header level 3 ===   \n"+
                "                         \n"+
                "and some more text         ";

        ANTLRStringStream in = new ANTLRStringStream(source);
        TLexer lexer = new TLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TParser parser = new TParser(tokens);

        // invoke the start-rule in your parser
        parser.parse();

        // print the contents of your parser's StringBuilder
        System.out.println(parser.htmlBuilder);
    }
}

Then compile all source files:

javac -cp antlr-3.2.jar *.java

Finally, run your main class:

// *nix & MacOS
java -cp .:antlr-3.2.jar Main

// Windows
java -cp .;antlr-3.2.jar Main

which will print the following to the console:

<h1> header 1 </h1>             

some text here           

<h3> header level 3 </h3>   

and some more text  
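If you want to sanity-check the expected output without generating anything, the following hand-written Java sketch approximates what grammar T does for well-formed input. It is not the code ANTLR generates: the class name HeadingSketch and its methods are made up for this illustration, and error handling is reduced to a single exception for an unclosed heading marker.

```java
import java.util.ArrayList;
import java.util.List;

public class HeadingSketch {

    // Mimics the lexer rules: '===' (H3), '==' (H2), '=' (H1),
    // and a one-character fall-through token (Any).
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            if (src.startsWith("===", i)) {
                tokens.add("===");
                i += 3;
            } else if (src.startsWith("==", i)) {
                tokens.add("==");
                i += 2;
            } else {
                // A lone '=' (H1) or any other single character.
                tokens.add(src.substring(i, i + 1));
                i++;
            }
        }
        return tokens;
    }

    // Mimics the parser rules: a heading marker, then every token that is
    // not the same marker, then the closing marker; other tokens are copied.
    static String toHtml(String src) {
        List<String> tokens = tokenize(src);
        StringBuilder html = new StringBuilder();
        int i = 0;
        while (i < tokens.size()) {
            String t = tokens.get(i);
            if (t.charAt(0) != '=') {
                html.append(t);
                i++;
                continue;
            }
            int close = -1;
            for (int j = i + 1; j < tokens.size(); j++) {
                if (tokens.get(j).equals(t)) {
                    close = j;
                    break;
                }
            }
            if (close < 0) {
                throw new IllegalArgumentException("unclosed heading marker: " + t);
            }
            int level = t.length(); // "=" -> 1, "==" -> 2, "===" -> 3
            html.append("<h").append(level).append('>');
            for (int j = i + 1; j < close; j++) {
                html.append(tokens.get(j));
            }
            html.append("</h").append(level).append('>');
            i = close + 1;
        }
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("= header 1 =\nsome text here\n=== header level 3 ==="));
    }
}
```

The tokenizer tries '===' before '==' before '=', mirroring the three heading tokens in the grammar, and a level-1 heading may contain '==' or '===' tokens just as h1Content : ~H1* allows.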

But, again, if you are free to choose a different language to parse, I'd do that and forget about parsing this horrible Wiki-thing.

Anyway, whatever you do: best of luck!
