Working example of wikitext-to-HTML in ANTLR 3

Question

I'm trying to flesh out a wikitext-to-HTML translator in ANTLR 3, but I keep getting stuck.

Do you know of a working example that I can inspect? I tried the MediaWiki ANTLR grammar and the Wiki Creole grammar, but I can't get them to generate the lexer and parser in ANTLR 3.

Here are the links to the two grammars I've tried using:

  • http://www.mediawiki.org/wiki/Markup_spec/ANTLR
  • http://www.wikicreole.org/wiki/EBNFGrammarForCreole1.0

I can't get either of them to generate my Java lexer and parser. (I'm using ANTLR 3 as an Eclipse plugin.) The MediaWiki grammar takes a very long time to build and then at some point throws an OutOfMemory exception. The other one has errors in it that I don't know how to debug.

EDIT: Okay, I've got a very basic grammar:

grammar wikitext;

options {
  //output = AST;
  //ASTLabelType = CommonTree;
  output = template;
  language = Java;
}

document: line (NL line?)*;

line: horizontal_line | list | heading | paragraph;

/* horizontal line */
horizontal_line: HRLINE;

/* lists */
list: unordered_list | ordered_list;

unordered_list: '*'+ content;
ordered_list: '#'+ content;

/* Headings */
heading: heading1 | heading2 | heading3 | heading4 | heading5 | heading6;
heading1: H1 plain H1;
heading2: H2 plain H2;
heading3: H3 plain H3;
heading4: H4 plain H4;
heading5: H5 plain H5;
heading6: H6 plain H6;

/* Paragraph */
paragraph: content;

content: (formatted | link)+;

/* links */
link: external_link | internal_link;

external_link: '[' external_link_uri ('|' external_link_title)? ']';
internal_link: '[[' internal_link_ref ('|' internal_link_title)? ']]' ;

external_link_uri: CHARACTER+;
external_link_title: plain;
internal_link_ref: plain;
internal_link_title: plain;

/* bold & italic */
formatted: bold_italic | bold | italic | plain;

bold_italic: BOLD_ITALIC plain BOLD_ITALIC;
bold: BOLD plain BOLD;
italic: ITALIC plain ITALIC;

/* Plain text */
plain: (CHARACTER | SPACE)+;


/**
 * LEXER RULES
 * --------------------------------------------------------------------------
 */

HRLINE: '---' '-'+;

H1: '=';
H2: '==';
H3: '===';
H4: '====';
H5: '=====';
H6: '======';

BOLD_ITALIC: '\'\'\'\'\'';
BOLD: '\'\'\'';
ITALIC: '\'\'';

NL: '\r'?'\n';

CHARACTER       :       '!' | '"' | '#' | '$' | '%' | '&'
                |       '*' | '+' | ',' | '-' | '.' | '/'
                |       ':' | ';' | '?' | '@' | '\\' | '^' | '_' | '`' | '~'
                |       '0'..'9' | 'A'..'Z' |'a'..'z' 
                |       '\u0080'..'\u7fff'
                |       '(' | ')'
                |       '\'' | '<' | '>' | '=' | '[' | ']' | '|' 
                ;

SPACE: ' ' | '\t';

It's not clear to me, though, how one would go about outputting HTML. I've been looking into StringTemplate, but I don't understand how to structure my templates. Specifically, which template goes where in the grammar? Can you help me with a short example?

Recommended answer

Okay, after your EDIT, I have a couple of recommendations.

Like I said in the comments, writing a grammar for such a language is nearly impossible, at least if you try to do it in one go. The only way I see this working is with multiple parsers, where the first parsing stage parses the wiki source very coarsely. For example, a table would be tokenized as TABLE : '{|' .* '|}', and you'd then create another parser that parses this table properly. Doing it all in one parser will, IMO, result in quite a few ambiguities in your parser rules.
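To make the "coarse first stage" idea concrete, here is a rough plain-Java sketch that does not use ANTLR at all. It only illustrates the splitting step: a reluctant regex stands in for the TABLE : '{|' .* '|}' token, and everything else is passed along as plain text. The class name CoarseWikiSplitter and the Segment and Kind types are made up for this illustration, and nested tables are not handled; treat it as a sketch under those assumptions, not as how a real wiki parser does it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a coarse first pass over wiki source.
public class CoarseWikiSplitter {

    /** Kind of block found by the coarse pass. */
    public enum Kind { TEXT, TABLE }

    /** One coarse segment of the source (illustrative type, not from ANTLR). */
    public static final class Segment {
        public final Kind kind;
        public final String text;

        Segment(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // Reluctant match from "{|" up to the nearest "|}", spanning line breaks.
    // Assumption: tables are not nested.
    private static final Pattern TABLE =
            Pattern.compile("\\{\\|.*?\\|\\}", Pattern.DOTALL);

    /** Splits the source into TABLE blocks and the TEXT between them. */
    public static List<Segment> split(String source) {
        List<Segment> segments = new ArrayList<>();
        Matcher m = TABLE.matcher(source);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                segments.add(new Segment(Kind.TEXT, source.substring(last, m.start())));
            }
            segments.add(new Segment(Kind.TABLE, m.group()));
            last = m.end();
        }
        if (last < source.length()) {
            segments.add(new Segment(Kind.TEXT, source.substring(last)));
        }
        return segments;
    }

    public static void main(String[] args) {
        String source = "intro\n{|\n| cell\n|}\noutro";
        // Each TABLE segment would then be handed to a second, table-only parser.
        for (Segment s : split(source)) {
            System.out.println(s.kind + " (" + s.text.length() + " chars)");
        }
    }
}
```

Each TABLE segment could then be fed to a dedicated table grammar, while the TEXT segments go to a simpler grammar such as the one in the demo further down.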

As for emitting HTML code, the "proper" way to do this is indeed with StringTemplate, but given that you're rather new to ANTLR itself, I'd keep things simple. You could create a StringBuilder attribute in your parser class that collects all your HTML code as you parse your source file. You can embed code in ANTLR rules by wrapping it in { and }.

Here's a quick demo:

grammar T;

@parser::members {

  // an attribute that is only available in your 
  // parser (so only in parser rules!)
  protected StringBuilder htmlBuilder = new StringBuilder();
}

// Parser rules
parse
  :  atom+ EOF
  ;

atom
  :  header
  |  Any    {htmlBuilder.append($Any.text);} // append the text from 'Any' token
  ;

header
  :  H3 h3Content H3 {htmlBuilder.append("<h3>" + $h3Content.text + "</h3>");}
  |  H2 h2Content H2 {htmlBuilder.append("<h2>" + $h2Content.text + "</h2>");}
  |  H1 h1Content H1 {htmlBuilder.append("<h1>" + $h1Content.text + "</h1>");}
  ;

h3Content : ~H3*; // match any token except H3, zero or more times
h2Content : ~H2*; //        "               H2          "
h1Content : ~H1*; //        "               H1          "

// Lexer rules    
H3 : '===';
H2 : '==';
H1 : '=';

// Fall-through rule: if none of the above
// lexer rules matched, this one will.
Any
  :  .
  ;

From that grammar, you generate a parser and lexer:

java -cp antlr-3.2.jar org.antlr.Tool T.g

and then create a little class to test your parser:

import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {

        // the source to be parsed
        String source = 
                "= header 1 =             \n"+
                "                         \n"+
                "some text here           \n"+
                "                         \n"+
                "=== header level 3 ===   \n"+
                "                         \n"+
                "and some more text         ";

        ANTLRStringStream in = new ANTLRStringStream(source);
        TLexer lexer = new TLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        TParser parser = new TParser(tokens);

        // invoke the start-rule in your parser
        parser.parse();

        // print the contents of your parser's StringBuilder
        System.out.println(parser.htmlBuilder);
    }
}

and then compile all your source files:

javac -cp antlr-3.2.jar *.java

Finally, run your main class:

// *nix & MacOS
java -cp .:antlr-3.2.jar Main

// Windows
java -cp .;antlr-3.2.jar Main

which will print the following to the console:

<h1> header 1 </h1>             

some text here           

<h3> header level 3 </h3>   

and some more text  

But, again, if you are free to choose a different language to parse, I'd do that and forget about parsing this horrible Wiki-thing.

Anyway, whatever you do: best of luck!
