Wikipedia: Java library to remove Wikipedia text markup

Question

I downloaded a Wikipedia dump and now want to remove the Wikipedia markup from the contents of each page. I tried writing regular expressions, but there are too many cases to handle. I found a Python library, but I need a Java library because I want to integrate it into my own code.

Thanks.

Recommended answer

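Do it in two steps:

  1. Let an existing tool convert the MediaWiki markup to plain HTML;

  2. Convert the plain HTML to text.

The following demo shows both steps: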

import net.java.textilej.parser.MarkupParser;
import net.java.textilej.parser.builder.HtmlDocumentBuilder;
import net.java.textilej.parser.markup.mediawiki.MediaWikiDialect;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.io.StringWriter;

public class Test {

    public static void main(String[] args) throws Exception {

        String markup = "This is ''italic'' and '''that''' is bold. \n"+
                "=Header 1=\n"+
                "a list: \n* item A \n* item B \n* item C";

        // Step 1: convert the MediaWiki markup to an HTML fragment
        StringWriter writer = new StringWriter();

        HtmlDocumentBuilder builder = new HtmlDocumentBuilder(writer);
        builder.setEmitAsDocument(false); // emit a fragment, not a full <html> document

        MarkupParser parser = new MarkupParser(new MediaWikiDialect());
        parser.setBuilder(builder);
        parser.parse(markup);

        final String html = writer.toString();
        final StringBuilder cleaned = new StringBuilder();

        // Step 2: collect only the text nodes of the HTML, dropping all tags
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                cleaned.append(new String(data)).append(' ');
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, false);

        System.out.println(markup);
        System.out.println("---------------------------");
        System.out.println(html);
        System.out.println("---------------------------");
        System.out.println(cleaned);
    }
}

which produces:

This is ''italic'' and '''that''' is bold. 
=Header 1=
a list: 
* item A 
* item B 
* item C
---------------------------
<p>This is <i>italic</i> and <b>that</b> is bold. </p><h1 id="Header1">Header 1</h1><p>a list: </p><ul><li>item A </li><li>item B </li><li>item C</li></ul>
---------------------------
This is  italic  and  that  is bold. Header 1 a list: item A item B item C 
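
Step 2 relies only on classes that ship with the JDK, so it can be pulled out into a small reusable helper. The following is a minimal sketch, not part of the original answer: the class name `HtmlToText` and the method `strip` are hypothetical, and it assumes the input is an already-decoded HTML fragment like the one shown above. Unlike the demo, it also collapses the doubled spaces that appear in the last block of output.

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class; the name is an assumption, not from the original answer.
class HtmlToText {

    // Returns the text content of an HTML fragment with all tags removed.
    static String strip(String html) {
        final StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate adjacent text nodes with a space so words do not run together
                text.append(data).append(' ');
            }
        };

        try {
            // true: ignore any charset declared in the HTML, since the input is already a String
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected for an in-memory StringReader; rethrow unchecked
            throw new UncheckedIOException(e);
        }

        // Collapse runs of whitespace left behind by the per-node separator
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>This is <i>italic</i> and <b>that</b> is bold. </p>"
                + "<h1 id=\"Header1\">Header 1</h1>";
        System.out.println(strip(html));
    }
}
```

When processing a whole dump, a helper like this could be called once per page on the HTML that the markup parser writes, keeping the two steps independent of each other.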




Where do you download the Java packages you are importing?

Here (web archive link): download.java.net/maven/2/net/java/textile-j/2.2
