Indexing a Wikipedia dump with Solr

Question

I have Solr 3.6.2 installed on my machine, running without problems under Tomcat. I want to index a Wikipedia dump file with Solr. How do I do this using DataImportHandler? Is there any other way? I have no knowledge of XML.

The dump file is around 45 GB when extracted. Any help would be greatly appreciated.

Update: I tried following the instructions on the DataImportHandler page, but I get an error, perhaps because the Solr version used there is much older than mine.

My data config:

<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
    <entity name="page"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/mediawiki/page/"
            url="./data/enwiki.xml"
            transformer="RegexTransformer,DateFormatTransformer"
            >
        <field column="id"        xpath="/mediawiki/page/id" />
        <field column="title"     xpath="/mediawiki/page/title" />
        <field column="revision"  xpath="/mediawiki/page/revision/id" />
        <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
        <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
        <field column="text"      xpath="/mediawiki/page/revision/text" />
        <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
        <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
    </entity>
    </document>
</dataConfig>

Schema: I just added the parts given on that page to my schema.xml file.
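For illustration, here is a minimal sketch of schema.xml field declarations that would match the columns in the data config above. It assumes the field types `int`, `string`, `text_general`, and `date` are defined in your schema.xml, as they are in the stock Solr 3.6 example schema. If a snippet copied from an older guide references type names your schema does not define, the core may fail to load or the import may fail, so this is one thing worth checking. It is not a confirmed diagnosis of the error reported here.

```xml
<!-- Sketch only: place inside the <schema> element of schema.xml.
     Field names mirror the columns in the data config above.
     The type names (int, string, text_general, date) are assumptions:
     they must match <fieldType> definitions already present in your schema. -->
<fields>
  <field name="id"        type="string"       indexed="true" stored="true" required="true"/>
  <field name="title"     type="string"       indexed="true" stored="true"/>
  <field name="revision"  type="int"          indexed="true" stored="true"/>
  <field name="user"      type="string"       indexed="true" stored="true"/>
  <field name="userId"    type="int"          indexed="true" stored="true"/>
  <!-- Article body: indexed for search but not stored, to keep the index smaller -->
  <field name="text"      type="text_general" indexed="true" stored="false"/>
  <field name="timestamp" type="date"         indexed="true" stored="true"/>
</fields>

<uniqueKey>id</uniqueKey>
```

Leaving `text` unstored is a design choice for a dump of this size. Set `stored="true"` if you need to return or highlight the article body in results.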

The error I get is:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">solr-data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Time Elapsed">0:0:1.381</str>
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">0</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="">Indexing failed. Rolled back all changes.</str>
<str name="Rolledback">2013-05-17 16:48:32</str>
</lst>
<str name="WARNING">
This response format is experimental. It is likely to change in the future.
</str>
</response>

Please help.

Recommended answer

Simple post is not the right way to index Wikipedia. You need to look into using DataImportHandler (DIH) instead. DIH supports streaming import, which matters for a file of this size.
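As a rough sketch of how DIH is usually wired up, the handler is registered in solrconfig.xml and pointed at the data config file. The config file name `solr-data-config.xml` is taken from the `initArgs` in the response shown in the question. The `<lib>` path and the handler name `/dataimport` are assumptions and need to be adjusted to your own installation, especially under Tomcat.

```xml
<!-- Sketch only: place inside the <config> element of solrconfig.xml. -->

<!-- Assumed location of the DataImportHandler jars; this path is hypothetical
     and depends on where the Solr distribution was unpacked. -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />

<!-- Registers DIH under /dataimport (an assumed name) and tells it
     which data config file to load from the core's conf directory. -->
<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">solr-data-config.xml</str>
  </lst>
</requestHandler>
```

With a handler registered this way, an import is typically started by requesting the handler URL with `command=full-import` and monitored with `command=status`. When an import ends with "Indexing failed. Rolled back all changes.", the underlying exception is normally written to the servlet container's log rather than to this response, so the Tomcat log is the place to look.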
