Indexing a Wikipedia dump with Solr

Problem description

I have Solr 3.6.2 installed on my machine, running perfectly with Tomcat. I want to index a Wikipedia dump file using Solr. How do I do this using DataImportHandler? Is there any other way? I don't have any knowledge of XML.

The file I mentioned is around 45GB when extracted. Any help would be greatly appreciated.

Update: I tried doing what is described on the DataImportHandler page, but I get an error, maybe because the Solr version used there is much older.

My data.config:

<dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="./data/enwiki.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
        </entity>
    </document>

Schema (I just added the parts given on the website to my schema.xml file)

The error I am getting is:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<lst name="initArgs">
<lst name="defaults">
<str name="config">solr-data-config.xml</str>
</lst>
</lst>
<str name="command">full-import</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages">
<str name="Time Elapsed">0:0:1.381</str>
<str name="Total Requests made to DataSource">0</str>
<str name="Total Rows Fetched">0</str>
<str name="Total Documents Processed">0</str>
<str name="Total Documents Skipped">0</str>
<str name="">Indexing failed. Rolled back all changes.</str>
<str name="Rolledback">2013-05-17 16:48:32</str>
</lst>
<str name="WARNING">
This response format is experimental. It is likely to change in the future.
</str>
</response>

Please help.

Recommended answer

Simple post is not the right way to index Wikipedia. You need to look into using DataImportHandler instead. DIH supports streaming import, which matters for a file of this size.
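
To make "streaming import" concrete, here is a minimal Python sketch of the same idea outside Solr: read the dump one <page> at a time, keep only the fields named in the question's config (id, title, text), and skip redirects with the same regex as the $skipDoc rule. This is not how DIH is implemented; the function names (local, iter_pages) and the tiny SAMPLE document are made up for illustration, and a real dump would be passed as a file path or file object instead.

```python
import io
import re
import xml.etree.ElementTree as ET

# Same pattern as the $skipDoc rule in the question's config.
REDIRECT = re.compile(r"^#REDIRECT .*")


def local(tag):
    """Drop an XML namespace prefix, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield one dict per non-redirect <page> without loading the whole file."""
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local(elem.tag) != "page":
            continue
        doc = {}
        for child in elem:
            name = local(child.tag)
            if name in ("id", "title"):
                doc[name] = child.text
            elif name == "revision":
                for sub in child:
                    if local(sub.tag) == "text":
                        doc["text"] = sub.text or ""
        root.clear()  # discard finished pages so memory use stays bounded
        if not REDIRECT.match(doc.get("text", "")):
            yield doc


# Hypothetical two-page sample standing in for the real dump.
SAMPLE = b"""<mediawiki>
  <page><title>Solr</title><id>1</id>
    <revision><id>10</id><text>Apache Solr is a search server.</text></revision>
  </page>
  <page><title>SOLR</title><id>2</id>
    <revision><id>11</id><text>#REDIRECT [[Solr]]</text></revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page in pages:
    print(page["id"], page["title"])
```

Because each page is discarded as soon as it has been handled, memory use depends on the largest single page rather than on the 45GB total, which is the property that stream="true" asks of XPathEntityProcessor in the config above.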