Indexing Wikipedia with Solr


Question

I've installed Solr 4.6.0 and followed the tutorial available on Solr's home page. Everything was fine until I had to start the real job I need to do. I need fast access to Wikipedia content, and I was advised to use Solr. I tried to follow the example at http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia, but I couldn't get it to work. I'm a newbie, and I don't know what data_config.xml means!

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/data/enwiki-20130102-pages-articles.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

I couldn't find it in the Solr home directory. I also looked at related questions, "How to index wikipedia files in .xml format into solr" and "Indexing wikipedia dump with solr", but they didn't resolve my doubts.

I think I need something more basic that guides me step by step, because the tutorial is confusing when it deals with indexing Wikipedia.

Any advice pointing me in the right direction would be nice.

Recommended answer

Well, I've read many things on the Web and tried to collect as much information as possible. This is how I found the solution:

Here is my solrconfig.xml:

...
  <!-- ****** Data import handler -->
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
...
  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />
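
Once the remaining files are in place, the handler registered under `/dataimport` is driven over HTTP with a `command` parameter (for example `full-import` or `status`). The sketch below only builds such URLs; the host, port, and core name (`localhost:8983`, `collection1`) are assumptions taken from the stock Solr 4.x example server, and the helper names are hypothetical, so adjust them to your installation.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url, core, command, **params):
    """Build a URL for the /dataimport handler registered in solrconfig.xml."""
    query = {"command": command}
    query.update(params)
    return f"{base_url}/{core}/dataimport?{urlencode(query)}"


def run_command(url):
    """Send the request and return the raw response body (needs a running Solr)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Assumed defaults of the Solr 4.x example server; change as needed.
    base = "http://localhost:8983/solr"
    print(dataimport_url(base, "collection1", "full-import", clean="true"))
    print(dataimport_url(base, "collection1", "status"))
```

Calling `run_command` on the first URL would start the import, and polling the second one reports its progress; both require Solr to be running.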

Here is my data-config.xml (important: it must be in the same folder as solrconfig.xml):

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/Applications/solr-4.6.0/example/exampledocs/simplewikiSubSet.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

Attention: The last line is very important!
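
To make the mapping more concrete, here is a rough Python model of what the entity does: one document per `<page>`, fields picked by path, and pages whose text matches the `$skipDoc` pattern dropped as redirects. This is a simplified illustration, not how the DataImportHandler is implemented. The sample XML is made up and omits the XML namespace that real Wikipedia dumps declare, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Tiny made-up sample in the shape of a MediaWiki dump (no namespace).
SAMPLE = """<mediawiki>
  <page>
    <title>April</title>
    <id>1</id>
    <revision>
      <id>101</id>
      <timestamp>2013-01-02T15:04:05Z</timestamp>
      <contributor><username>Alice</username><id>7</id></contributor>
      <text>April is the fourth month of the year.</text>
    </revision>
  </page>
  <page>
    <title>Apr</title>
    <id>2</id>
    <revision>
      <id>102</id>
      <timestamp>2013-01-03T08:00:00Z</timestamp>
      <contributor><username>Bob</username><id>8</id></contributor>
      <text>#REDIRECT [[April]]</text>
    </revision>
  </page>
</mediawiki>"""

# Same pattern as the $skipDoc field in data-config.xml.
REDIRECT = re.compile(r"^#REDIRECT .*")

# Field name -> path relative to <page>, mirroring the xpath attributes.
FIELD_PATHS = {
    "id": "id",
    "title": "title",
    "revision": "revision/id",
    "user": "revision/contributor/username",
    "userId": "revision/contributor/id",
    "text": "revision/text",
    "timestamp": "revision/timestamp",
}


def extract_documents(xml_text):
    """Return one dict per non-redirect <page> element."""
    docs = []
    root = ET.fromstring(xml_text)
    for page in root.findall("page"):  # plays the role of forEach
        doc = {name: page.findtext(path) for name, path in FIELD_PATHS.items()}
        if REDIRECT.match(doc["text"] or ""):
            continue  # redirect page: skipped, like $skipDoc=true
        doc["timestamp"] = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        docs.append(doc)
    return docs


if __name__ == "__main__":
    for doc in extract_documents(SAMPLE):
        print(doc["id"], doc["title"], doc["timestamp"].isoformat())
```

Without the redirect check, every redirect stub in the dump would become its own near-empty document in the index.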

My schema.xml:

...
   <field name="id"        type="string"  indexed="true" stored="true" required="true"/>
   <field name="title"     type="string"  indexed="true" stored="false"/>
   <field name="revision"  type="int"    indexed="true" stored="true"/>
   <field name="user"      type="string"  indexed="true" stored="true"/>
   <field name="userId"    type="int"     indexed="true" stored="true"/>
   <field name="text"      type="text_en"    indexed="true" stored="false"/>
   <field name="timestamp" type="date"    indexed="true" stored="true"/>
   <field name="titleText" type="text_en"    indexed="true" stored="true"/>
...
 <uniqueKey>id</uniqueKey>
...
   <copyField source="title" dest="titleText"/>
...
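
Note that in this schema `title` and `text` are declared with `stored="false"`, so they can be searched but are not returned in results, while `titleText` (filled by the `copyField`) is both analyzed and stored. A minimal sketch of a query URL that searches `titleText` and asks only for stored fields follows; the host, port, and core name are assumptions based on the stock Solr 4.x example server, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def title_search_url(base_url, core, words, rows=10):
    """Build a /select URL that searches the analyzed titleText field."""
    params = {
        "q": f"titleText:({words})",
        "fl": "id,titleText,user,timestamp",  # only fields with stored="true"
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed defaults of the Solr 4.x example server; change as needed.
    print(title_search_url("http://localhost:8983/solr", "collection1", "solar system"))
```

If the article body itself has to come back from Solr, the `text` field would need `stored="true"`, at the cost of a larger index.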

And it's done. That's all, folks!
