Indexing Wikipedia with Solr

Question

I've installed Solr 4.6.0 and followed the tutorial available on Solr's home page. Everything was fine until I got to the real job I need to do. I need fast access to Wikipedia content, and I was advised to use Solr. I tried to follow the example at http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia, but I couldn't follow it. I'm a newbie, and I don't know what data_config.xml refers to:

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/data/enwiki-20130102-pages-articles.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
       </entity>
        </document>
</dataConfig>

I couldn't find this file in the Solr home directory. I also looked at related questions, "How to index wikipedia files in .xml format into solr" and "Indexing wikipedia dump with solr", but they didn't clear up my doubts.

I think I need something more basic that guides me step by step, because the tutorial is confusing when it gets to indexing Wikipedia.

Any advice pointing me in the right direction would be appreciated.

Recommended answer

Well, I've read many things on the Web and tried to collect as much information as possible. This is how I arrived at the solution:

Here is my solrconfig.xml:

...
  <!-- ****** Data import handler -->
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </requestHandler>
...
  <lib dir="../../../dist/" regex="solr-dataimporthandler-.*\.jar" />

Here is my data-config.xml (important: it must be in the same folder as solrconfig.xml):

<dataConfig>
        <dataSource type="FileDataSource" encoding="UTF-8" />
        <document>
        <entity name="page"
                processor="XPathEntityProcessor"
                stream="true"
                forEach="/mediawiki/page/"
                url="/Applications/solr-4.6.0/example/exampledocs/simplewikiSubSet.xml"
                transformer="RegexTransformer,DateFormatTransformer"
                >
            <field column="id"        xpath="/mediawiki/page/id" />
            <field column="title"     xpath="/mediawiki/page/title" />
            <field column="revision"  xpath="/mediawiki/page/revision/id" />
            <field column="user"      xpath="/mediawiki/page/revision/contributor/username" />
            <field column="userId"    xpath="/mediawiki/page/revision/contributor/id" />
            <field column="text"      xpath="/mediawiki/page/revision/text" />
            <!-- HH = hour of day (0-23), as in dump timestamps; hh would be the 12-hour clock -->
            <field column="timestamp" xpath="/mediawiki/page/revision/timestamp" dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss'Z'" />
            <field column="$skipDoc"  regex="^#REDIRECT .*" replaceWith="true" sourceColName="text"/>
        </entity>
        </document>
</dataConfig>

Attention: the last `<field>` line is very important! It sets `$skipDoc`, so pages whose text starts with `#REDIRECT` are not indexed.
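To make the mapping in data-config.xml easier to picture, here is a rough Python sketch of what the entity does: visit each `<page>`, read the configured paths, and drop pages whose text starts with `#REDIRECT`, as the `$skipDoc` field is meant to. This is not how Solr implements it (Solr streams the file with its own XPath reader and Java regex). The two-page sample and the names `SAMPLE`, `FIELDS` and `extract_pages` are made up for illustration, and the sample omits the XML namespace that real dumps declare.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical two-page sample shaped like a MediaWiki dump (no XML namespace).
SAMPLE = """<mediawiki>
  <page>
    <title>Solr</title>
    <id>1</id>
    <revision>
      <id>100</id>
      <timestamp>2013-01-02T15:04:05Z</timestamp>
      <contributor><username>alice</username><id>7</id></contributor>
      <text>Apache Solr is a search platform.</text>
    </revision>
  </page>
  <page>
    <title>SOLR</title>
    <id>2</id>
    <revision>
      <id>101</id>
      <timestamp>2013-01-02T16:00:00Z</timestamp>
      <contributor><username>bob</username><id>8</id></contributor>
      <text>#REDIRECT [[Solr]]</text>
    </revision>
  </page>
</mediawiki>"""

# Column name -> path relative to each <page>, mirroring the xpath attributes
# in data-config.xml (with the leading /mediawiki/page/ removed).
FIELDS = {
    "id": "id",
    "title": "title",
    "revision": "revision/id",
    "user": "revision/contributor/username",
    "userId": "revision/contributor/id",
    "text": "revision/text",
    "timestamp": "revision/timestamp",
}

# Same pattern as the regex attribute on the $skipDoc field.
REDIRECT = re.compile(r"^#REDIRECT .*")


def extract_pages(xml_text):
    """Return one dict per <page>, leaving out redirect pages."""
    docs = []
    root = ET.fromstring(xml_text)
    for page in root.findall("page"):
        doc = {column: page.findtext(path) for column, path in FIELDS.items()}
        if REDIRECT.match(doc["text"] or ""):
            continue  # redirect page: skipped, like $skipDoc=true
        docs.append(doc)
    return docs


docs = extract_pages(SAMPLE)
print([d["title"] for d in docs])  # ['Solr']
```

Only the first page survives, because the second page's text matches the redirect pattern. The pattern is case-sensitive, so a page starting with lowercase `#redirect` would not be caught by it.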

My schema.xml:

...
   <field name="id"        type="string"  indexed="true" stored="true" required="true"/>
   <field name="title"     type="string"  indexed="true" stored="false"/>
   <field name="revision"  type="int"    indexed="true" stored="true"/>
   <field name="user"      type="string"  indexed="true" stored="true"/>
   <field name="userId"    type="int"     indexed="true" stored="true"/>
   <field name="text"      type="text_en"    indexed="true" stored="false"/>
   <field name="timestamp" type="date"    indexed="true" stored="true"/>
   <field name="titleText" type="text_en"    indexed="true" stored="true"/>
...
 <uniqueKey>id</uniqueKey>
...
   <copyField source="title" dest="titleText"/>
...

And it's done. That's all, folks!
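The configuration registers the handler at `/dataimport`, but nothing is indexed until an import is requested from it. As a minimal sketch, the helper below only builds the request URLs for the DataImportHandler `full-import` and `status` commands. The host, port and core name (`localhost:8983`, `collection1`) are assumptions based on the stock Solr 4.x example setup, and `dataimport_url` is a hypothetical helper, so adjust them to your installation.

```python
from urllib.parse import urlencode


def dataimport_url(base_url, core, command, **params):
    """Build a URL for the /dataimport handler configured in solrconfig.xml."""
    query = urlencode({"command": command, **params})
    return f"{base_url.rstrip('/')}/{core}/dataimport?{query}"


# Assumed defaults of the Solr 4.x example server; change them for your setup.
BASE_URL = "http://localhost:8983/solr"
CORE = "collection1"

full_import = dataimport_url(BASE_URL, CORE, "full-import")
status = dataimport_url(BASE_URL, CORE, "status")

print(full_import)  # http://localhost:8983/solr/collection1/dataimport?command=full-import
print(status)       # http://localhost:8983/solr/collection1/dataimport?command=status
```

Opening the first URL in a browser (or fetching it with any HTTP client) should start the import, and the second reports its progress.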