Solr 6 and Nutch 2.3.1 integration

Question

According to the Nutch news page, the latest version of Nutch is 2.3.1, which is compatible with Solr 4.10.3, a very old version of Solr.

Can we integrate Solr 6 with Nutch 2.3.1? What would the drawbacks be if Solr 6 were integrated? Has anybody tried this?

Recommended answer

This is an old question, but I just got Nutch 1.12 talking to Solr 6.3.0. The required schema/solrconfig changes should be the same for Nutch 2.x, so here's what I did:

Download and extract both products into some directory, e.g. ~/mycrawler, then go into the Solr directory and create a core for Nutch:

solr-6.3.0/bin $ ./solr start
solr-6.3.0/bin $ ./solr create_core -c nutch -d basic_configs
solr-6.3.0/bin $ ./solr stop

This will create solr-6.3.0/server/solr/nutch, where the schema etc. will be located. Now we need to remove the new auto-managed schema definition and replace it with the Nutch-supplied schema.xml:

solr-6.3.0/server/solr/nutch/conf $ rm managed-schema
solr-6.3.0/server/solr/nutch/conf $ cp ~/mycrawler/apache-nutch-1.12/conf/schema.xml .

Now edit schema.xml and remove enablePositionIncrements="true" from every <filter class="solr.StopFilterFactory" ignoreCase="true" ... definition.
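
The attribute removal above can also be scripted. The following is only a sketch: strip_position_increments is a hypothetical helper name, and it assumes the attribute appears in schema.xml only on StopFilterFactory filters, because it removes every occurrence in the file. It keeps a .bak copy of the original so the change can be reviewed.

```shell
# Sketch: strip enablePositionIncrements="true" from a schema file.
# strip_position_increments is a hypothetical helper, not part of Solr or Nutch.
strip_position_increments() {
    schema_file="$1"
    # Keep a backup next to the original file.
    cp "$schema_file" "$schema_file.bak"
    # Delete the attribute, and any whitespace in front of it, wherever it appears.
    sed 's/[[:space:]]*enablePositionIncrements="true"//g' "$schema_file.bak" > "$schema_file"
}
```

For example, from solr-6.3.0/server/solr/nutch/conf you could run strip_position_increments schema.xml and then diff schema.xml.bak schema.xml to check that only the intended attributes were removed.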

Also, in solr-6.3.0/server/solr/nutch/conf/solrconfig.xml, comment out these typeMapping blocks so that you get:

<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">strings</str>
  <!--
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">tdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str>
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">tlongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdoubles</str>
  </lst>
  -->
</processor>

Now start the server again:

solr-6.3.0/bin $ ./solr start

If you go to the admin GUI, it should show the core as started with no further schema issues.
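
The same check can be made from the command line. This sketch assumes Solr is listening on the default port 8983 and that the core is named nutch, as above. It calls Solr's CoreAdmin STATUS action and prints the JSON response. SOLR_URL, CORE and status_url are variable names chosen for illustration.

```shell
# Assumptions: Solr on the default port 8983, core named "nutch".
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-nutch}"

# CoreAdmin STATUS reports details such as the instance directory and index size.
status_url="$SOLR_URL/admin/cores?action=STATUS&core=$CORE&wt=json"

# -s silences the progress meter, -f makes curl fail on HTTP errors.
curl -sf --max-time 5 "$status_url" || echo "could not reach Solr at $SOLR_URL" >&2
```

If the core failed to load, the response should contain an initFailures entry for it rather than a populated status section.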

Now the crawl script can be run and will successfully write into our bleeding-edge Solr (this is probably slightly different for Nutch 2):

./crawl -i \
    -D solr.server.url=http://localhost:8983/solr/nutch \
    ~/mycrawler/nutch_work/seed \
    ~/mycrawler/nutch_work/crawl \
    1
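
To confirm that documents actually arrived after the crawl, one option is to ask the core how many documents it holds. This is a sketch under the same assumptions as the rest of the answer (Solr on localhost:8983, core named nutch). SOLR_URL, CORE and count_url are illustrative names.

```shell
# Assumptions: Solr on the default port 8983, core named "nutch".
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-nutch}"

# Match all documents but return no rows; only the hit count is of interest.
count_url="$SOLR_URL/$CORE/select?q=*:*&rows=0&wt=json"

# -s silences the progress meter, -f makes curl fail on HTTP errors.
curl -sf --max-time 5 "$count_url" || echo "could not query core $CORE at $SOLR_URL" >&2
```

The numFound field in the response gives the number of indexed documents. It should be greater than zero once a crawl round has finished indexing.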
