Solr 6 and Nutch 2.3.1 integration
Question
According to the Nutch news page, the latest version of Nutch is 2.3.1, which is compatible with Solr 4.10.3, a very old version of Solr.
Can we integrate Solr 6 with Nutch 2.3.1? What would the drawbacks be if Solr 6 were integrated? Has anybody tried this?
Recommended answer
This is an old question, but I just got Nutch 1.12 talking to Solr 6.3.0. The required schema/solrconfig changes should be the same for Nutch 2.x, so here's what I did:
Download and extract both products into some directory, e.g. ~/mycrawler, then go into the Solr directory and create a core for Nutch:
solr-6.3.0/bin $ ./solr start
solr-6.3.0/bin $ ./solr create_core -c nutch -d basic_configs
solr-6.3.0/bin $ ./solr stop
This will create solr-6.3.0/server/solr/nutch, where the schema etc. will be located. Now we need to remove the new auto-managed schema definition and replace it with the Nutch-supplied schema.xml:
solr-6.3.0/server/solr/nutch/conf $ rm managed-schema
solr-6.3.0/server/solr/nutch/conf $ cp ~/mycrawler/apache-nutch-1.12/conf/schema.xml .
Now edit schema.xml and remove all instances of enablePositionIncrements="true" in all <filter class="solr.StopFilterFactory" ignoreCase="true" ... definitions.
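The manual edit above can also be scripted. The following is a minimal sketch, not part of the original answer: it assumes a POSIX sed and that the attribute appears literally as enablePositionIncrements="true". The function name strip_position_increments is hypothetical, and the .bak backup is an added precaution.

```shell
#!/bin/sh
# Hypothetical helper: remove every enablePositionIncrements="true" attribute
# from the given schema file. The original is kept next to it as <file>.bak.
strip_position_increments() {
    schema_file="$1"
    cp "$schema_file" "$schema_file.bak"
    # Delete the attribute together with any whitespace directly in front of it,
    # so the remaining attributes of the <filter> element stay well-formed.
    sed 's/[[:space:]]*enablePositionIncrements="true"//g' \
        "$schema_file.bak" > "$schema_file"
}

# Usage, from solr-6.3.0/server/solr/nutch/conf:
# strip_position_increments schema.xml
```

Comparing schema.xml with schema.xml.bak afterwards (for example with diff) shows exactly which filter definitions were changed.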
Also, in solr-6.3.0/server/solr/nutch/conf/solrconfig.xml, comment out these typeMapping blocks, so you get:
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
  <str name="defaultFieldType">strings</str>
  <!--
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">booleans</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.util.Date</str>
    <str name="fieldType">tdates</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Long</str>
    <str name="valueClass">java.lang.Integer</str>
    <str name="fieldType">tlongs</str>
  </lst>
  <lst name="typeMapping">
    <str name="valueClass">java.lang.Number</str>
    <str name="fieldType">tdoubles</str>
  </lst>
  -->
</processor>
Now start the server again:
solr-6.3.0/bin $ ./solr start
If you go to the admin GUI, it should show the core as started, with no further schema issues.
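The same check can be made from the command line through Solr's CoreAdmin STATUS request. This is a rough sketch under assumptions: Solr is taken to listen on its default address http://localhost:8983/solr, curl is assumed to be installed, and the helper names core_status_url and core_status are hypothetical.

```shell
#!/bin/sh
# Assumption: Solr runs on the default host and port; override via SOLR_BASE_URL.
SOLR_BASE_URL="${SOLR_BASE_URL:-http://localhost:8983/solr}"

# Hypothetical helper: print the CoreAdmin STATUS URL for the named core.
core_status_url() {
    printf '%s/admin/cores?action=STATUS&core=%s&wt=json\n' "$SOLR_BASE_URL" "$1"
}

# Hypothetical helper: fetch the status of the named core.
# Requires curl and a running Solr instance.
core_status() {
    curl -s "$(core_status_url "$1")"
}

# Usage:
# core_status nutch
```

If the core failed to load, for instance because of a schema error, the response should report it under initFailures rather than listing the core's details.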
Now the crawl script can be run and will successfully write into our bleeding-edge Solr (this is probably slightly different for Nutch 2):
./crawl -i \
-D solr.server.url=http://localhost:8983/solr/nutch \
~/mycrawler/nutch_work/seed \
~/mycrawler/nutch_work/crawl \
1