Nutch does not index on Elasticsearch correctly using MongoDB


Problem description

I am running Nutch 2.3.1, MongoDB 3.2.9, and Elasticsearch 2.4.1. I followed a mix of this tutorial:

https://qbox.io/blog/scraping-the-web-with-nutch-for-elasticsearch

and this tutorial:

http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/

My goal is to build a web crawling tool out of those three pieces of software.

Everything works great until it comes down to indexing... as soon as I use the index command from nutch:

# bin/nutch index elasticsearch -all

this happens:

IndexingJob: starting
Active IndexWriters :
ElasticIndexWriter
        elastic.cluster : elastic prefix cluster
        elastic.host : hostname
        elastic.port : port (default 9300)
        elastic.index : elastic index command
        elastic.max.bulk.docs : ealstic bulk index doc counts. (default 250)
        elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

IndexingJob: done.

My nutch-site.xml:

<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
    <description>Default class for storing data</description>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>AOssama Crawler</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-(http|httpclient)|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-elastic|nutch-extensionpoints|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
  </property>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>

  <property>
    <name>elastic.cluster</name>
    <value>aossama</value>
  </property>

  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>

  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>6553600</value>
  </property>
</configuration>

I also looked into the ElasticIndexWriter.java code and noticed, near line 250, the class that calls the ElasticIndexWriter. I'm digging into that further now, but I'm completely lost as to why this isn't working with Mongo. I'm about to give up and try HBase instead, as much as I dislike it.

Thanks!

Joe

Solution

After a lot of trouble I got it working. I ended up using ES 1.4.4, Nutch 2.3.1, MongoDB 3.10, and JDK 8.

Here are the issues I went through, many of which remained unanswered in a number of other threads:

  • (this is an easy one but...) MAKE SURE EVERYTHING IS RUNNING. Make sure Elasticsearch is running on the correct machine with the correct port, and make sure you can talk to it. Make sure MongoDB is up and running on the correct port, and make sure you can talk to it.
  • Use the correct index command. For Nutch 2.3.1 it's: ./bin/nutch index -all (after you fetch and parse). If you run into a Solr error, you do not have the correct index function in your nutch-site.xml.
  • Name your crawler engine the SAME THING in your elasticsearch.yml and your nutch-site.xml (the elastic.cluster property). This was huge. It was the main reason my index function threw any errors.
  • Versioning. I tried to do this with the newer versions of Elasticsearch and frequently ran into problems. I am going to attempt to build this on the newest version of Elasticsearch and Mongo and get back to this thread. Try to use the same build I did first, then attempt the other builds. Elasticsearch versioning with nutch seems to be the most important part because of the dependencies regarding gora in the ivy/ivy.xml settings as well as the indexer-elastic/plugin.xml settings.
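
To illustrate the naming point in the list above, here is a minimal sketch of the Elasticsearch part of nutch-site.xml. It assumes that the name in question is the Elasticsearch cluster name, i.e. that elastic.cluster has to equal cluster.name in elasticsearch.yml; the answer does not spell this out, so treat it as an assumption. The values (localhost, aossama, nutch) are taken from the config in the question, and 9300 is the default port reported in the indexer output.

```xml
<!-- nutch-site.xml (fragment): Elasticsearch connection settings.
     Sketch only; values are copied from the question's config. -->
<configuration>
  <property>
    <name>elastic.host</name>
    <value>localhost</value>
  </property>
  <property>
    <!-- 9300 is the default listed in the IndexingJob output above -->
    <name>elastic.port</name>
    <value>9300</value>
  </property>
  <property>
    <!-- Assumption: must match the line "cluster.name: aossama"
         in elasticsearch.yml on the Elasticsearch side -->
    <name>elastic.cluster</name>
    <value>aossama</value>
  </property>
  <property>
    <name>elastic.index</name>
    <value>nutch</value>
  </property>
</configuration>
```

If the two names differ, it is worth fixing that first before looking at version mismatches, since the answer singles it out as the main source of indexing errors.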

Please, please, please, let me know if you're having any trouble with this. It took me close to 2 full weeks to figure this build out and I know it can be incredibly frustrating. PM me or post on this if you're running into issues, I'm sure I can help you work through them.

Joe
