Apache Nutch 1.12 with Apache Solr 6.2.1 gives an error

Problem description

I am using Apache Nutch 1.12 and Apache Solr 6.2.1 to crawl data on the internet and index it, and the combination gives an error: java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down

I have done the following, as learned from the Nutch tutorial: https://wiki.apache.org/nutch/NutchTutorial

  • Copied Nutch's schema.xml and placed it in Solr's conf folder
  • Put the seed URL (a newspaper company's site) in Nutch's urls/seed.txt
  • Changed the http.content.limit value to "-1" in nutch-site.xml. Since the seed URL is a newspaper company's site, I needed to remove the HTTP content download size limit (see the snippet after this list).
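For reference, here is a minimal sketch of that nutch-site.xml override (the property name and its semantics are per the Nutch configuration docs; -1 disables the download truncation limit):

    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>http.content.limit</name>
        <!-- -1 removes the HTTP content download size cap;
             the default value truncates large pages -->
        <value>-1</value>
      </property>
    </configuration>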

When I run the following command, I get an error:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 2

Above, TSolr is just the name of the Solr Core as you can probably guess already.
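As a quick sanity check (an illustrative command; it assumes the default ping handler is enabled on the core), you can confirm the core is reachable before crawling:

    curl "http://localhost:8983/solr/TSolr/admin/ping?wt=json"
    # a healthy core answers with "status":"OK" in the JSON response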

I am pasting the error log from hadoop.log below:

    2016-10-28 16:21:20,982 INFO  indexer.IndexerMapReduce - IndexerMapReduce: crawldb: TestCrawl/crawldb
2016-10-28 16:21:20,982 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: TestCrawl/linkdb
2016-10-28 16:21:20,982 INFO  indexer.IndexerMapReduce - IndexerMapReduces: adding segment: TestCrawl/segments/20161028161642
2016-10-28 16:21:46,353 WARN  conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1281422650/.staging/job_local1281422650_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-10-28 16:21:46,355 WARN  conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1281422650/.staging/job_local1281422650_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-10-28 16:21:46,415 WARN  conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1281422650_0001/job_local1281422650_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-10-28 16:21:46,416 WARN  conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1281422650_0001/job_local1281422650_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-10-28 16:21:46,565 INFO  anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-10-28 16:21:52,308 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-10-28 16:21:52,383 INFO  solr.SolrMappingReader - source: content dest: content
2016-10-28 16:21:52,383 INFO  solr.SolrMappingReader - source: title dest: title
2016-10-28 16:21:52,383 INFO  solr.SolrMappingReader - source: host dest: host
2016-10-28 16:21:52,383 INFO  solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:21:52,383 INFO  solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:21:52,383 INFO  solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:21:52,383 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:21:52,424 INFO  solr.SolrIndexWriter - Indexing 42/42 documents
2016-10-28 16:21:52,424 INFO  solr.SolrIndexWriter - Deleting 0 documents
2016-10-28 16:21:53,468 INFO  solr.SolrMappingReader - source: content dest: content
2016-10-28 16:21:53,468 INFO  solr.SolrMappingReader - source: title dest: title
2016-10-28 16:21:53,468 INFO  solr.SolrMappingReader - source: host dest: host
2016-10-28 16:21:53,468 INFO  solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:21:53,468 INFO  solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:21:53,468 INFO  solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:21:53,469 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:21:53,472 INFO  indexer.IndexingJob - Indexer: number of documents indexed, deleted, or skipped:
2016-10-28 16:21:53,476 INFO  indexer.IndexingJob - Indexer:     42  indexed (add/update)
2016-10-28 16:21:53,477 INFO  indexer.IndexingJob - Indexer: finished at 2016-10-28 16:21:53, elapsed: 00:00:32
2016-10-28 16:21:54,199 INFO  indexer.CleaningJob - CleaningJob: starting at 2016-10-28 16:21:54
2016-10-28 16:21:54,344 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-10-28 16:22:19,739 WARN  conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1653313730/.staging/job_local1653313730_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-10-28 16:22:19,741 WARN  conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1653313730/.staging/job_local1653313730_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-10-28 16:22:19,797 WARN  conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1653313730_0001/job_local1653313730_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2016-10-28 16:22:19,799 WARN  conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1653313730_0001/job_local1653313730_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2016-10-28 16:22:19,807 WARN  output.FileOutputCommitter - Output Path is null in setupJob()
2016-10-28 16:22:25,113 INFO  indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-10-28 16:22:25,188 INFO  solr.SolrMappingReader - source: content dest: content
2016-10-28 16:22:25,188 INFO  solr.SolrMappingReader - source: title dest: title
2016-10-28 16:22:25,188 INFO  solr.SolrMappingReader - source: host dest: host
2016-10-28 16:22:25,188 INFO  solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:22:25,188 INFO  solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:22:25,188 INFO  solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:22:25,188 INFO  solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:22:25,191 INFO  solr.SolrIndexWriter - SolrIndexer: deleting 6/6 documents
2016-10-28 16:22:25,300 WARN  output.FileOutputCommitter - Output Path is null in cleanupJob()
2016-10-28 16:22:25,301 WARN  mapred.LocalJobRunner - job_local1653313730_0001
java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.IllegalStateException: Connection pool shut down
    at org.apache.http.util.Asserts.check(Asserts.java:34)
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
    at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
    at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
    at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
    at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
    at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-10-28 16:22:25,841 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
    at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:172)
    at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:195)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:206)

As you can see in the bin/crawl command above, I had Nutch crawl for 2 rounds. The thing is, the error above only occurs on the second round (one level deeper into the seed site). So indexing works successfully on the first round, but after the crawl and parse of the second round, it spits out the error and stops.

To try things a bit differently from the first run described above, I did the following on the second run:

  • Deleted the TestCrawl folder to start the crawl and index fresh
  • Ran: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 1 ==> note that I changed the number of rounds for Nutch to "1". This executed crawling and indexing successfully.
  • Then ran the same command again for the second round to crawl one level deeper: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 1 ==> this gives me the same error as the hadoop.log pasted above!

So, my Solr is NOT able to successfully index what Nutch crawled on the second round, one level deeper into the seed site.

Could the error be due to the size of the parsed content from the seed site? The seed site is a newspaper company's website, so I am sure the second round (one level deeper) would contain a huge amount of parsed data to index. If the issue is the parsed content size, how can I configure my Solr to fix the problem?

If the error is from something else, can someone please help me identify what it is and how to fix it?

Recommended answer

For those who experience something that I have experienced, I thought I would post the solution to the problem that I was having.

First of all, Apache Nutch 1.12 does not seem to support Apache Solr 6.X. If you check the Apache Nutch 1.12 release notes, they recently added support for Apache Solr 5.X to Nutch 1.12, and support for Solr 6.X is NOT included. So, instead of Solr 6.2.1, I decided to work with Solr 5.5.3. Thus, I installed Apache Solr 5.5.3 to work with Apache Nutch 1.12.
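For anyone repeating this, a rough setup sketch follows; the mirror URL and paths are illustrative and may need adjusting for your environment:

    # download and unpack Solr 5.5.3 (mirror URL is illustrative)
    wget http://archive.apache.org/dist/lucene/solr/5.5.3/solr-5.5.3.tgz
    tar xzf solr-5.5.3.tgz && cd solr-5.5.3
    # start Solr and create the core referenced in the crawl command
    bin/solr start
    bin/solr create -c TSolr
    # swap in Nutch's schema.xml per the tutorial, then restart
    cp $NUTCH_HOME/conf/schema.xml server/solr/TSolr/conf/
    rm server/solr/TSolr/conf/managed-schema
    bin/solr restart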

As Jorge Luis pointed out, Apache Nutch 1.12 has a bug that causes an error when it works with Apache Solr. They will fix the bug and release Nutch 1.13 at some point, but I don't know when that will be, so I decided to fix the bug myself.

The reason I got the error is that the close method in CleaningJob.java (in Nutch) is invoked first, and the commit method afterwards. The following exception is then thrown: java.lang.IllegalStateException: Connection pool shut down.
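You can reproduce this failure mode outside Nutch with a minimal SolrJ 5.x sketch (the core URL is illustrative): committing on a client whose HTTP connection pool has already been shut down throws exactly this exception.

    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class PoolShutDownRepro {
        public static void main(String[] args) throws Exception {
            // illustrative core URL; any reachable core would do
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/TSolr");
            client.close();   // shuts down the underlying HttpClient connection pool
            client.commit();  // throws java.lang.IllegalStateException: Connection pool shut down
        }
    }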

The fix is actually quite simple. To learn the solution, go here: https://github.com/apache/nutch/pull/156/commits/327e256bb72f0385563021995a9d0e96bb83c4f8

As you can see in the link above, you simply need to relocate the writers.close(); call.
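Schematically, the idea is that pending work must be flushed while the client is still open, and the writers closed last (this is a sketch of the ordering only, not the literal patch; see the commit above for the real diff):

    // Broken ordering (sketch): closing first tears down the connection pool,
    // so the commit issued during cleanup hits a dead pool.
    writers.close();
    // ... "Connection pool shut down"

    // Fixed ordering (sketch): flush first, close last.
    writers.commit();
    writers.close();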

By the way, in order to fix the error, you need the Nutch src package, NOT the binary package, because you won't be able to edit the CleaningJob.java file in the Nutch binary package. After the fix, run ant, and you are all set.
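The rebuild is roughly as follows (the download URL and paths are illustrative; ant runtime is the standard Nutch build target):

    # get the source release, not the binary
    wget http://archive.apache.org/dist/nutch/1.12/apache-nutch-1.12-src.tar.gz
    tar xzf apache-nutch-1.12-src.tar.gz && cd apache-nutch-1.12
    # edit src/java/org/apache/nutch/indexer/CleaningJob.java as in the commit above
    ant runtime
    # the patched build lands in runtime/local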

After the fix, I no longer get the error!

Hope this helps anyone who is facing the problem that I was facing.
