Nutch not crawling URLs except the one specified in seed.txt
Question
I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want any page to be crawled that doesn't have "abc-def" in the URL, I have put the following line in regex-urlfilter.txt:
+^https://www.mywebsite.com/abc-def/(.+)*$
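As a sanity check, the filter pattern can be tested outside Nutch. A minimal sketch in Python (Nutch's regex-urlfilter uses Java regular expressions, but this particular pattern behaves the same in both; the sample URLs are made up):

```python
import re

# The filter line minus Nutch's leading "+", which only marks it as an
# "accept" rule. Note the unescaped dots in the host name match any
# character; escaping them (www\.mywebsite\.com) would be stricter.
pattern = re.compile(r"^https://www.mywebsite.com/abc-def/(.+)*$")

# URLs under abc-def are accepted, including the seed URL itself...
print(bool(pattern.match("https://www.mywebsite.com/abc-def/")))
print(bool(pattern.match("https://www.mywebsite.com/abc-def/some-page")))
# ...while anything outside that path is rejected.
print(bool(pattern.match("https://www.mywebsite.com/other-page/")))
```

So the filter itself accepts the intended URLs, which suggests the problem lies elsewhere in the crawl pipeline.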
When I try to run the following crawl command:
**/bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3**
It crawls and indexes just the one URL from seed.txt, and in the 2nd iteration it just says:
Generator: starting at 2017-02-28 09:51:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
When I change regex-urlfilter.txt to allow everything (+.), it starts indexing every URL on https://www.mywebsite.com, which I certainly don't want.
If anyone happens to have had the same problem, please share how you got past it.
Answer
Got it working after trying multiple things over the last 2 days. Here is the solution:
Since the website I was crawling was very heavy, the http.content.limit property in nutch-default.xml was truncating pages to 65536 bytes (the default). The links I wanted to crawl unfortunately didn't fall within the retained part, so Nutch wasn't crawling them. When I lifted the limit by putting the following property in nutch-site.xml, it started crawling my pages:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
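The effect is easy to reproduce outside Nutch. A minimal Python sketch (the toy page, the byte limit, and the extract_links helper are made up for illustration; real Nutch parses HTML properly, this only mimics truncation happening before link extraction):

```python
import re

def extract_links(html: str, limit: int) -> list:
    """Mimic http.content.limit: parse links only from the first
    `limit` bytes of the page; limit=-1 means no truncation."""
    if limit >= 0:
        html = html[:limit]
    return re.findall(r'href="([^"]+)"', html)

# A toy page where padding pushes an interesting link past a small limit.
page = '<a href="/top">top</a>' + 'x' * 100 + '<a href="/abc-def/page">deep</a>'

print(extract_links(page, 50))   # the deep link is cut off and lost
print(extract_links(page, -1))   # with no limit, both links survive
```

With a truncated page the crawler never sees the deep link, so the generator has nothing new to fetch in the next round, which matches the "0 records selected for fetching" output above.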